[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
Forgot to add that Debian/Ubuntu packages are pretty much whatever version was 
stable at the time of the Debian/Ubuntu .0 release. They’ll backport security 
fixes to those older versions as needed, but they never change versions unless 
absolutely required.

The backports repositories may have looser rules, but not the core 
main/contrib/non-free repositories.

From: Renfro, Michael 
Date: Wednesday, May 15, 2024 at 10:19 AM
To: Jeffrey Layton , Lloyd Brown 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Re: Location of Slurm source packages?
Debian/Ubuntu sources can always be found in at least two ways:


  1.  Pages like https://packages.ubuntu.com/jammy/slurm-wlm (see the .dsc, 
.orig.tar.gz, and .debian.tar.xz links there).
  2.  Commands like ‘apt-get source slurm-wlm’ (may require ‘dpkg-dev’ or other 
packages – probably easiest to install the ‘build-essential’ meta-package).
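
For example, fetching the jammy slurm-wlm source might look like this (a minimal 
sketch, assuming Ubuntu 22.04 with the stock sources.list and the deb-src lines 
still commented out):

  # enable the source repositories, then refresh the package lists
  sudo sed -i 's/^# deb-src/deb-src/' /etc/apt/sources.list
  sudo apt-get update

  # dpkg-dev (pulled in by build-essential) provides dpkg-source for unpacking
  sudo apt-get install -y build-essential
  apt-get source slurm-wlm    # unpacks into ./slurm-wlm-<version>/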

From: Jeffrey Layton via slurm-users 
Date: Wednesday, May 15, 2024 at 10:01 AM
To: Lloyd Brown 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: Location of Slurm source packages?

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Lloyd,

Good to hear from you! I was hoping to avoid the use of git but that may be the 
only way. The version is 21.08.5. I checked the "old" packages from SchedMD and 
they begin part way through 2024 so that won't work.

I'm very surprised Ubuntu let a package through without a source package for 
it. I'm hoping I'm not missing the forest for the trees in finding that package.

Thanks for the help!

Jeff


On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

Jeff,

I'm not sure what version is in the Ubuntu packages, as I don't think they're 
provided by SchedMD, and I'm having trouble finding the right one on 
packages.ubuntu.com.  Having said that, SchedMD is 
pretty good about using tags in their github repo 
(https://github.com/schedmd/slurm), to represent the releases.  For example, 
the "slurm-23-11-6-1" tag corresponds to release 23.11.6.  It's pretty 
straightforward to clone the repo, and do something like "git checkout -b 
MY_LOCAL_BRANCH_NAME TAG_NAME" to get the version you're after.
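
For 21.08.5 that would look something like this (a minimal sketch; the tag name 
below assumes SchedMD's usual slurm-MM-mm-pp-1 pattern, so confirm it with 
"git tag" first):

  git clone https://github.com/schedmd/slurm.git
  cd slurm
  git tag --list 'slurm-21-08-5*'     # confirm the exact tag name
  git checkout -b my-21.08.5 slurm-21-08-5-1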

Lloyd


--

Lloyd Brown

HPC Systems Administrator

Office of Research Computing

Brigham Young University

http://rc.byu.edu
On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:
Good morning,

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu packages. 
I now want to install pyxis but it says I need the Slurm sources. In Ubuntu 
22.04, is there a package that has the source code? Or how do I download the 
sources I need from GitHub?

Thanks!

Jeff


--
slurm-users mailing list -- 
slurm-users@lists.schedmd.com
To unsubscribe send an email to 
slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
Debian/Ubuntu sources can always be found in at least two ways:


  1.  Pages like https://packages.ubuntu.com/jammy/slurm-wlm (see the .dsc, 
.orig.tar.gz, and .debian.tar.xz links there).
  2.  Commands like ‘apt-get source slurm-wlm’ (may require ‘dpkg-dev’ or other 
packages – probably easiest to install the ‘build-essential’ meta-package).

From: Jeffrey Layton via slurm-users 
Date: Wednesday, May 15, 2024 at 10:01 AM
To: Lloyd Brown 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: Location of Slurm source packages?

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Lloyd,

Good to hear from you! I was hoping to avoid the use of git but that may be the 
only way. The version is 21.08.5. I checked the "old" packages from SchedMD and 
they begin part way through 2024 so that won't work.

I'm very surprised Ubuntu let a package through without a source package for 
it. I'm hoping I'm not missing the forest for the trees in finding that package.

Thanks for the help!

Jeff


On Wed, May 15, 2024 at 10:54 AM Lloyd Brown via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

Jeff,

I'm not sure what version is in the Ubuntu packages, as I don't think they're 
provided by SchedMD, and I'm having trouble finding the right one on 
packages.ubuntu.com.  Having said that, SchedMD is 
pretty good about using tags in their github repo 
(https://github.com/schedmd/slurm), to represent the releases.  For example, 
the "slurm-23-11-6-1" tag corresponds to release 23.11.6.  It's pretty 
straightforward to clone the repo, and do something like "git checkout -b 
MY_LOCAL_BRANCH_NAME TAG_NAME" to get the version you're after.

Lloyd


--

Lloyd Brown

HPC Systems Administrator

Office of Research Computing

Brigham Young University

http://rc.byu.edu
On 5/15/24 08:35, Jeffrey Layton via slurm-users wrote:
Good morning,

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu packages. 
I now want to install pyxis but it says I need the Slurm sources. In Ubuntu 
22.04, is there a package that has the source code? Or how do I download the 
sources I need from GitHub?

Thanks!

Jeff


--
slurm-users mailing list -- 
slurm-users@lists.schedmd.com
To unsubscribe send an email to 
slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
Yep, from your scontrol show node output:

CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M

The running job (77) has allocated 1 CPU and all the memory on the node. That’s 
probably due to the partition using the default DefMemPerCPU value [1], which 
is unlimited.

Since all our nodes are shared, and our workloads vary widely, we set our 
DefMemPerCPU value to something considerably lower than 
mem_in_node/cores_in_node . That way, most jobs will leave some memory 
available by default, and other jobs can use that extra memory as long as CPUs 
are available.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU
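
As an illustration, the setting goes in slurm.conf either globally or per 
partition (a minimal sketch; the node/partition names and the 4000 MB figure are 
made-up examples, not our actual values):

  # default each job to 4000 MB per allocated CPU unless it requests more
  DefMemPerCPU=4000

  # or only for one partition
  PartitionName=mainpart Nodes=cusco Default=YES DefMemPerCPU=4000 State=UP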

From: Alison Peterson 
Date: Thursday, April 4, 2024 at 11:58 AM
To: Renfro, Michael 
Subject: Re: [EXT] Re: [slurm-users] SLURM configuration help

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Here is the info:
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show node cusco

NodeName=cusco Arch=x86_64 CoresPerSocket=32
   CPUAlloc=1 CPUTot=64 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:4
   NodeAddr=cusco NodeHostName=cusco Version=19.05.5
   OS=Linux 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
   RealMemory=2052077 AllocMem=2052077 FreeMem=1995947 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=mainpart
   BootTime=2024-03-01T17:06:26 SlurmdStartTime=2024-03-01T17:06:53
   CfgTRES=cpu=64,mem=2052077M,billing=64
   AllocTRES=cpu=1,mem=2052077M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ squeue

 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
78  mainpart CF1090_w  sma PD   0:00  1 (Resources)
77  mainpart CF_w  sma  R   0:26  1 cusco
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show job 78

JobId=78 JobName=CF1090_wOcean500m.shell
   UserId=sma(1008) GroupId=myfault(1001) MCS_label=N/A
   Priority=4294901720 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2024-04-04T09:55:34 EligibleTime=2024-04-04T09:55:34
   AccrueTime=2024-04-04T09:55:34
   StartTime=2024-04-04T10:55:28 EndTime=2024-04-04T11:55:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-04T09:55:58
   Partition=mainpart AllocNode:Sid=newcusco:2450574
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cusco
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/work/sma-scratch/tohoku_wOcean/CF1090_wOcean500m.shell
   WorkDir=/data/work/sma-scratch/tohoku_wOcean
   StdErr=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
   StdIn=/dev/null
   StdOut=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
   Power=

On Thu, Apr 4, 2024 at 8:57 AM Renfro, Michael 
<ren...@tntech.edu> wrote:
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” 
show?

On one job we currently have that’s pending due to Resources, that job has 
requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the 
node it wants to run on only has 37 CPUs available (seen by comparing its 
CfgTRES= and AllocTRES= values).

From: Alison Peterson via slurm-users <slurm-users@lists.schedmd.com>
Date: Thursday, April 4, 2024 at 10:43 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] SLURM configuration help

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


I am writing to seek assistance with a critical issue on our single-node system 
managed by Slurm. Our jobs are queued and marked as awaiting resources, but they 
are not starting despite resources appearing to be available. I'm new to Slurm; 
my only experience was a class on installing it, so I have no experience running 
or using it.

Issue Summary:

Main Problem: Of the jobs submitted, only one runs and the second shows 
NODELIST(REASON) as (Resources). I've checked that our single node has enough 
RAM (2 TB) and CPUs (64) available.

# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 
RealMemory=2052077 Gres

[slurm-users] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” 
show?

On one job we currently have that’s pending due to Resources, that job has 
requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the 
node it wants to run on only has 37 CPUs available (seen by comparing its 
CfgTRES= and AllocTRES= values).

From: Alison Peterson via slurm-users 
Date: Thursday, April 4, 2024 at 10:43 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] SLURM configuration help

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


I am writing to seek assistance with a critical issue on our single-node system 
managed by Slurm. Our jobs are queued and marked as awaiting resources, but they 
are not starting despite resources appearing to be available. I'm new to Slurm; 
my only experience was a class on installing it, so I have no experience running 
or using it.

Issue Summary:

Main Problem: Of the jobs submitted, only one runs and the second shows 
NODELIST(REASON) as (Resources). I've checked that our single node has enough 
RAM (2 TB) and CPUs (64) available.

# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 
RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00 
MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force


System Details: We have a single-node setup with Slurm as the workload manager. 
The node appears to have sufficient resources for the queued jobs.
Troubleshooting Performed:
Configuration Checks: I have verified all Slurm configurations and the system's 
resource availability, which should not be limiting job execution.
Service Status: The Slurm daemon slurmdbd is active and running without any 
reported issues. System resource monitoring shows no shortages that would 
prevent job initiation.

Any guidance and help will be deeply appreciated!

--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu
O: 619-594-3364
San Diego State University | SDSU.edu
5500 Campanile Drive | San Diego, CA 92182-8080


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: SLURM configuration for LDAP users

2024-02-04 Thread Renfro, Michael via slurm-users
“An LDAP user can login to the login, slurmctld and compute nodes, but when 
they try to submit jobs, slurmctld logs an error about invalid account or 
partition for user.”

Since I don’t think it was mentioned below, does a non-LDAP user get the same 
error, or does it work by default?

We don’t use LDAP explicitly, but we’ve used sssd with Slurm and Active 
Directory for 6.5 years without issue. We’ve always added users to sacctmgr so 
that we could track usage by research group or class, so we never used a 
default account for all users.
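
For reference, adding a directory-managed user to the accounting database looks 
something like this (a minimal sketch; the account and user names are made-up 
examples):

  # create an account once per research group or class
  sacctmgr add account researchgroup Description="Example group" Organization=example

  # associate the LDAP/sssd user with it and make it their default account
  sacctmgr add user someuser Account=researchgroup DefaultAccount=researchgroup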

From: Richard Chang via slurm-users 
Date: Saturday, February 3, 2024 at 11:41 PM
To: slurm-us...@schedmd.com 
Subject: [slurm-users] SLURM configuration for LDAP users

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



Hi,

I am a little new to this, so please pardon my ignorance.

I have configured slurm in my cluster and it works fine with local users. But I 
am not able to get it working with LDAP/SSSD authentication.

User logins using ssh are working fine. An LDAP user can login to the login, 
slurmctld and compute nodes, but when they try to submit jobs, slurmctld logs 
an error about invalid account or partition for user.

Someone said we need to add the user manually into the database using the 
sacctmgr command. But I am not sure we need to do this for each and every LDAP 
user. Yes, it does work if we add the LDAP user manually using sacctmgr. But I 
am not convinced this manual way is the way to do it.

The documentation is not very clear about using LDAP accounts.

I saw somewhere on the list about using UsePAM=1 and copying or creating a 
symlink for the Slurm PAM module under /etc/pam.d, but it didn't work for me.

I saw somewhere else that we need to specify LaunchParameters=enable_nss_slurm 
in the slurm.conf file and add the slurm keyword to the passwd/group entries in 
/etc/nsswitch.conf. I did these, but they didn't help either.

I am bereft of ideas at present. If anyone has real world experience and can 
advise, I will be grateful.

Thank you,

Richard

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] Slurp for sw builds

2024-01-03 Thread Renfro, Michael
You can attack this in a few different stages. A lot of what you’re interested 
in will be found at various university or national lab sites (I Googled “sbatch 
example” for the one below)


  1.  If you’re good with doing a “make -j” to parallelize a make compilation 
over multiple CPUs in a single computer to build one executable, you can adapt 
the first example from the Multi-Threaded SMP Job  section of [1], using the 
same number of cpus-per-task as you use for your -j flag.
  2.  Once that’s going, you can use a script similar to the one at [2] to 
submit an array of concurrent build jobs, and they can spread out across 
multiple computers as needed. The built-in job array support will only change 
an environment variable SLURM_ARRAY_TASK_ID for each job in the array, but you 
can use that value to select a folder to cd into, or to do other things (see the 
sketch after the reference list below).
  3.  For a higher-level abstraction that will submit a bunch of build jobs, 
wait for them all to finish, and then archive the resulting artifacts, a 
pipeline tool like Snakemake [3] can track the individual tasks, and work with 
batch jobs or with local programs on the submission host. Two Slurm-related 
profiles for Snakemake are at [4] (simpler) and [5] (more comprehensive).

[1] https://help.rc.ufl.edu/doc/Multi-Threaded_%26_Message_Passing_Job_Scripts
[2] https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts#Array_job
[3] https://snakemake.readthedocs.io/en/stable/
[4] https://github.com/jdblischak/smk-simple-slurm
[5] https://github.com/Snakemake-Profiles/slurm
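
To make items 1 and 2 concrete, a minimal array-job sketch could look like this 
(the directory list, array size, and "make -j" invocation are made-up examples, 
not a tested build script):

  #!/bin/bash
  #SBATCH --job-name=sw-build
  #SBATCH --array=0-9             # one array task per image to build
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8       # matches the -j value below

  # SLURM_ARRAY_TASK_ID picks which image directory this task builds
  DIRS=(image00 image01 image02 image03 image04 image05 image06 image07 image08 image09)
  cd "${DIRS[$SLURM_ARRAY_TASK_ID]}" || exit 1
  make -j "$SLURM_CPUS_PER_TASK"

A wrapper could submit that with "sbatch --wait" (or poll squeue) and run the 
packaging step once every array task has finished, before moving up to a 
pipeline tool like item 3.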

From: slurm-users  on behalf of Duane 
Ellis 
Date: Wednesday, January 3, 2024 at 2:41 PM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Slurp for sw builds
External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



In my case I would like to use a slurm cluster for a sw ci/cd like solution for 
building sw images

My current scripted full system build takes 3-5 hours and is done serially. We 
could easily find places where we can choose to build things in parallel, hence 
the idea is to spawn parallel builds on the Linux Slurm cluster.

Example: we have a list of images we iterate over in a for loop to build. For 
each one, the steps are: cd somedir, then type make or run a shell script in 
that directory.

The last step after the for loop would be wait for all of the child builds to 
complete

Once all child jobs are done we have a single job that combines or packages all 
the intermediate images

We really want to use Slurm because our FPGA team will have a giant Slurm Linux 
cluster for Xilinx FPGA builds, and those nodes can do what we need for SW 
purposes (reusing the existing cluster is a huge win for us).

My question is this:

Can somebody point me to some sw build examples for or using slurm? All I can 
seem to find is how to install

I see the srun and sbatch command man pages but no good examples

Bonus would be something that integrates into a gitlab runner example or 
Jenkins in some way

All I can seem to find is how to install and administer Slurm, not how to use 
Slurm.


Sent from my iPhone


Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Renfro, Michael
Is this Northwestern’s Quest HPC or another one? I know at least a few of the 
people involved with Quest, and I wouldn’t have thought they’d be in dire need 
of coaching.

And to follow on with Davide’s point, this really sounds like a case for 
submitting multiple jobs with dependencies between them, as per [1, 2, 3].

[1] https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1795
[2] 
https://bioinformaticsworkbook.org/Appendix/HPC/SLURM/submitting-dependency-jobs-using-slurm.html#gsc.tab=0
[3] https://slurm.schedmd.com/sbatch.html#OPT_dependency
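
In shell terms, chaining dependent jobs looks roughly like this (a minimal 
sketch; the script names are made-up):

  # submit the first stage and capture its job ID
  jid1=$(sbatch --parsable stage1.sh)

  # the second stage starts only after the first completes successfully
  sbatch --dependency=afterok:${jid1} stage2.sh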

From: slurm-users  on behalf of Laurence 
Marks 
Date: Wednesday, December 20, 2023 at 1:40 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Reproducible irreproducible problem (timeout?)

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


It is a University "supercomputer", not a national facility. Hence they are not 
that expert, which is why I am asking here. I am pretty certain that it is some 
form of communication issue, but beyond that it is not clear.

If I get suggestions such as "why don't they look for ABC in XYZ" then I may 
persuade them to look at specifics. They will need the coaching, alas.

On Wed, Dec 20, 2023 at 1:25 PM Gerhard Strangar 
<g...@arcor.de> wrote:
Laurence Marks wrote:

> After some (irreproducible) time, often one of the three slow tasks hangs.
> A symptom is that if I try and ssh into the main node of the subtask (which
> is running 128 mpi on the 4 nodes) I get "Authentication failed".

How about asking an admin to check why it hangs?


--
Emeritus Professor Laurence Marks (Laurie)
Northwestern University
Webpage and Google Scholar 
link
"Research is to see what everybody else has seen, and to think what nobody else 
has thought", Albert Szent-Györgyi


Re: [slurm-users] Guidance on which HPC to try our "OpenHPC or TrintyX " for novice

2023-10-03 Thread Renfro, Michael
I’d probably default to OpenHPC just for the community around it, but I’ll also 
note that TrinityX might not have had any commits in their GitHub for an 
18-month period (unless I’m reading something wrong).

On Oct 3, 2023, at 5:51 AM, John Joseph  wrote:



External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Dear All,
Good afternoon
I would like to install, study, and administer HPC; as a first step I am 
planning to install one HPC stack. When I check the docs I can see OpenHPC and 
TrinityX; both of them have Slurm built in.

I would like advice on which one would be better for me (I have knowledge of the 
Linux command line and administration). Which will be easier for me to install, 
OpenHPC or TrinityX?

Your guidance would help me to choose my path and much appreciated
thanks
Joseph John



Re: [slurm-users] extended list of nodes allocated to a job

2023-08-17 Thread Renfro, Michael
Given a job ID:

scontrol show hostnames $(scontrol show job some_job_id | grep ' NodeList=' | 
cut -d= -f2) | paste -sd,

Maybe there’s something more built-in than this, but it gets the job done.

From: slurm-users  on behalf of Alain O' 
Miniussi 
Date: Thursday, August 17, 2023 at 7:46 AM
To: Slurm User Community List 
Subject: [slurm-users] extended list of nodes allocated to a job
External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



Hi,

I'm looking for a way to get the list of nodes where a given job is running in 
an uncompressed way.
That is, I'd like to have node1,node2,node3 instead of node1-node3.
Is there a way to achieve that?
I need the information outside the script.

Thanks


Alain Miniussi
DSI, Pôles Calcul et Genie Log.
Observatoire de la Côte d'Azur
Tél. : +33609650665


Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Renfro, Michael
If there’s a fairshare component to job priorities, and there’s a share 
assigned to each user under the account, wouldn’t the light user’s jobs move 
ahead of any of the heavy user’s pending jobs automatically?

From: slurm-users  on behalf of "Groner, 
Rob" 
Reply-To: Slurm User Community List 
Date: Wednesday, May 17, 2023 at 1:09 PM
To: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] On the ability of coordinators


External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Ya, I found they had the power to hold jobs just by experimentation.  Maybe it 
will turn out I had something misconfigured and coordinators don't have that 
ability either.  I hope that's not the case, since being able to hold jobs in 
their account gives them some usefulness.

My interest in this was solely focused on what coordinators could do to jobs 
within their account.  So, I accepted as ok that a coordinator couldn't move 
jobs in their account to a higher priority than jobs in other accounts.  I just 
wanted the coordinator to be able to move jobs in their account to a higher 
priority over other jobs within the same account.  Being able to use 
hold/release seems like what we're looking for.  I just wonder why coordinators 
can't use "top" as well, for jobs within their coordinated account.  I guess 
"top" is meant to move them to the top of the entire pending queue, and in my 
case, I was only interested in the coordinator moving certain jobs in their 
accounts to the top of the account-related queue.  But of course, there ISN'T 
an account-related queue, so maybe that's why top doesn't work for a 
coordinator.  I think I just answered my own question.
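
For anyone trying the same thing, the hold/release step from the coordinator's 
side looks roughly like this (a minimal sketch; the user and account names are 
made-up, and it assumes the coordinator's privileges cover scontrol hold for 
those jobs):

  # hold all of user A's pending jobs in the coordinated account
  squeue -u userA -A researchgroup -t PENDING -h -o %i | xargs -r -n1 scontrol hold

  # after user B's jobs have run, release them again
  squeue -u userA -A researchgroup -t PENDING -h -o %i | xargs -r -n1 scontrol release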


From: slurm-users  on behalf of Brian 
Andrus 
Sent: Wednesday, May 17, 2023 2:00 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] On the ability of coordinators


Coordinator permissions from the man pages:

coordinator
A special privileged user, usually an account manager, that can add users or 
sub-accounts to the account they are coordinator over. This should be a trusted 
person since they can change limits on account and user associations, as well 
as cancel, requeue or reassign accounts of jobs inside their realm.

So, I read that as it manages accounts in slurmdb with minimal access to the 
jobs themselves. So you would be stuck with cancel/requeue. I see no mention of 
hold, but if that is one of the permissions, I would say, yes, our approach 
does what you want within the limits of what the default permissions of a 
coordinator can do.



Of course, that still may not work if there are other accounts/partitions/users 
with higher priority jobs than User B. Specifically if those jobs can use the 
same resources A's jobs are running on.



Brian Andrus


On 5/17/2023 10:49 AM, Groner, Rob wrote:
I'm not sure what you mean by "if they have the permissions".  I'm talking 
about someone who is specifically designated as "coordinator" of an account in 
slurm.  With that designation, and no other admin level changes, I'm not aware 
that they can directly change the priority of jobs associated with the account.

If you're talking about additional permissions or admin levels...we're not 
looking into that as an option.  We want to purely use the coordinator role to 
have them manipulate stuff.


From: slurm-users 

 on behalf of Brian Andrus 
Sent: Wednesday, May 17, 2023 12:58 PM
To: slurm-users@lists.schedmd.com 

Subject: Re: [slurm-users] On the ability of coordinators


If they have the permissions, you can just raise the priority of user B's jobs 
to be higher than whatever A's currently are. Then they will run next.

That will work if you are able to wait for some jobs to finish and you can 
'skip the line' for the priority jobs.

If you need to preempt running jobs, that would take a bit more effort to set 
up, but is an alternative.



Brian Andrus


On 5/17/2023 6:40 AM, Groner, Rob wrote:
I was asked to see if coordinators could do anything in this scenario:

  *   Within the account that they coordinated, User A submitted 1000s of jobs 
and left for the day.
  *   Within the same account, User B wanted to run a few jobs really quickly.  
Once submitted, his jobs were of course behind User A's jobs.
  *   The coordinator wanted to see the results of User B's runs.
Reading the docs and doing some experiments, here is what I determined:

  *   The coordinator could put a hold on all of User A's jobs in the pending 
queue.  This won't affect any jobs User A has that aren't tied to the 
coordinated account.
  *   With User A's jobs held, then User B's jobs would be next to run.
  *   

Re: [slurm-users] Allow regular users to make reservations

2022-08-08 Thread Renfro, Michael
Going in a completely different direction than you’d planned, but for the same 
goal, what about making a script (shell, Python, or otherwise) that could 
validate all the constraints and call the scontrol program if appropriate, and 
then run that script via “sudo” as one of the regular users?
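
A very rough sketch of that idea (everything here is a made-up example: the 
script path, the one-day limit, and the group name would all need local 
adjustment):

  #!/bin/bash
  # /usr/local/sbin/mkres.sh - validate a user's request, then create the reservation
  user="$SUDO_USER"
  nodes="$1"       # e.g. node[01-02]
  minutes="$2"     # requested duration in minutes

  # example constraint: no reservation longer than one day
  if [ "$minutes" -gt 1440 ]; then
      echo "Reservations are limited to 1440 minutes" >&2
      exit 1
  fi

  scontrol create reservation user="$user" nodes="$nodes" \
      starttime=now duration="$minutes"

paired with a sudoers rule along the lines of 
"%hpcusers ALL=(root) NOPASSWD: /usr/local/sbin/mkres.sh". A plugin avoids the 
sudo wrapper, but the script route needs no changes to Slurm itself.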

From: slurm-users  on behalf of Paolo 
Viviani 
Date: Monday, August 8, 2022 at 9:49 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Allow regular users to make reservations

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hello,
I’m planning to develop a plugin for SLURM that would allow regular users to 
create reservations respecting some specific constraints on time/resources 
requested.
Do you think it would be feasible to implement it as a plugin, or that would 
necessarily require modification of SLURM code? In case of a plugin, are there 
different entry point other than job submit?

Thanks in advance!
Paolo


Re: [slurm-users] Changing a user's default account

2022-08-05 Thread Renfro, Michael
This should work:

sacctmgr add user someuser account=newaccount # adds user to new account

sacctmgr modify user where user=someuser set defaultaccount=newaccount # change 
default

sacctmgr remove user where user=someuser and account=oldaccount # remove from 
old account

From: slurm-users  on behalf of Chip 
Seraphine 
Date: Friday, August 5, 2022 at 9:56 AM
To: Slurm User Community List 
Subject: [slurm-users] Changing a user's default account
External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



I have a user U who is in association with account A, and I want to change that 
to account B.   The obvious thing does not work:

$ sacctmgr modify user where user=”U” set defaultaccount=”B”
Can't modify because these users aren't associated with new default account “B”…

OK, fair enough.But I can’t find a good way to meet this requirement!   
“sacctmgr create assoc” does not seem to be a thing.   Googling around I see a 
lot of wags deleting and recreating the user in this situation, which I 
definitely do _not_ want to do.

How does one change the account that a user is tied to?


--

Chip Seraphine
Linux Admin (Grid)

This e-mail and any attachments may contain information that is confidential 
and proprietary and otherwise protected from disclosure. If you are not the 
intended recipient of this e-mail, do not read, duplicate or redistribute it by 
any means. Please immediately delete it and any attachments and notify the 
sender that you have received it by mistake. Unintended recipients are 
prohibited from taking action on the basis of information in this e-mail or any 
attachments. The DRW Companies make no representations that this e-mail or any 
attachments are free of computer viruses or other defects.


Re: [slurm-users] Sharing a GPU

2022-04-03 Thread Renfro, Michael
Someone else may see another option, but NVIDIA MIG seems like the 
straightforward option. That would require both a Slurm upgrade and the 
purchase of MIG-capable cards.

https://slurm.schedmd.com/gres.html#MIG_Management

Would be able to host 7 users per A100 card, IIRC.
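
For reference, the Slurm side of a MIG setup ends up looking roughly like this 
(a minimal sketch based on the gres.html page above; the node name and the 
1g.5gb profile are made-up examples, and the real profile names come from the 
nvidia-smi MIG output):

  # gres.conf on the GPU node: let Slurm discover MIG devices through NVML
  AutoDetect=nvml

  # slurm.conf: advertise seven 1g.5gb slices of one A100
  NodeName=gpu001 Gres=gpu:1g.5gb:7 ...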

On Apr 3, 2022, at 4:20 PM, Kamil Wilczek  wrote:

Hello!

I am an administrator of a GPU cluster (Slurm version 19.05.5).

Could someone help me a little bit and explain if a single
GPU can be shared between multiple users? My experience and
documentation tells me that it is not possible. But even after
some time Slurm is still a beast to me and I find myself
struggling :)

* I setup the cluster to assign GPUs on multi-GPU servers
 to different users using GRES. This works fine and several
 users can work on a multi-GPU machine (--gres=gpu:N/--gpu:N).

* But sometimes I have requests to allow a group of students
 to work simultaneously, interactively on a small partition,
 where there are more users than GPUs. So I thought that maybe
 MPS is a solution, but the docs say that MPS is a way
 to run multiple jobs of *the same* user on a single GPU.
 When another user is requesting a GPU by MPS, the job is enqueued
 and waiting for the first users' MPS server to finish.
 So, this is not a solution for a multi-user, simultaneous/parallel
 environment, right?

Is there a way to share a GPU between multiple users?
The requirement is, say:

* 16 users working interactively, simultaneously
* 4 GPUs partition

Kind Regards
--
Kamil Wilczek  [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]




Re: [slurm-users] Performance with hybrid setup

2022-03-13 Thread Renfro, Michael
Slurm supports an l3cache_as_socket [1] parameter in recent releases. That 
would make an Epyc system, for example, appear to have many more sockets than 
physically exist, and that should help ensure threads in a single task share a 
cache.

You’d want to run slurmd -C on a node with that setting enabled to generate the 
new NodeName parameters, and replace the old entries in the overall slurm.conf 
with the updates values.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_l3cache_as_socket
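
In slurm.conf that looks something like this (a minimal sketch; it assumes a 
Slurm release new enough to support the option, and the NodeName line is a 
placeholder to be replaced with the real "slurmd -C" output):

  # treat each L3 cache (CCX/CCD) as a socket for task placement
  SlurmdParameters=l3cache_as_socket

  # then regenerate the node definition on a node and paste it in, e.g.
  # NodeName=node001 Sockets=16 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=256000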

On Mar 13, 2022, at 1:43 PM, vicentesmith  wrote:



External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hello,
I'm performing some tests (CPU-only systems) in order to compare MPI versus 
hybrid setup. The system is running OpenMPIv4.1.2 so that a job submission 
reads:
  mpirun -np 48 foo.exe
or
  export OMP_NUM_THREADS=8
  mpirun -np 6 foo.exe
In our system, the latter runs slightly faster (about 5 to 10%) but any 
performance gain/loss will depend on the system & app.
In the same system and for the same app, the first SLURM script reads:
  #!/bin/bash
  #SBATCH --job-name=***
  #SBATCH --output=*
  #SBATCH --ntasks=48
  mpirun foo.exe
This script runs fine. Then, and for the hybrid job, the script reads:
  #!/bin/bash
  #SBATCH --job-name=***hybrid
  #SBATCH --output=***
  #SBATCH --ntasks=6
  #SBATCH --cpus-per-task=8
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  mpirun foo.exe
However, this runs much slower and it seems to slow down even more as it moves 
forward. Something is obviously not clicking correctly for the latter case. My 
only explanation is that the threads are not forked out correctly (by this I 
mean that the 8 threads are not assigned to the cores sharing the same L3). 
OpenMPI is supposed to choose the path of least resistance but I was wondering 
if I might need to recompile OpenMPI with some extra flags or modify the SLURM 
script somehow.
Thanks.



Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-22 Thread Renfro, Michael
For later reference, [1] should be the (current) authoritative source on data 
types for the job_desc values: some strings, some numbers, some booleans.

[1] 
https://github.com/SchedMD/slurm/blob/4c21239d420962246e1ac951eda90476283e7af0/src/plugins/job_submit/lua/job_submit_lua.c#L450

From: slurm-users  on behalf of 
Christopher Benjamin Coffey 
Date: Tuesday, February 22, 2022 at 11:02 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Can job submit plugin detect "--exclusive" ?
External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



Hi Greg,

Thank you! The key was to use integer boolean instead of true/false. It seems 
this is inconsistent for job_desc elements as some use true/false. Have a great 
one!

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



On 2/18/22, 9:09 PM, "slurm-users on behalf of Greg Wickham" 
 
wrote:

Hi Chris,

You mentioned “But trials using this do not seem to be fruitful so far.” . 
. why?

In our job_submit.lua there is:


if job_desc.shared == 0 then
  slurm.user_msg("exclusive access is not permitted with GPU jobs.")
  slurm.user_msg("Remove '--exclusive' from your job submission script")
  return ESLURM_NOT_SUPPORTED
end

and testing:

$ srun --exclusive --time 00:10:00 --gres gpu:1 --pty /bin/bash -i
srun: error: exclusive access is not permitted with GPU jobs.
srun: error: Remove '--exclusive' from your job submission script
srun: error: Unable to allocate resources: Requested operation is presently 
disabled

In slurm.h the job_descriptor struct has:

uint16_t shared;/* 2 if the job can only share nodes with 
other
 *   jobs owned by that user,
 * 1 if job can share nodes with other jobs,
 * 0 if job needs exclusive access to the 
node,
 * or NO_VAL to accept the system default.
 * SHARED_FORCE to eliminate user control. 
*/

If there’s a case where using “.shared” isn’t working please let us know.

   -Greg


From: slurm-users  on behalf of 
Christopher Benjamin Coffey 
Date: Saturday, 19 February 2022 at 3:17 am
To: slurm-users 
Subject: [EXTERNAL] [slurm-users] Can job submit plugin detect 
"--exclusive" ?

Hello!

The job_submit plugin doesn't appear to have a way to detect whether a user 
requested "--exclusive". Can someone confirm this? Going through the code: 
src/plugins/job_submit/lua/job_submit_lua.c I don't see anything related. 
Potentially "shared" could be possible
 in some way. But trials using this do not seem to be fruitful so far.

If a user requests --exclusive, I'd like to append "--exclude=" on 
to their job request to keep them off of certain nodes. For instance, we have 
our gpu nodes in a default partition with a high priority so that jobs don't 
land on them until last. And
 this is the same for our highmem nodes. Normally this works fine, but if 
someone asks for "--exclusive" this will land on these nodes quite often 
unfortunately.

Any ideas? Of course, I could take these nodes out of the partition, yet 
I'd like to see if something like this would be possible.

Thanks! :)

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167






Re: [slurm-users] Fairshare within a single Account (Project)

2022-02-01 Thread Renfro, Michael
At least from our experience, the default user share within an account is 1, so 
they'd all stay at the same share within that account. Except for the one 
faculty who wanted a much higher share than the students within their account, 
I've never had to modify shares for any users otherwise. So adding and removing 
users has been a non-issue.

From: slurm-users  on behalf of Tomislav 
Maric 
Date: Tuesday, February 1, 2022 at 1:40 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Fairshare within a single Account (Project)

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



Thanks for the help!

Is it possible to use FairTree (https://slurm.schedmd.com/fair_tree.html) to 
ensure that all users always have equal fairshare? On this account, we have 
users coming and going relatively often, and having fairshare automatically 
adjusted would simplify the administration.

Dr.-Ing. Tomislav Maric

Mathematical Modeling and Analysis

TU Darmstadt

Tel: +49 6151 16-21469

Alarich-Weiss-Straße 10

64287 Darmstadt

Office: L2|06 410
On 1/30/22 21:14, Renfro, Michael wrote:
You can. We use:

sacctmgr show assoc where account=researchgroup format=user,share

to see current fairshare within the account, and:

sacctmgr modify user where name=someuser account=researchgroup 
set fairshare=N

to modify a particular user's fairshare within the account.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Tomislav 
Maric <ma...@mma.tu-darmstadt.de>
Date: Sunday, January 30, 2022 at 5:32 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Fairshare within a single Account (Project)

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



Hello everyone,

We are a small research group that shares an account on the cluster and we 
thought we would be able to use usage reports to balance the CPUh from 
different users: we were wrong.

Is it possible to set up Fairshare (https://slurm.schedmd.com/fair_tree.html) 
within a single Account (Project)? I am reading the documentation, but as a 
non-admin, it is difficult for me to know what the configuration steps for 
something like this would be. If someone would outline the configuration steps 
in this case, it would save me a great deal of time...

Kind regards,

Tomislav Maric

--

Dr.-Ing. Tomislav Maric

Mathematical Modeling and Analysis

TU Darmstadt

Tel: +49 6151 16-21469

Alarich-Weiss-Straße 10

64287 Darmstadt

Office: L2|06 410


Re: [slurm-users] Fairshare within a single Account (Project)

2022-01-30 Thread Renfro, Michael
You can. We use:

sacctmgr show assoc where account=researchgroup format=user,share

to see current fairshare within the account, and:

sacctmgr modify user where name=someuser account=researchgroup 
set fairshare=N

to modify a particular user's fairshare within the account.

From: slurm-users  on behalf of Tomislav 
Maric 
Date: Sunday, January 30, 2022 at 5:32 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Fairshare within a single Account (Project)

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



Hello everyone,

We are a small research group that shares an account on the cluster and we 
thought we would be able to use usage reports to balance the CPUh from 
different users: we were wrong.

Is it possible to set up 
Fairshare
 within a single Account (Project)? I am reading the documentation, but as a 
non-admin, it is difficult for me to know what the configuration steps for 
something like this would be. If someone would outline the configuration steps 
in this case, it would save me a great deal of time...

Kind regards,

Tomislav Maric

--

Dr.-Ing. Tomislav Maric

Mathematical Modeling and Analysis

TU Darmstadt

Tel: +49 6151 16-21469

Alarich-Weiss-Straße 10

64287 Darmstadt

Office: L2|06 410


Re: [slurm-users] how to allocate high priority to low cpu and memory jobs

2022-01-25 Thread Renfro, Michael
Since there's only 9 factors to assign priority weights to, one way around this 
might be to set up separate partitions for high memory and low memory jobs 
(with a max memory allowed for the low memory partition), and then use 
partition weights to separate those jobs out.
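
Roughly, that could look like this in slurm.conf (a minimal sketch; the 
partition names, node list, memory cap, and weights are made-up examples):

  # small jobs: capped memory, higher partition priority
  PartitionName=small Nodes=node[01-10] MaxMemPerNode=64000 PriorityJobFactor=10 State=UP
  # big jobs: no cap, lower partition priority
  PartitionName=big   Nodes=node[01-10] PriorityJobFactor=1  State=UP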

From: slurm-users  on behalf of 
z1...@arcor.de 
Date: Tuesday, January 25, 2022 at 3:20 PM
To: slurm-users 
Subject: [slurm-users] how to allocate high priority to low cpu and memory jobs
External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



Dear all,

how can I reverse the priority, so that jobs with high cpu and memory
have a low priority?


With the Priority/Multifactor plugin it is possible to calculate a high
priority for high-CPU and high-memory jobs.

With PriorityFavorSmall, jobs with a lower cpu number have a high
priority, but this only works for cpu, not memory.


Thanks,

Mike


Re: [slurm-users] Questions about default_queue_depth

2022-01-12 Thread Renfro, Michael
Not answering every question below, but for (1) we're at 200 on a cluster with 
a few dozen nodes and around 1k cores, as per 
https://lists.schedmd.com/pipermail/slurm-users/2021-June/007463.html -- there 
may be other settings in that email that could be beneficial. We had a lot of 
idle resources that could have been backfilled with short, lower-priority jobs, 
and this basically resolved it.
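
In slurm.conf that setting would look something like this (a minimal sketch 
showing only the one option; the other SchedulerParameters from the linked 
message are omitted):

  SchedulerParameters=default_queue_depth=200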

For (3), I think https://slurm.schedmd.com/sprio.html would be my first stop.

For (4), as far as I know, that's a setting for all partitions.

From: slurm-users  on behalf of David 
Henkemeyer 
Date: Wednesday, January 12, 2022 at 11:27 AM
To: Slurm User Community List 
Subject: [slurm-users] Questions about default_queue_depth

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hello,

A few weeks ago, we tested Slurm against about 50K jobs, and observed at least 
one instance where a node went idle, while there were jobs on the queue that 
could have run on the idle node.  The best guess as to why this occurred, at 
this point, is that the default_queue_depth was set to the default value of 
100, and that the queued jobs were likely not in the first 100 jobs in the 
queue.  Based on this, I have a few questions:
1) What is a reasonable value for default_queue_depth?  Would 1000 be ok, in 
terms of performance?
2) How can we better debug why queued jobs are not being selected?
3) Is there a way to see the order of the jobs in the queue?  Perhaps squeue 
lists the jobs in order?
4) If we had several partitions, would the default_queue_depth apply to all 
partitions?

Thank you
David


Re: [slurm-users] work with sensitive data

2021-12-17 Thread Renfro, Michael
Untested, but given a common service account with a GPG key pair, a user with a 
GPG key pair, and the EncFS encrypted with a password, the user could encrypt a 
password with their own private key and the service account's public key, and 
leave it alongside the EncFS.

If the service account is monitoring a common area for new files, it can grab 
the EncFS and the doubly-encrypted password, decrypt the password with its own 
private key and the user's public key, unlock the EncFS, and run the job.

Afterwards, the service account can re-lock the EncFS and let the user unlock 
it for viewing final results.
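
In command-line terms the hand-off might look something like this (a very rough 
sketch, untested as noted above; the key IDs, paths, and EncFS layout are all 
made-up examples):

  # user side: encrypt the EncFS password for the service account (and themselves)
  echo -n 'encfs-password' | gpg --encrypt --sign --armor \
      --recipient service@cluster --recipient user@site \
      > /shared/incoming/jobdata.pw.asc

  # service-account side: recover the password and mount the EncFS tree
  gpg --decrypt /shared/incoming/jobdata.pw.asc | \
      encfs --stdinpass /shared/incoming/jobdata.encfs /scratch/jobdata

  # ... run the job against /scratch/jobdata, then re-lock it
  fusermount -u /scratch/jobdata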

From: slurm-users  on behalf of Michał 
Kadlof 
Date: Friday, December 17, 2021 at 4:41 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] work with sensitive data

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.



On 15.12.2021 10:29, Hermann Schwärzler wrote:
We are currently looking into telling our users to use EncFS 
(https://en.wikipedia.org/wiki/EncFS)
 for this.

This looks good to me. However it looks like it still require interactive job 
to provide password manually. Would be great if anyone could point out how to 
decrypt it with "sbatch".

Do you know what happens with "decrypted" mount point after job run out of 
time, or is killed for other reason? Is it then unmounted automatically? Is it 
remain safe when left mounted permanently (for example on access node)?
--
best regards
Michał Kadlof


Re: [slurm-users] Reserving cores without immediately launching tasks on all of them

2021-11-26 Thread Renfro, Michael
Nodes are probably misconfigured in slurm.conf, yes. You can use the output of 
'slurmd -C' on a compute node to get started on what your NodeName entry in 
slurm.conf should be:


[root@node001 ~]# slurmd -C
NodeName=node001 CPUs=28 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 
ThreadsPerCore=1 RealMemory=64333
UpTime=161-22:35:13

[root@node001 ~]# grep -i 'nodename=node\[001' /etc/slurm/slurm.conf
NodeName=node[001-022]  CoresPerSocket=14 RealMemory=62000 Sockets=2 
ThreadsPerCore=1 Weight=10201


Make sure that RealMemory in slurm.conf is no larger than what 'slurmd -C' 
reports. If I recall correctly, my slurm.conf settings are otherwise 
equivalent, but not word-for-word identical, with what 'slurmd -C' reports (I 
just specified sockets instead of both boards and socketsperboard, for example).

From: slurm-users  on behalf of Mccall, 
Kurt E. (MSFC-EV41) 
Date: Friday, November 26, 2021 at 1:22 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Reserving cores without immediately launching tasks 
on all of them

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Mike,

I’m working through your suggestions.   I tried

$ salloc –ntasks=20 --cpus-per-task=24 --verbose myscript.bash

but salloc says that the resources are not available:

salloc: defined options
salloc:  
salloc: cpus-per-task   : 24
salloc: ntasks  : 20
salloc: verbose : 1
salloc:  
salloc: end of defined options
salloc: Linear node selection plugin loaded with argument 4
salloc: select/cons_res loaded with argument 4
salloc: Cray/Aries node selection plugin loaded
salloc: select/cons_tres loaded with argument 4
salloc: Granted job allocation 34299
srun: error: Unable to create step for job 34299: Requested node configuration 
is not available

$ scontrol show nodes  /* oddly says that there is one core per socket.  could 
our nodes be misconfigured? */

NodeName=n020 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=24 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=n020 NodeHostName=n020 Version=20.02.3
   OS=Linux 4.18.0-305.7.1.el8_4.x86_64 #1 SMP Mon Jun 14 17:25:42 EDT 2021
   RealMemory=1 AllocMem=0 FreeMem=126431 Sockets=24 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=normal,low,high
   BootTime=2021-11-18T08:43:44 SlurmdStartTime=2021-11-18T08:44:31
   CfgTRES=cpu=24,mem=1M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s



From: slurm-users  On Behalf Of Renfro, 
Michael
Sent: Friday, November 26, 2021 8:15 AM
To: Slurm User Community List 
Subject: [EXTERNAL] Re: [slurm-users] Reserving cores without immediately 
launching tasks on all of them

The end of the MPICH section at [1] shows an example using salloc [2].

Worst case, you should be able to use the output of “scontrol show hostnames” 
[3] and use that data to make mpiexec command parameters to run one rank per 
node, similar to what’s shown at the end of the synopsis section of [4].

[1] https://slurm.schedmd.com/mpi_guide.html#mpich2
[2] https://slurm.schedmd.com/salloc.html
[3] https://slurm.schedmd.com/scontrol.html
[4] https://www.mpich.org/static/docs/v3.1/www1/mpiexec.html

Re: [slurm-users] Reserving cores without immediately launching tasks on all of them

2021-11-26 Thread Renfro, Michael
The end of the MPICH section at [1] shows an example using salloc [2].

Worst case, you should be able to use the output of “scontrol show hostnames” 
[3] and use that data to make mpiexec command parameters to run one rank per 
node, similar to what’s shown at the end of the synopsis section of [4].

[1] https://slurm.schedmd.com/mpi_guide.html#mpich2
[2] https://slurm.schedmd.com/salloc.html
[3] https://slurm.schedmd.com/scontrol.html
[4] https://www.mpich.org/static/docs/v3.1/www1/mpiexec.html
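
For the worst-case approach, a minimal sketch inside the batch script might be 
(the executable name is a made-up placeholder, and -ppn assumes MPICH's Hydra 
mpiexec):

  # build a comma-separated host list from the Slurm allocation
  HOSTS=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | paste -sd,)

  # launch one manager rank per node; workers come later via MPI_Comm_spawn
  mpiexec -hosts "$HOSTS" -ppn 1 ./manager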

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Nov 25, 2021, at 12:45 PM, Mccall, Kurt E. (MSFC-EV41) 
 wrote:



External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


I want to launch an MPICH job with sbatch with one task per node (each a 
manager), while also reserving a certain number of cores on each node for the 
managers to fill up with spawned workers (via MPI_Comm_spawn).   I’d like to 
avoid using –exclusive.

I tried the arguments –ntasks=20 –cpus-per-task=24, but it appears that 20 * 24 
tasks will be launched.   Is there a way to reserve cores without immediately 
launching tasks on them?   Thanks for any help.

sbatch: defined options
sbatch:  
sbatch: cpus-per-task   : 24
sbatch: ignore-pbs  : set
sbatch: ntasks  : 20
sbatch: test-only   : set
sbatch: verbose : 1
sbatch:  
sbatch: end of defined options
sbatch: Linear node selection plugin loaded with argument 4
sbatch: select/cons_res loaded with argument 4
sbatch: Cray/Aries node selection plugin loaded
sbatch: select/cons_tres loaded with argument 4
sbatch: Job 34274 to start at 2021-11-25T12:15:05 using 480 processors on nodes 
n[001-020] in partition normal


Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Renfro, Michael
On a quick read, it did look correct.

From: slurm-users  on behalf of 
Ratnasamy, Fritz 
Date: Monday, September 27, 2021 at 1:59 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is 
not specified

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Does the script below look correct?
function slurm_job_submit(job_desc, part_list, submit_uid)

if job_desc.partition == 'gpu' then
 if  (job_desc.gres == nil) then
  slurm.log_info("User did not specified gres=gpu: 
")
  slurm.user_msg("You have to specify gres=gpu:x  
where x is number of GPUs.")
  return slurm.ERROR
 end
end
end
Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556


On Mon, Sep 27, 2021 at 1:40 PM Renfro, Michael 
<ren...@tntech.edu> wrote:
Might need a restart of slurmctld at most, I expect.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Ratnasamy, Fritz <fritz.ratnas...@chicagobooth.edu>
Date: Monday, September 27, 2021 at 12:32 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is 
not specified

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi Michael Renfro,

Thanks for your reply. Based on your answers, would this work:
1/ a function job_submit.lua with the following contents (just need a function 
that errored when gres:gpu is not specified in srun or in sbatch):
function slurm_job_submit(job_desc, part_list, submit_uid)

if job_desc.partition == 'gpu' then
 if  (job_desc.gres == nil) then
  slurm.log_info("User did not specified gres=gpu: 
")
  slurm.user_msg("You have to specify gres=gpu:x  
where x is number of GPUs.")
  return slurm.ERROR
 end
end
end


4/  I found the file job_submit_lua.so on our controller in /lib64/slurm/, and 
the lua libraries also seem to be installed:
 sudo rpm -qa | grep lua
lua-5.3.4-11.el8.x86_64
lua-libs-5.3.4-11.el8.x86_64
lua-devel-5.3.4-11.el8.x86_64

 so I guess for now I just need to create job_submit.lua and uncomment the job 
plugin in slurm.conf. Is there any Slurm service to restart after that?

Thanks again
Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556


On Sat, Sep 25, 2021 at 11:08 AM Renfro, Michael 
mailto:ren...@tntech.edu>> wrote:
If you haven't already seen it there's an example Lua script from SchedMD at 
[1], and I've got a copy of our local script at [2]. Otherwise, in the order 
you asked:


  1.  That seems reasonable, but our script just checks if there's a gres at 
all. I don't *think* any gres other than gres=gpu would let the job run, since 
our GPU nodes only have Gres=gpu:2 entries. Same thing for asking for more GPUs 
than are in the node: if someone asked for gres=gpu:3 or higher, the job would 
get blocked.

The above might be an annoyance to your users if their job just sits in the 
queue with no other notice, but it hasn't really been an issue here. The big 
benefit from your side would be that you could simplify the if statement down 
to something like 'if (job_desc.gres ~= nil)'.
  2.  yes, uncomment JobSubmitPlugins=lua
  3.  Far as I know, if you uncomment the JobSubmitPlugin line and have a 
job_submit.lua file in the same folder as your slurm.conf, the Lua script 
should get executed automatically.
  4.  Our RPM installations of Slurm contained the job_submit_lua.so, both for 
Bright 8 and for OpenHPC.

[1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
[2] https://gist.github.com/mikerenfro/df89fac5052a45cc2c1651b9a30978e0

Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Renfro, Michael
Might need a restart of slurmctld at most, I expect.
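
For reference, a rough deployment sketch for the lua plugin on the controller
host (the paths are the usual defaults and may differ on your install):

# put the script next to slurm.conf and enable the plugin
cp job_submit.lua /etc/slurm/job_submit.lua
sed -i 's/^#\s*JobSubmitPlugins=lua/JobSubmitPlugins=lua/' /etc/slurm/slurm.conf

# a slurmctld restart is the safe way to pick up the slurm.conf change
systemctl restart slurmctld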

From: slurm-users  on behalf of 
Ratnasamy, Fritz 
Date: Monday, September 27, 2021 at 12:32 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is 
not specified

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi Michael Renfro,

Thanks for your reply. Based on your answers, would this work:
1/ a function job_submit.lua with the following contents (just need a function 
that errored when gres:gpu is not specified in srun or in sbatch):
function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.partition == 'gpu' then
      if (job_desc.gres == nil) then
         slurm.log_info("User did not specify gres=gpu")
         slurm.user_msg("You have to specify gres=gpu:x, where x is the number of GPUs.")
         return slurm.ERROR
      end
   end
end


4/  I found the file job_submit_lua.so on our controller in /lib64/slurm/, and 
the lua libraries also seem to be installed:
 sudo rpm -qa | grep lua
lua-5.3.4-11.el8.x86_64
lua-libs-5.3.4-11.el8.x86_64
lua-devel-5.3.4-11.el8.x86_64

 so I guess for now I just need to create job_submit.lua and uncomment the job 
plugin in slurm.conf. Is there any Slurm service to restart after that?

Thanks again
Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556


On Sat, Sep 25, 2021 at 11:08 AM Renfro, Michael 
mailto:ren...@tntech.edu>> wrote:
If you haven't already seen it there's an example Lua script from SchedMD at 
[1], and I've got a copy of our local script at [2]. Otherwise, in the order 
you asked:


  1.  That seems reasonable, but our script just checks if there's a gres at 
all. I don't *think* any gres other than gres=gpu would let the job run, since 
our GPU nodes only have Gres=gpu:2 entries. Same thing for asking for more GPUs 
than are in the node: if someone asked for gres=gpu:3 or higher, the job would 
get blocked.

The above might be an annoyance to your users if their job just sits in the 
queue with no other notice, but it hasn't really been an issue here. The big 
benefit from your side would be that you could simplify the if statement down 
to something like 'if (job_desc.gres ~= nil)'.
  2.  yes, uncomment JobSubmitPlugins=lua
  3.  Far as I know, if you uncomment the JobSubmitPlugin line and have a 
job_submit.lua file in the same folder as your slurm.conf, the Lua script 
should get executed automatically.
  4.  Our RPM installations of Slurm contained the job_submit_lua.so, both for 
Bright 8 and for OpenHPC.

[1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
[2] https://gist.github.com/mikerenfro/df89fac5052a45cc2c1651b9a30978e0

From: slurm-users 
mailto:slurm-users-boun...@lists.schedmd.com>>
 on behalf of Ratnasamy, Fritz 
mailto:fritz.ratnas...@chicagobooth.edu>>
Date: Saturday, September 25, 2021 at 12:23 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>
Subject: [slurm-users] Block jobs on GPU partition when GPU is not specified

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi,

I would like to block jobs submitted in our GPU partition when gres=gpu:1 (or 
any number between 1 and 4) is not specified when submitting a job through 
sbatch or requesting an interactive session with srun.
Currently, /etc/slurm/slurm.conf has JobSubmitPlugins=lua commented.
The liblua.so is now installed.
I would like to use something similar as the example mentioned at the end of 
the page:
https://slurm.schedmd.com/resource_limits.html

Re: [slurm-users] Block jobs on GPU partition when GPU is not specified

2021-09-25 Thread Renfro, Michael
If you haven't already seen it there's an example Lua script from SchedMD at 
[1], and I've got a copy of our local script at [2]. Otherwise, in the order 
you asked:


  1.  That seems reasonable, but our script just checks if there's a gres at 
all. I don't *think* any gres other than gres=gpu would let the job run, since 
our GPU nodes only have Gres=gpu:2 entries. Same thing for asking for more GPUs 
than are in the node: if someone asked for gres=gpu:3 or higher, the job would 
get blocked.

The above might be an annoyance to your users if their job just sits in the 
queue with no other notice, but it hasn't really been an issue here. The big 
benefit from your side would be that you could simplify the if statement down 
to something like 'if (job_desc.gres ~= nil)'.

  2.  yes, uncomment JobSubmitPlugins=lua

  3.  Far as I know, if you uncomment the JobSubmitPlugin line and have a 
job_submit.lua file in the same folder as your slurm.conf, the Lua script 
should get executed automatically.

  4.  Our RPM installations of Slurm contained the job_submit_lua.so, both for 
Bright 8 and for OpenHPC.

[1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
[2] https://gist.github.com/mikerenfro/df89fac5052a45cc2c1651b9a30978e0
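
Once the plugin is active, a quick acceptance test from a login node could look
like the following (the partition name and GPU count are examples):

# should be rejected by the gres check in job_submit.lua
srun --partition=gpu --pty hostname

# should be accepted
srun --partition=gpu --gres=gpu:1 --pty hostname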

From: slurm-users  on behalf of 
Ratnasamy, Fritz 
Date: Saturday, September 25, 2021 at 12:23 AM
To: Slurm User Community List 
Subject: [slurm-users] Block jobs on GPU partition when GPU is not specified

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi,

I would like to block jobs submitted in our GPU partition when gres=gpu:1 (or 
any number between 1 and 4) is not specified when submitting a job through 
sbatch or requesting an interactive session with srun.
Currently, /etc/slurm/slurm.conf has JobSubmitPlugins=lua commented.
The liblua.so is now installed.
I would like to use something similar as the example mentioned at the end of 
the page:
https://slurm.schedmd.com/resource_limits.html
Can I use the following code:


function slurm_job_submit(job_desc, part_list, submit_uid)
   if (job_desc.gres ~= nil) then
      for g in job_desc.gres:gmatch("[^,]+") do
         bad = string.match(g,'^gpu[:]*[0-9]*$')
         if (bad ~= nil) then
            slurm.log_info("User specified gpu GRES without type: %s", bad)
            slurm.user_msg("You must always specify a type when requesting gpu GRES")
            return slurm.ERROR
         end
      end
   end
end
I do not need to check if the model is specified though. In that case,
1/ Should I change the line bad = string.match(g,'^gpu[:]*[0-9]*$') to 
string.match(g,'^gpu[:]*[0-9]')
2/ Do I need to uncomment JobSubmitPlugins=lua
3/ Where to specify the function call slurm_job_submit so I make sure the check 
to see if gres=gpu:1 is happening?
4/ I would need job_submit_lua.so, where can I find that library and if it is 
not there, how can i dowload it?

Thanks for your help. I am new to regular expressions, lua and Slurm so I 
apologize if my questions do not make sense.


Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556


Re: [slurm-users] Regarding job in pending state

2021-09-16 Thread Renfro, Michael
If you're not the cluster admin, you'll want to check with them, but that 
should be related to a limit in how many node-hours an association (a unique 
combination of user, cluster, partition, and account) can have in running or 
pending state. Further jobs would get blocked to allow others' jobs to run, and 
keep the limited association from monopolizing the cluster for extended periods.

https://slurm.schedmd.com/resource_limits.html
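
Two read-only commands that can help confirm which limit is in play (the user
name is a placeholder):

# show the association limits; GrpTRESRunMins is the setting behind the
# AssocGrp*RunMinutes family of pending reasons
sacctmgr show assoc where user=someuser \
    format=Cluster,Account,User,Partition,GrpTRESRunMins%40

# show the pending reason per job for that user
squeue -u someuser -o "%.12i %.10P %.8T %.30r"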

From: slurm-users  on behalf of pravin 
pawar 
Date: Thursday, September 16, 2021 at 6:11 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Regarding job in pending state

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Dear all,

In my cluster some of the user's jobs in pending state and showing the reason 
is AssocGrpNodeRunMinutes. Some users are run there the first job and still, 
their jobs are in pending state and reason is same.  Please help me to 
understand this.

Thanks


Re: [slurm-users] estimate queue time using 'sbatch --test-only'

2021-09-15 Thread Renfro, Michael
I can imagine at least the following causing differences in the estimated time 
and the actual start time:


  *   If running users have overestimated their job times, and their jobs 
finish earlier than expected, the original estimate will be high.
  *   If another user's job submission gets higher priority than yours while 
your job is still pending (because of scheduler policy including fairshare), 
your job can get pushed back, and the original estimate will be low.
  *   If the test-only scheduling code doesn't account for backfill, the 
original estimate could be high.

Haven't looked at the code to see if the test-only parameter goes through a 
complete scheduling cycle before returning the estimate, but I can guarantee 
that the first two items above happen all the time on my much simpler cluster 
here.
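
If it helps to quantify the drift, the scheduler's estimate for an
already-pending job can be sampled over time with squeue --start (the job ID,
interval, and log file name are placeholders):

# log the current start-time estimate and pending reason once a minute
while squeue -h -j 1234567 -t PENDING | grep -q .; do
    squeue -h -j 1234567 --start -o "%i %S %r" >> start_estimates.log
    sleep 60
done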

From: slurm-users  on behalf of Feng Li 

Date: Wednesday, September 15, 2021 at 3:14 PM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] estimate queue time using 'sbatch --test-only'
Hi and thanks for reading this!

I am trying to estimate the queue time of a job of a certain size and walltime 
limit. I am doing this because our project considers multiple HPC resources and 
needs estimated queue time information to decide where to actually submit the 
job.

From the man page of ‘sbatch’, I found that the “test-only” option can be used 
to “validate the batch script and return an estimate of when a job would be 
scheduled to run given the current job queue and all the other arguments 
specifying the job requirements”. This looks very promising to us.

I tried several launches in IU BigRed3 and TACC Stampede2 HPCs, the recorded 
results are shown below. (the last two columns are the estimated queue time and 
actual queue time). From the results, it looks like the estimated time is quite 
inaccurate (can be either over-estimated or under-estimated):

-start of output
site      | slurm version | partition  | JobID   | node | np | walltime_mins | timestamp_estimate | estimated_start | submit_time     | actual_start    | estimated_wait | actual_wait
stampede2 | 18.08.5-2     | skx-normal | 8436162 | 1    | 48 | 10            | 9/9/2021 16:05     | 9/11/2021 23:29 | 9/9/2021 16:08  | 9/9/2021 16:11  | 55:23:56       | 0:02:49
Stampede2 | 18.08.5-2     | skx-normal | 8436369 | 1    | 48 | 10            | 9/9/2021 16:51     | 9/12/2021 0:04  | 9/9/2021 16:51  | 9/9/2021 16:52  | 55:13:00       | 0:00:58
Stampede2 | 18.08.5-2     | normal     | 8436193 | 1    | 48 | 10            | 9/9/2021 16:17     | 9/9/2021 18:02  | 9/9/2021 16:19  | 9/9/2021 16:19  | 1:45:26        | 0:00:02
Stampede2 | 18.08.5-2     | normal     | 8436308 | 2    | 48 | 10            | 9/9/2021 16:40     | 9/9/2021 18:25  | 9/9/2021 16:41  | 9/9/2021 16:41  | 1:45:00        | 0:00:04
Bigred3   | 20.11.7       | general    | 1727144 | 1    | 24 | 10            | 9/9/2021 17:57     | 9/10/2021 12:39 | 9/9/2021 17:59  | 9/9/2021 17:59  | 18:42:00       | 0:00:00
Bigred3   | 20.11.7       | general    | 1734075 | 1    | 24 | 60            | 9/15/2021 14:54    | 9/15/2021 14:54 | 9/15/2021 14:54 | 9/15/2021 15:01 | 0:00:00        | 0:07:11
Bigred3   | 20.11.7       | general    | 1734079 | 1    | 24 | 20            | 9/15/2021 15:09    | 9/15/2021 15:09 | 9/15/2021 15:09 | 9/15/2021 15:09 | 0:00:00        | 0:00:01
Bigred3   | 20.11.7       | general    | 1734081 | 4    | 24 | 60            | 9/15/2021 15:11    | 9/15/2021 15:11 | 9/15/2021 15:11 | 9/15/2021 15:34 | 0:00:00        | 0:22:15
-end of output

Could you suggest better ways to estimate the queue time? Or are there any 
specific configurations/situations on those systems that might affect the 
queue time estimation (e.g. fair sharing and site-specific QoS settings)?

Below is an example of my measurement for your information:

-begin of example
lifen@elogin1(:):~$date && sbatch --test-only -n 24 -N 4 -p general -t 00:60:00 
--wrap "hostname"
Wed Sep 15 15:11:49 EDT 2021
sbatch: Job 1734080 to start at 2021-09-15T15:11:49 using 24 processors on 
nodes nid00[935-938] in partition general
lifen@elogin1(:):~$date && sbatch -n 24 -N 4 -p general -t 00:60:00 --wrap 
"hostname"
Wed Sep 15 15:11:58 EDT 2021
Submitted batch job 1734081
lifen@elogin1(:):~$sacct 
--format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
 -j 1734081
 UserJobIDJobName  Partition  State  Timelimit  
 Start EndElapsed MaxRSS  MaxVMSize   NNodes  NCPUS 
   NodeList
-  -- -- -- -- 
--- --- -- -- -- 
 -- ---
lifen 1734081wrapgeneral  COMPLETED   01:00:00 
2021-09-15T15:34:13 2021-09-15T15:34:13   00:00:00  
4 24 nid00[169,883,+
  1734081.bat+  batch COMPLETED
2021-09-15T15:34:13 2021-09-15T15:34:13   00:00:00  2136K226420K
1 18nid00169
  1734081.ext+ extern COMPLETED
2021-09-15T15:34:13 2021-09-15T15:34:13   00:00:00 4K 4K
4 24 nid00[169,883,+
-end of example

Thanks,
Feng Li


Re: [slurm-users] scancel gpu jobs when gpu is not requested

2021-08-26 Thread Renfro, Michael
Not a solution to your exact problem, but we document partitions for 
interactive, debug, and batch, and have a job_submit.lua [1] that routes 
GPU-reserving jobs to gpu-interactive, gpu-debug, and gpu partitions 
automatically. Since our GPU nodes have extra memory slots, and have tended to 
run at less than 100% CPU usage during GPU jobs, they also serve as our 
large-memory and small interactive job targets.

[1] https://gist.github.com/mikerenfro/df89fac5052a45cc2c1651b9a30978e0

From: slurm-users  on behalf of 
Ratnasamy, Fritz 
Date: Tuesday, August 24, 2021 at 9:59 PM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] scancel gpu jobs when gpu is not requested

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hello,

I have written a script in my prolog.sh that cancels any slurm job if the 
parameter gres=gpu is not present. This is the script i added to my prolog.sh

if [ $SLURM_JOB_PARTITION == "gpu" ]; then
if [ ! -z "${GPU_DEVICE_ORDINAL}" ]; then
echo "GPU ID used is ID: $GPU_DEVICE_ORDINAL "
list_gpu=$(echo "$GPU_DEVICE_ORDINAL" | sed -e "s/,//g")
Ngpu=$(expr length $list_gpu)
else
echo "No GPU selected"
Ngpu=0
fi

   # if  0 gpus were allocated, cancel the job
if [ "$Ngpu" -eq "0" ]; then
  scancel ${SLURM_JOB_ID}  
fi
fi

What the code does is look at the number of GPUs allocated, and if it is 0, 
cancel the job ID. It works fine if a user runs sbatch submit.sh (and the 
submit.sh does not contain --gres=gpu:1). However, when requesting an 
interactive session without GPUs, the job does get killed, but it hangs for 
5-6 minutes before the kill completes.


jlo@mfe01:~ $ srun --partition=gpu --pty bash --login

srun: job 4631872 queued and waiting for resources

srun: job 4631872 has been allocated resources

srun: Force Terminated job 4631872   (the kill hangs here for 5-6 minutes)
Is there anything wrong with my script? Why do I see this hang only when an 
interactive session is cancelled? I would like to get rid of the hang.
Thanks
Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556


Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Renfro, Michael
Did Diego's suggestion from [1] not help narrow things down?

[1] https://lists.schedmd.com/pipermail/slurm-users/2021-August/007708.html

From: slurm-users  on behalf of Jack 
Chen 
Date: Tuesday, August 10, 2021 at 10:08 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Compact scheduling strategy for small GPU jobs

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Does anyone have any ideas on this?

On Fri, Aug 6, 2021 at 2:52 PM Jack Chen 
mailto:scs...@gmail.com>> wrote:
I'm using Slurm 15.08.11. When I submit several 1-GPU jobs, Slurm doesn't 
allocate nodes using a compact strategy. Does anyone know how to solve this? 
Would upgrading to the latest Slurm version help?

For example, take two nodes A and B with 8 GPUs per node. If I submit eight 
1-GPU jobs, Slurm will allocate the first 6 jobs on node A and the last 2 jobs 
on node B. Then when I submit one job with 8 GPUs, it stays pending because of 
GPU fragmentation: node A has 2 idle GPUs and node B has 6 idle GPUs.

Thanks in advance!


Re: [slurm-users] Slurm Scheduler Help

2021-06-11 Thread Renfro, Michael
Not sure it would work out to 60k queued jobs, but we're using:

SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

in our setup. bf_window is driven by our 30-day max job time, bf_resolution is 
at 5% of that time, and the other values are just what we landed on. This did 
manage to address some backfill issues we had in previous years.
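
Two read-only checks that can help while tuning these (no changes are made):

# confirm what the running controller actually has loaded
scontrol show config | grep -E 'SchedulerType|SchedulerParameters'

# backfill statistics: cycle times, depth reached, jobs started by backfill
sdiag | grep -A 20 'Backfilling stats'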

From: slurm-users  on behalf of Dana, 
Jason T. 
Date: Friday, June 11, 2021 at 12:27 PM
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Slurm Scheduler Help

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hello,

I currently manage a small cluster separated into 4 partitions. I am 
experiencing unexpected behavior with the scheduler when the queue has been 
flooded with a large number of jobs by a single user (around 60k) to a single 
partition. We have each user bound to a global grptres CPU limit. Once this 
user reaches their CPU limit the jobs are queued with reason 
“AssocGroupCpuLimit” but after a few hundred or so of the jobs it seems to 
switch to “Priority”. The issue is that once this switch occurs it appears to 
also impact all other partitions. Currently if any job is submitted to any of 
the partitions, regardless of resources available, they are all queued by the 
scheduler with the reason of “Priority”. We had the scheduler initially 
configured for backfill but have also tried switching to builtin and it did not 
seem to make a difference. I tried increasing the default_queue_depth to 10 
and it didn’t seem to help. The scheduler log is also unhelpful as it simply 
lists the accounting-limited jobs and never mentions the “Priority” queued jobs:

sched: [2021-06-11T13:21:53.993] JobId=495780 delayed for accounting policy
sched: [2021-06-11T13:21:53.997] JobId=495781 delayed for accounting policy
sched: [2021-06-11T13:21:54.001] JobId=495782 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] JobId=495783 delayed for accounting policy
sched: [2021-06-11T13:21:54.005] loop taking too long, breaking out

I’ve gone through all the documentation I’ve found on the scheduler and cannot 
seem to resolve this. I’m hoping I’m simply missing something.

Any help would be great. Thank you!

Jason



Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-09 Thread Renfro, Michael
Yep, those are reasons not to create the array of 100k jobs.

From https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04092.html 
(deeper in the thread from one of your references), there's a mention of using 
both 'set -o errexit' inside the job script alongside setting an sbatch 
parameter of '-K' or '--kill-on-bad-exit' to have a job exit if any of its 
processes exit with a non-zero error code.

Assuming all your processes exit with code 0 when things are running normally, 
that could be an option.
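
A minimal sketch of that combination, reusing the alloc8Gb example from the
quoted message (here the kill-on-bad-exit behaviour is applied per srun step
rather than as an sbatch option):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=2G
set -o errexit                         # stop the batch script at the first failure

srun --kill-on-bad-exit=1 ./alloc8Gb   # an OOM kill here exits non-zero...
srun --kill-on-bad-exit=1 ./alloc8Gb   # ...so the later iterations never start
srun --kill-on-bad-exit=1 ./alloc8Gb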

From: slurm-users  on behalf of Arthur 
Gilly 
Date: Tuesday, June 8, 2021 at 10:00 PM
To: 'Slurm User Community List' 
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


I could say that the limit on max array sizes is lower on our cluster, and we 
start to see I/O problems very fast as parallelism scales (which we can limit 
with % as you mention). But the actual reason is simpler, as I mentioned we 
have an entire collection of scripts which were written for a previous LSF 
system where the “kill job on OOM” setting was active. What you are suggesting 
would lead to us rewriting all these scripts so that each submitted job is 
granular (executes only 1 atomic task) and orchestrate all of it using SLURM 
dependencies etc. This is a huge undertaking and I’d rather just find this 
setting, which I’m sure exists.

-
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-

From: slurm-users  On Behalf Of Renfro, 
Michael
Sent: Tuesday, 8 June 2021 20:12
To: Slurm User Community List 
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed

Any reason *not* to create an array of 100k jobs and let the scheduler just 
handle things? Current versions of Slurm support arrays of up to 4M jobs, and 
you can limit the number of jobs running simultaneously with the '%' specifier 
in your array= sbatch parameter.

From: slurm-users 
mailto:slurm-users-boun...@lists.schedmd.com>>
 on behalf of Arthur Gilly 
mailto:arthur.gi...@helmholtz-muenchen.de>>
Date: Tuesday, June 8, 2021 at 4:12 AM
To: 'Slurm User Community List' 
mailto:slurm-users@lists.schedmd.com>>
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Thank you Loris!

Like many of our jobs, this is an embarrassingly parallel analysis, where we 
have to strike a compromise between what would be a completely granular array 
of >100,000 small jobs or some kind of serialisation through loops. So the 
individual jobs where I noticed this behaviour are actually already part of an 
array :)

Cheers,

Arthur

-
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-

From: slurm-users 
mailto:slurm-users-boun...@lists.schedmd.com>>
 On Behalf Of Loris Bennett
Sent: Tuesday, 8 June 2021 16:05
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed

Dear Arthur,

Arthur Gilly 
mailto:arthur.gi...@helmholtz-muenchen.de>> 
writes:

> Dear Slurm users,
>
>
>
> I am looking for a SLURM setting that will kill a job immediately when any 
> subprocess of that job hits an OOM limit. Several posts have touched upon 
> that, e.g: 
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04091.html
>  and
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04190.html
>  or 

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Renfro, Michael
Any reason *not* to create an array of 100k jobs and let the scheduler just 
handle things? Current versions of Slurm support arrays of up to 4M jobs, and 
you can limit the number of jobs running simultaneously with the '%' specifier 
in your array= sbatch parameter.
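
For reference, the array form with a throttle looks roughly like this (the
100000 size, the %50 concurrency cap, and the script name are placeholders,
and the array size is subject to the site's MaxArraySize setting):

#!/bin/bash
#SBATCH --array=1-100000%50     # 100000 elements, at most 50 running at once
#SBATCH --mem=2G
#SBATCH --time=01:00:00

# each element selects its own inputs from the array index
./my_analysis --chunk "${SLURM_ARRAY_TASK_ID}"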

From: slurm-users  on behalf of Arthur 
Gilly 
Date: Tuesday, June 8, 2021 at 4:12 AM
To: 'Slurm User Community List' 
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Thank you Loris!

Like many of our jobs, this is an embarrassingly parallel analysis, where we 
have to strike a compromise between what would be a completely granular array 
of >100,000 small jobs or some kind of serialisation through loops. So the 
individual jobs where I noticed this behaviour are actually already part of an 
array :)

Cheers,

Arthur

-
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-

From: slurm-users  On Behalf Of Loris 
Bennett
Sent: Tuesday, 8 June 2021 16:05
To: Slurm User Community List 
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed

Dear Arthur,

Arthur Gilly 
mailto:arthur.gi...@helmholtz-muenchen.de>> 
writes:

> Dear Slurm users,
>
>
>
> I am looking for a SLURM setting that will kill a job immediately when any 
> subprocess of that job hits an OOM limit. Several posts have touched upon 
> that, e.g: 
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04091.html
>  and
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04190.html
>  or 
> https://bugs.schedmd.com/show_bug.cgi?id=3216
>  but I cannot find an answer that works in our setting.
>
>
>
> The two options I have found are:
>
> 1 Set shebang to #!/bin/bash -e, which we don’t want to do as we’d need to 
> change this for hundreds of scripts from another cluster where we had a 
> different scheduler, AND it would kill tasks for other runtime errors (e.g. 
> if one command in the
> script doesn’t find a file).
>
> 2 Set KillOnBadExit=1. I am puzzled by this one. This is supposed to be 
> overridden by srun’s -K option. Using the example below, srun -K --mem=1G 
> ./multalloc.sh would be expected to kill the job at the first OOM. But it 
> doesn’t, and happily
> keeps reporting 3 oom-kill events. So, will this work?
>
>
>
> The reason we want this is that we have script that execute programs in 
> loops. These programs are slow and memory intensive. When the first one 
> crashes for OOM, the next iterations also crash. In the current setup, we are 
> wasting days
> executing loops where every iteration crashes after an hour or so due to OOM.

Not an answer to your question, but if your runs are independent, would
using a job array help you here?

Cheers,

Loris

> We are using cgroups (and we want to keep them) with the following config:
>
> CgroupAutomount=yes
>
> ConstrainCores=yes
>
> ConstrainDevices=yes
>
> ConstrainKmemSpace=no
>
> ConstrainRAMSpace=yes
>
> ConstrainSwapSpace=yes
>
> MaxSwapPercent=10
>
> TaskAffinity=no
>
>
>
> Relevant bits from slurm.conf:
>
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>
> SelectType=select/cons_tres
>
> GresTypes=gpu,mps,bandwidth
>
>
>
>
>
> Very simple example:
>
> #!/bin/bash
>
> # multalloc.sh – each line is a very simple cpp program that allocates a 8Gb 
> vector and fills it with random floats
>
> echo one
>
> ./alloc8Gb
>
> echo two
>
> ./alloc8Gb
>
> echo three
>
> ./alloc8Gb
>
> echo done.

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-05-14 Thread Renfro, Michael
Untested, but prior experience with cgroups indicates that if things are 
working correctly, even if your code tries to run as many processes as you have 
cores, those processes will be confined to the cores you reserve.

Try a more compute-intensive worker function that will take some seconds or 
minutes to complete, and watch the reserved node with 'top' or a similar 
program. If for example, the job reserved only 1 core and tried to run 20 
processes, you'd see 20 processes in 'top', each at 5% CPU time.

To make the code a bit more polite, you can import the os module and create a 
new variable from the SLURM_CPUS_ON_NODE environment variable to guide Python 
into starting the correct number of processes:

cpus_reserved = int(os.environ['SLURM_CPUS_ON_NODE'])
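
On the submission side, the same idea looks roughly like this (the script name
is a placeholder):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6       # cgroups confine the job to these 6 cores

# make thread pools honour the reservation as well, and let the Python side
# size its multiprocessing pool from SLURM_CPUS_ON_NODE as suggested above
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
python3 worker_script.py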

From: slurm-users  on behalf of Rodrigo 
Santibáñez 
Date: Friday, May 14, 2021 at 5:17 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Exposing only requested CPUs to a job on a given 
node.

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi you all,

I'm replying to get notifications of answers to this question. I have a user 
whose Python script used almost all CPUs, despite being configured to use only 
6 CPUs per task. I reviewed the code, and it doesn't have an explicit call to 
multiprocessing or similar, so the user is unaware of what causes this behavior 
(and so am I).

Running slurm 20.02.6

Best!

On Fri, May 14, 2021 at 1:37 PM Luis R. Torres 
mailto:lrtor...@gmail.com>> wrote:
Hi Folks,

We are currently running on SLURM 20.11.6 with cgroups constraints for memory 
and CPU/Core.  Can the scheduler only expose the requested number of CPU/Core 
resources to a job?  We have some users that employ python scripts with the 
multiprocessing modules, and the scripts apparently use all of the CPUs/cores 
in a node, despite using options to constrain a task to just a given number of 
CPUs. We would like several multiprocessing jobs to run simultaneously on 
the nodes, but not step on each other.

The sample script I use for testing is below; I'm looking for something similar 
to what can be done with the GPU Gres configuration where only the number of 
GPUs requested are exposed to the job requesting them.




#!/usr/bin/env python3

import multiprocessing

def worker():
    print("Worker on CPU #%s" % multiprocessing.current_process().name)
    result = 0
    for j in range(20):
        result += j**2
    print("Result on CPU {} is {}".format(multiprocessing.current_process().name, result))
    return

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    jobs = []
    print("This host exposed {} CPUs".format(multiprocessing.cpu_count()))
    for i in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=worker, name=i).start()

Thanks,
--

Luis R. Torres


Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Renfro, Michael
By the strictest definition of abandonware, not really, they released 9.5rc4 
last week [1]. Won't argue any of the other points, since that's out of my 
depth, but there's a very low-volume mailing list at 
ccr-xdmod-l...@listserv.buffalo.edu<mailto:ccr-xdmod-l...@listserv.buffalo.edu> 
you could inquire at.

[1] https://github.com/ubccr/xdmod/releases/tag/v9.5.0-rc.4

From: Diego Zuccato 
Date: Wednesday, May 12, 2021 at 8:37 AM
To: Renfro, Michael 
Cc: Slurm User Community List 
Subject: Re: [slurm-users] Cluster usage, filtered by partition
Il 12/05/21 13:30, Diego Zuccato ha scritto:

> Anyway, at a first glance, it uses a bit too many technologies for my
> taste (php, java, js...) and could be a problem integrating it in a
> vhost managed by one of our ISPConfig instances. But I'll try it.
> Somehow I'll make it work :)
The more I look at it, the more it smells dead: PhantomJS is officially
abandonware. Too many things that can go wrong and can't be patched, IMVHO.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Renfro, Michael
Not sure which raw numbers you’re looking for, but I’m also getting a CSV 
export from XDMoD to calculate the total number of jobs and CPU hours we’ve 
completed. Doing that through the API as well.

There’s a publicly-accessible XDMoD using data from NSF XSEDE facilities at 
https://xdmod.ccr.buffalo.edu/ — may be the easiest way to explore it.
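
For the bare numbers, a rough per-partition cross-check can also be pulled
straight from sacct, outside XDMoD (the partition name and date range are
examples; jobs spanning the window are counted in full unless --truncate is
added):

# total CPU-hours of jobs that ran in partition "gpu" during May 2021
sacct -a -X -r gpu -S 2021-05-01 -E 2021-06-01 -n -P -o CPUTimeRAW \
  | awk '{s += $1} END {printf "%.1f CPU-hours\n", s/3600}'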

On May 12, 2021, at 3:52 AM, Diego Zuccato  wrote:

Il 11/05/21 21:20, Renfro, Michael ha scritto:

In a word, nothing that's guaranteed to be stable. I got my start from
this reply on the XDMoD list in November 2019. Worked on 8.0:
Tks for the hint.
XDMoD seems interesting and I'll try to have a look. But a scientific
report w/o access to the bare numbers is definitely a no-no :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Renfro, Michael
In a word, nothing that's guaranteed to be stable. I got my start from this 
reply on the XDMoD list in November 2019. Worked on 8.0:

Mike,

The recommended way of doing this would be to use XDMoD's Report Generator to 
periodically email you a document containing the chart images.

https://xdmod.ccr.buffalo.edu/user_manual/?t=Report%20Generator

This will only get you the images, though and not the numerical values.

The more complex alternative is to use curl to query XDMoD directly. An example 
of how to download chart images is in the automated regression tests that 
verify image export:

https://github.com/ubccr/xdmod/blob/xdmod9.0/tests/regression/lib/Controllers/UsageChartsTest.php

See the chartSettingsProvider() for how to create the data to send and 
testChartSettings() for where to POST it. To get the raw numbers you can change 
the 'format'
setting from 'png' to 'csv' to get the raw data in csv format. Note that you 
would be accessing an internal XDMoD api which could change or even be removed 
in new releases.

From: slurm-users  on behalf of Kilian 
Cavalotti 
Date: Tuesday, May 11, 2021 at 1:57 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Cluster usage, filtered by partition
On Tue, May 11, 2021 at 5:55 AM Renfro, Michael  wrote:
>
> XDMoD [1] is useful for this, but it’s not a simple script. It does have some 
> user-accessible APIs if you want some report automation. I’m using that to 
> create a lightning-talk-style slide at [2].
>
> [1] https://open.xdmod.org/
> [2] https://github.com/mikerenfro/one-page-presentation-hpc

Oh, that looks useful! Is the XDMoD API documented somewhere?

Thanks,
--
Kilian


Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Renfro, Michael
XDMoD [1] is useful for this, but it’s not a simple script. It does have some 
user-accessible APIs if you want some report automation. I’m using that to 
create a lightning-talk-style slide at [2].

[1] https://open.xdmod.org/
[2] https://github.com/mikerenfro/one-page-presentation-hpc

On May 11, 2021, at 5:18 AM, Diego Zuccato  wrote:

Il 11/05/21 11:21, Ole Holm Nielsen ha scritto:

Tks for the very fast answer.

I have written some accounting tools which are in
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct
Maybe you can use the "topreports" tool?
Testing it just now. I'll probably have to do some changes (re field
width: our usernames are quite long, being from AD), but first I have to
check if it extracts the info our users want to see :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Testing Lua job submit plugins

2021-05-06 Thread Renfro, Michael
I’ve used the structure at 
https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 to handle 
basic test/production branching. I can isolate the new behavior down to just a 
specific set of UIDs that way.

Factoring out code into separate functions helps, too.

I’ve seen others go so far as to put the functions into separate files, but I 
haven’t needed that yet.
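
One lightweight pre-deployment check is a plain syntax pass with luac, assuming
a Lua toolchain matching the version Slurm was built against is available
(luacheck is an optional extra, if it happens to be installed):

# parse only, no execution: catches syntax errors before slurmctld loads the file
luac -p job_submit.lua && echo "syntax OK"

# optional static lint for undefined globals, shadowed variables, etc.
luacheck job_submit.lua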

On May 6, 2021, at 12:11 PM, Michael Robbert  wrote:



External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


I’m wondering if others in the Slurm community have any tips or best practices 
for the development and testing of Lua job submit plugins. Is there anything 
that can be done prior to deployment on a production cluster that will help to 
ensure the code is going to do what you think it does or at the very least not 
prevent any jobs from being submitted? I realize that any configuration change 
in slurm.conf could break everything, but I feel like adding Lua code adds 
enough complexity that I’m a little more hesitant to just throw it in. Any way 
to run some kind of linting or sanity tests on the Lua script? Additionally, 
does the script get read in one time at startup or reconfig or can it be 
changed on the fly just by editing the file?
Maybe a separate issue, but does anybody have an recipes to build a local test 
cluster in Docker that could be used to test this? I was working on one, but 
broke my local Docker install and thought I’d send this note out while I was 
working on rebuilding it.

Thanks in advance,
Mike Robbert


Re: [slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

2021-04-19 Thread Renfro, Michael
You'll definitely need to get slurmd and slurmctld working before proceeding 
further. slurmctld is the Slurm controller mentioned when you do the srun.

Though there's probably some other steps you can take to make the slurmd and 
slurmctld system services available, it might be simpler to do the rpmbuild and 
rpm commands listed on https://slurm.schedmd.com/quickstart_admin.html , right 
below the instructions you were following. Those two commands will both run 
steps 3-8 of your original procedure, and will almost definitely put the 
systemd service files in the correct location.
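
Roughly, that route looks like the following on Fedora/RHEL-style systems (the
version number is an example and the dependency list is only a representative
subset; rpmbuild will complain about anything that is still missing):

sudo dnf install -y rpm-build munge munge-devel readline-devel openssl-devel \
    pam-devel perl-ExtUtils-MakeMaker mariadb-devel
rpmbuild -ta slurm-20.11.7.tar.bz2        # produces slurm*.rpm under ~/rpmbuild/RPMS
sudo dnf install -y ~/rpmbuild/RPMS/x86_64/slurm-*.rpm
sudo systemctl enable --now munge slurmctld slurmd   # unit files come from the RPMs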

From: slurm-users  on behalf of Johnsy 
K. John 
Date: Monday, April 19, 2021 at 7:18 AM
To: Slurm User Community List , 
fzill...@lenovo.com , johnsy john 
Subject: Re: [slurm-users] [External] Slurm Configuration assistance: Unable to 
use srun after installation (slurm on fedora 33)

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi Florian,
Thanks for the valuable reply and help.

My answers to you are in green.

*  Do you have an active support contract with SchedMD? AFAIK they only offer 
paid support.
I don't have an active support contract. I just started learning Slurm by 
installing it on my Fedora machine. This is the first time I am installing and 
experimenting with this kind of software.

*  The error message is pretty straight forward, slurmctld is not running. Did 
you start it (systemctl start slurmctld)?
I did: systemctl start slurmctld and got this message: Failed to start 
slurmctld.service: Unit slurmctld.service not found.

*  slurmd needs to run on the node(s) you want to run on as well, and as I'm 
guessing you are using localhost for the controller and want to run jobs on 
localhost, so slurmctld and slurmd need to be running on localhost.
systemctl start slurmd
Failed to start slurmd.service: Unit slurmd.service not found.
Similar to slurmctrld

*  Is munge running?
Yes. Here is the status:
[johnsy@homepc ~]$ systemctl status munge
munge.service - MUNGE authentication service
 Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor 
preset: disabled)
 Active: active (running) since Mon 2021-04-19 07:49:13 EDT; 13min ago <-- 
it is always enabled after restart. This log is just after a restart.
   Docs: man:munged(8)
Process: 1070 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
   Main PID: 1072 (munged)
  Tasks: 4 (limit: 76969)
 Memory: 1.4M
CPU: 8ms
 CGroup: /system.slice/munge.service
 └─1072 /usr/sbin/munged

*  May I ask why you're chown-ing pid and logfiles? The slurm user (typically 
"slurm") needs to have access to those files. Munge for instance checks for 
ownership and complains if something is not correct.
I tried to follow some instructions mentioned in: 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#copy-slurm-conf-to-all-nodes
I thought, as I am installing the slurm as root, the user "johnsy" has to have 
ownership permissions.

*  "srun /proc/cpuinfo" will fail, even if slurmctld and slurmd are running, 
because /proc/cpuinfo is not an executable file. You may want to insert "cat" 
after srun. Another simple test would be "srun hostname"
I tried : srun hostname and got the following error message:
srun: error: Unable to allocate resources: Unable to contact slurm controller 
(connect failure)

Also tried:
systemctl status slurmctld
Unit slurmctld.service could not be found.

Also I tried installing the packaged version: 
https://src.fedoraproject.org/rpms/slurm
 using dnf.
The same problem exists.

Any help in this regard will be appreciated.

Thanks a lot.
Johnsy


On Mon, Apr 19, 2021 at 5:04 AM Florian Zillner 
mailto:fzill...@lenovo.com>> wrote:
Hi Johnsy,

  1.  Do you have an active support contract with SchedMD? AFAIK they only 
offer paid support.
  2.  The error message is pretty straight forward, slurmctld is not running. 
Did you start it (systemctl start slurmctld)?
  3.  slurmd needs to run on the node(s) you want to run on as well, and as I'm 

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-16 Thread Renfro, Michael
I can't speak to what happens on node failure, but I can at least get you a 
greatly simplified pair of scripts that will run only one copy on each node 
allocated:


#!/bin/bash
# notarray.sh
#SBATCH --nodes=28
#SBATCH --ntasks-per-node=1
#SBATCH --no-kill
echo "notarray.sh is running on $(hostname)"
srun --no-kill somescript.sh


and


#!/bin/bash
# somescript.sh
echo "somescript.sh is running on $(hostname)"


I can verify that after submitting the job with "sbatch notarray.sh":

  *   notarray.sh ran on only one allocated node, and
  *   somescript.sh ran once on each of the 28 nodes allocated, including the 
one that notarray.sh ran on.

No need to pass srun a set of parameters for how many tasks to run, since it 
can figure that out from the sbatch context.

From: slurm-users  on behalf of Robert 
Peck 
Date: Friday, April 16, 2021 at 2:40 PM
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Grid engine slaughtering parallel jobs when any one of 
them fails (copy)

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Excuse me, I am trying to run some software on a cluster which uses the SLURM 
grid engine. IT support at my institution have exhausted their knowledge of 
SLURM in trying to debug this rather nasty bug with a specific feature of the 
grid engine and suggested I try here for tips.

I am using jobs of the form:

#!/bin/bash
#SBATCH --job-name=name   # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, 
ALL)
#SBATCH --mail-user=my_email@thing.thing # Where to send mail

#SBATCH --mem=2gb# Job memory request, not hugely 
intensive
#SBATCH --time=47:00:00  # Time limit hrs:min:sec, the sim 
software being run from within the bash script is quite slow, extra memory 
can't speed it up and it can't run multi-core, hence long runs on weak nodes

#SBATCH --nodes=100
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1

#SBATCH --output=file_output_%j.log# Standard output and error log
#SBATCH --account=code   # Project account
#SBATCH --ntasks-per-core=1 #only 1 task per core, must not be more
#SBATCH --ntasks-per-node=1 #only 1 task per node, must not be more
#SBATCH --ntasks-per-socket=1 #guessing here but fairly sure I don't want 
multiple instances trying to use same socket

#SBATCH --no-kill # supposedly prevents restart of other jobs on other nodes if 
one of the 100 gets a NODE_FAIL


echo My working directory is `pwd`
echo Running job on host:
echo -e '\t'`hostname` at `date`
echo

module load toolchain/foss/2018b
cd scratch
cd further_folder
chmod +x my_bash_script.sh

srun --no-kill -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" 
./my_bash_script.sh

wait
echo
echo Job completed at `date`

I use a bash script to launch my special software and stuff which actually 
handles each job, this software is a bit weird and two copies of it WILL NOT 
EVER play nicely if made to share a node. Hence this job acts to launch 100 
copies on 100 nodes, each of which does its own stuff and writes out to a 
separate results file. I later proces the results files.

In my scenario I want 100 jobs to run, but if one or two failed and I only got 
99 or 95 back then I could work fine for further processing with just 99 or 95 
result files. Getting back a few less jobs then I want is no tragedy for my 
type of work.

But the problem is that when any one node has a failure, not that rare when 
you're calling for 100 nodes simultaneously, SLURM would by default murder the 
WHOLE LOT of jobs, and even more confusingly then restart a bunch of them which 
ends up with a very confusing pile of results files. I thought the --no-kill 
flag should prevent this fault, but instead of preventing the killing of all 
jobs due to a single failure it only prevents the restart, now I get a 
misleading message from the cluster telling me of a good exit code when such 
slaughter occurs, but when I log in to the cluster I discover a grid engine 
massacre of my jobs, all because just one of them failed.

I understand that for interacting jobs on many nodes then killing all of them 
because of one failure can be necessary, but my jobs are strictly parallel, no 
cross-interaction between them at all. each is an utterly separate simulation 
with different starting parameters. I need to ensure that if one job fails and 
must be killed then the rest are not affected.

I have been advised that due to the simulation software being such as to refuse 
to run >1 copy properly on any given node at once I am NOT able to use "array 
jobs" and must stick to this sort of job which requests 100 nodes this way.

Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs 
because ONE (or a few) node(s) fails?

All my research is being put on hold by this bug which is making getting large 
runs 

Re: [slurm-users] derived counters

2021-04-13 Thread Renfro, Michael
I'll never miss an opportunity to plug XDMoD for anyone who doesn't want to 
write custom analytics for every metric. I've managed to get a little bit into 
its API to extract current values for number of jobs completed and the number 
of CPU-hours provided, and insert those into a single slide presentation for 
introductory meetings.

You can see a working version of it for the NSF XSEDE facilities at 
https://xdmod.ccr.buffalo.edu

From: slurm-users  on behalf of Hadrian 
Djohari 
Date: Tuesday, April 13, 2021 at 8:11 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] derived counters

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi Frank,

A way to get "how long jobs wait in the queue" is to import the data to XDMOD 
(https://open.xdmod.org/9.0/index.html).
The nifty reporting tool has many features to make it easier for us to report 
out the cluster usage.

Hadrian

On Tue, Apr 13, 2021 at 8:08 AM Heckes, Frank 
mailto:hec...@mps.mpg.de>> wrote:
Hello Ole,

> >> -Original Message-
> >>>* (average) queue length for a certain partition
>
> I wonder what exactly does your question mean?  Maybe the number of jobs or
> CPUs in the Pending state?  Maybe relative to the number of CPUs in the
> partition?
>
This results from a management question: how long do jobs have to wait (in s, 
min, h, days) before they get executed, and how many jobs are waiting (queued) 
for each partition in a certain time interval.
The first one is easy to find with sacct and submit, start counts + difference 
+ averaging.
The second is a bit cumbersome, so I wonder whether a 'solution' is already 
around. The easiest way is to monitor from the beginning and store the squeue 
output for later evaluation. Unfortunately I didn't do that.
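
Going forward, that squeue sampling can be as small as a cron-able one-liner
per partition (the partition name and log path are examples):

# append a timestamped count of pending jobs in one partition
echo "$(date '+%F %T') $(squeue -h -t PENDING -p normal | wc -l)" \
    >> "$HOME/queue_length_normal.log"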

Cheers,
-Frank

> The "slurmacct" command prints (possibly for a specified partition) the
> average job waiting time while Pending in the queue, but not the queue length
> information.
>
> It may be difficult to answer your question from the Slurm database.  The 
> sacct
> command displays accounting data for all jobs and job steps, but not directly
> for partitions.
>
> There are other Slurm monitoring tools which perhaps can supply the data you
> are looking for.  You could ask this list again.
>
> /Ole


--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490


Re: [slurm-users] [External] Autoset job TimeLimit to fit in a reservation

2021-03-30 Thread Renfro, Michael
I'd probably write a shell function that would calculate the time required, and 
add it as a command-line parameter to sbatch. We do a similar thing for easier 
interactive shells in our /etc/profile.d folder on the login node:

function hpcshell() {
  srun --partition=interactive $@ --pty bash -i
}

So something like (untested):

weekendjob () {
  WEEKRESEND=$(scontrol show res | head -n1 | awk '{print $3}' | cut -f2 -d= | 
xargs -I {} date +%s --date "{}" )
  NOW=$(date +%s)
  HOURS=$(((WEEKRESEND - NOW) / 3600))
  sbatch --time=${HOURS}:00:00 $@
}

and:

  weekendjob myjob.sh

could work, with the possible downside that anyone using this early in the 
reservation period would potentially reserve resources for the entire weekend.

From: slurm-users  on behalf of Jeremy 
Fix 
Date: Tuesday, March 30, 2021 at 2:24 AM
To: Florian Zillner , slurm-users@lists.schedmd.com 

Subject: Re: [slurm-users] [External] Autoset job TimeLimit to fit in a 
reservation

External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


Hi Florian, and first of all , thanks for this single line command which is 
already quite a lot; To give it a bit more sense, I will compute the remaining 
time until the reservation ends, so just changing $2 in $3 in your command

# WEEKRESEND=$(scontrol show res | head -n1 | awk '{print $3}' | cut -f2 -d= | 
xargs -I {} date +%s --date "{}" )
# NOW=$(date +%s)
# echo "$(((WEEKRESEND - NOW) / 3600)) hours left until reservation ends"
178 hours left until reservation ends

Trying to give it some sense, here is the use case. We use Slurm for allocating 
resources for teaching purposes. We want to allow the students to allocate as 
they wish during the weekends, starting Friday evening up to Monday early 
morning. For that we are thinking about creating a reservation that is 
regularly scheduled every week and assigned to a partition with the subset of 
nodes we want the students to be able to work with.

Then comes the reason why we need to set the time limit. The partitions have 
their own time limit, and we set it to, say, two and a half days, so that a job 
submitted on Friday evening is allowed to run until Monday morning. Now, 
suppose a user starts a job on Sunday evening without specifying a time limit: 
the default will be the partition's one (two and a half days), and if there is 
a reservation scheduled on Monday morning (for a practical, say), then if I'm 
not wrong Slurm will not allow the allocation, because it will consider that 
Sunday evening plus the default partition time limit overlaps with a planned 
reservation, which makes sense.

The first answer could be to ask the user to specify the time limit on their 
job so that its allocation does not overlap with the next reservation, but we 
thought about setting that automatically, hence my question.

So, in our use case, it's not a problem to kill a job by the end of the 
reservation even if the job is not completed because we may need the resources 
for a practical on monday morning.

Now, back to your proposal: I was thinking about putting that line in a job 
prolog, but (maybe that's what you meant by "putting the cart before the 
horse") I may run into trouble there. For the prolog to be executed, the job 
has to be allocated, but it will not be with the default time limit of the 
partition, because Slurm may not allow it (given that a practical may be 
reserved on Monday morning). Do you think there is any other way than asking 
the user to set the time limit themselves in their srun/sbatch command?

Best;

Jeremy.

On 29/03/2021 22:09, Florian Zillner wrote:
Hi,

well, I think you're putting the cart before the horse, but anyway, you could 
write a script that extracts the next reservation and does some simple math to 
display the remaining time in hours (or whatever else) to the user. It's the 
user's job to set the time their job needs to finish. Auto-squeezing a job that 
takes 2 days to complete into a remaining 2-hour window until the reservation 
starts doesn't make any sense to me.

# NEXTRES=$(scontrol show res | head -n1 | awk '{print $2}' | cut -f2 -d= | 
xargs -I {} date +%s --date "{}" )
# NOW=$(date +%s)
# echo "$(((NEXTRES - NOW) / 3600)) hours left until reservation begins"
178 hours left until reservation begins

Cheers,
Florian



From: slurm-users 

 on behalf of Jeremy Fix 

Sent: Monday, 29 March 2021 10:48
To: slurm-users@lists.schedmd.com 

Subject: [External] [slurm-users] Autoset job TimeLimit to fit in a reservation

Hi,

I'm wondering if there is any built-in option to autoset a job TimeLimit
to fit within a defined reservation.

For now, it seems to me that the time limit must be explicitly provided,
in agreement with 

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Renfro, Michael
Just a starting guess, but are you certain the MATLAB script didn’t try to 
allocate enormous amounts of memory for variables? That’d be about 16e9 
floating point values, if I did the units correctly.
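
(For the arithmetic: the job requested 128 GB, which is roughly 128e9 bytes, 
and at 8 bytes per double-precision value that is about 128e9 / 8 = 16e9 
values, so one oversized array could plausibly account for it.)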

On Mar 15, 2021, at 12:53 PM, Chin,David  wrote:



Hi, all:

I'm trying to understand why a job exited with an error condition. I think it 
was actually terminated by Slurm: job was a Matlab script, and its output was 
incomplete.

Here's sacct output:

JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
------------  ----------  ----  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1

Thanks in advance,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode




Re: [slurm-users] Managing Multiple Dependencies

2021-03-02 Thread Renfro, Michael
There may be prettier ways, but this gets the job done. It captures the output 
from each sbatch command to get a job ID, colon-separates the IDs of the jobs 
in the second group, and removes the trailing colon before submitting the last 
job:


#!/bin/bash
JOB1=$(sbatch job1.sh | awk '{print $NF}')
echo "Submitted job 1, id ${JOB1}"
JOB2N=""
JOB2N_FILES="job2.sh job3.sh"
for job in ${JOB2N_FILES}; do
  JOB2N="${JOB2N}$(sbatch --dependency=afterok:$JOB1 $job | awk '{print $NF}'):"
  echo "Submitted $job, list is now ${JOB2N}"
done
JOB2N=$(echo ${JOB2N} | sed 's/:$//g')
echo "Submitting last job"
sbatch --dependency=afterok:$JOB2N joblast.sh
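
A variant of the same idea (untested sketch): sbatch has a --parsable flag 
that prints just the job ID (plus ";cluster" on multi-cluster setups), which 
avoids the awk step:

JOB1=$(sbatch --parsable job1.sh)
DEPS=""
for job in job2.sh job3.sh; do
  # append each dependency job ID with a leading colon
  DEPS="${DEPS}:$(sbatch --parsable --dependency=afterok:${JOB1} ${job})"
done
sbatch --dependency=afterok${DEPS} joblast.sh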



From: slurm-users  on behalf of Jason 
Simms 
Date: Tuesday, March 2, 2021 at 1:18 PM
To: Slurm User Community List 
Subject: [slurm-users] Managing Multiple Dependencies

Hello all,

I am relatively new to the nuances of handling complex dependencies in Slurm, 
so I'm hoping the hive mind can help. I have a user wanting to accomplish the 
following:

  *   submit one job
  *   submit multiple jobs that are dependent on the output from the first job 
(so they just need to launch once the first job has completed)
  *   submit one final job dependent on all the previous jobs completing
Is there a way to do this cleanly? So it's a three stage process. I have ideas 
in my head of writing Slurm JobIDs to a file, reading them out, and managing 
dependencies that way, but perhaps there is a more efficient way (or perhaps 
not!).

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632


Re: [slurm-users] using resources effectively?

2020-12-16 Thread Renfro, Michael
We have overlapping partitions for GPU work and some kinds non-GPU work (both 
large memory and regular memory jobs).

For 28-core nodes with 2 GPUs, we have:

PartitionName=gpu MaxCPUsPerNode=16 … Nodes=gpunode[001-004]
PartitionName=any-interactive MaxCPUsPerNode=12 … 
Nodes=node[001-040],gpunode[001-004]
PartitionName=bigmem MaxCPUsPerNode=12 … Nodes=gpunode[001-003]
PartitionName=hugemem MaxCPUsPerNode=12 … Nodes=gpunode004

Worst case, non-GPU jobs could reserve up to 24 of the 28 cores on a GPU node, 
but only for a limited time (our any-interactive partition has a 2 hour time 
limit). In practice, it has let us use a lot of otherwise idle CPU capacity in 
the GPU nodes for short test runs.
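
For the 112-core, 1-GPU node in the question below, the same idea might look 
something like this (untested sketch; node and partition names are made up, 
and the 20/92 split simply mirrors the 20-core GPU job in the example):

NodeName=bignode01 CPUs=112 Gres=gpu:1 …
PartitionName=gpu MaxCPUsPerNode=20 … Nodes=bignode01
PartitionName=cpu MaxCPUsPerNode=92 … Nodes=bignode01

That way CPU-only jobs can never take more than 92 of the 112 cores, so 20 
cores always stay free for GPU work.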

From: slurm-users 
Date: Wednesday, December 16, 2020 at 1:04 PM
To: Slurm User Community List 
Subject: [slurm-users] using resources effectively?

Hi,

Say if I have a Slurm node with 1 x GPU and 112 x CPU cores, and:

 1) there is a job running on the node using the GPU and 20 x CPU cores

 2) there is a job waiting in the queue asking for 1 x GPU and 20 x
CPU cores

Is it possible to a) let a new job asking for 0 x GPU and 20 x CPU cores
(safe for the queued GPU job) start immediately; and b) let a new job
asking for 0 x GPU and 100 x CPU cores (not safe for the queued GPU job)
wait in the queue? Or c) is it doable to put the node into two Slurm
partitions, 56 CPU cores to a "cpu" partition, and 56 CPU cores to a
"gpu" partition, for example?

Thank you in advance for any suggestions / tips.

Best,

Weijun

===
Weijun Gao
Computational Research Support Specialist
Department of Psychology, University of Toronto Scarborough
1265 Military Trail, Room SW416
Toronto, ON M1C 1M2
E-mail: weijun@utoronto.ca



Re: [slurm-users] FairShare

2020-12-02 Thread Renfro, Michael
Yesterday, I posted 
https://docs.rc.fas.harvard.edu/kb/fairshare/
in response to a similar question. If you want the simplest general 
explanation of FairShare values, it's that they range from 0.0 to 1.0: values 
above 0.5 indicate that an account or user has used less than their share of 
the resource, and values below 0.5 indicate that it has used more than its 
share of the resource.

Since all your users have the same RawShares value and are entitled to the same 
share of the resource, you can see that bdehaven has the most RawUsage and the 
lowest FairShare value, followed by ajoel and xtsao with almost identical 
RawUsage and FairShare, and finally ahantau with very little usage and the 
highest FairShare value.

We use FairShare here as the dominant factor in priorities for queued jobs: if 
you're a light user, we bump up your priority over heavier users, and your job 
starts quicker than those for heavier users, assuming all other job attributes 
are equal.

All these values are relative: in our setup, we'd bump ahantau's pending jobs 
ahead of the others, and put bdehaven's at the end. But if root needed to run a 
job outside the sray account, they'd get an enormous bump ahead since the sray 
account has used far more than its fair share of the resource.
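
To see these numbers for the whole tree (the same columns as in your output, 
including LevelFS and FairShare), something like this should work:

sshare -a -l   # all accounts and users, long format
sshare -l -A sray  # just the sray account and its users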

From: slurm-users 
Date: Wednesday, December 2, 2020 at 11:23 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] FairShare


I've read the manual and I re-read the other link. What they boil down to is 
Fair Share is calculated based on a recondite "rooted plane tree", which I do 
not have the background in discrete math to understand.

I'm hoping someone can explain it so my little kernel can understand.

From: slurm-users  on behalf of Micheal 
Krombopulous 
Sent: Wednesday, December 2, 2020 9:32 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] FairShare

Can someone tell me how to calculate fairshare (under fairtree)? I can't figure 
it out. I would have thought it would be the same score for all users in an 
account. E.g., here is one of my accounts:

Account User  RawShares  NormSharesRawUsage   NormUsage  EffectvUsage   
 LevelFS  FairShare
 -- -- --- --- --- 
- -- --
root   0.00  611349 
 1.00
 root  root 10.076923   0
0.00  0.00inf   1.00
 sray  10.076923  30921 
0.505582  0.505582   0.152147
  sray phedge10.05   00.00  
0.00inf   0.181818
  srayraab  10.05   0
0.00  0.00inf   0.181818
  sraybenequist  10.05   00.00  
0.00inf   0.181818
  sray bosch   10.05   0
0.00  0.00inf   0.181818
  srayrjenkins 10.05   0
0.00  0.00inf   0.181818
  sray  esmith10.05   00.00 
 0.00 1.7226e+07   0.054545
  sray  gheinz10.05   00.00 
 0.00 1.9074e+14   0.072727
  sray  jfitz 10.05   0
0.00  0.00 8.0640e+20   0.081818
  sray   ajoel  10.05   42449
0.069465  0.137396   0.363913   0.018182
  sray  jmay   10.05   0
0.00  0.00inf   0.181818
  sray aferrier10.05   0
0.00  0.00inf   0.181818
  sraybdehaven 10.05  2250020.367771
  0.727420   0.068736   0.009091
  sraymsmythe  10.05   00.00
  0.00inf   0.181818
  sray gfink   10.05   0
0.00  0.00 2.0343e+05   0.045455
  srayahantau   10.05  310.51   
   0.000102 

Re: [slurm-users] Doubts with Fairshare

2020-12-01 Thread Renfro, Michael
Harvard's Arts & Sciences Research Computing group has a good explanation of 
these columns at https://docs.rc.fas.harvard.edu/kb/fairshare/ -- might not 
answer your exact question, but it does go into how the FairShare column is 
calculated.

From: slurm-users 
Date: Tuesday, December 1, 2020 at 5:13 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Doubts with Fairshare
Hello,

My SLURM cluster is applying “FairShare” with these values:
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityUsageResetPeriod=QUARTERLY
PriorityFavorSmall=NO
PriorityMaxAge=7-0
PriorityWeightAge=1
PriorityWeightFairshare=100
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0

However, what I get with “sshare -l -u my_user” and “sacctmgr list associations 
user=my_user format=fairshare” differs...

[root@server ~]# sacctmgr list associations user=my_user format=fairshare
Share
-
1


[root@server ~]# sshare -l -u my_user
 Account   User  RawShares  NormSharesRawUsage   NormUsage  
EffectvUsage  FairShareLevelFS
 -- -- --- --- --- 
- -- --
root  1.00  829845  
1.00   0.50   0.00
dept10.20  2257940.271566   
   0.271566   0.390169   0.00
  d1 10.018182   939200.113260  
0.127651   0.007701   0.00
  d2 10.018182   00.00  
0.024688   0.390169   0.00
  d3 10.018182   00.00  
0.024688   0.390169   0.00
  d4 10.018182   00.00  
0.024688   0.390169   0.00
  d5 10.018182   00.00  
0.024688   0.390169   0.00
  d6 10.018182   00.00  
0.024688   0.390169   0.00
  d7 10.018182   00.00  
0.024688   0.390169   0.00
  d8 10.018182   817440.097854  
0.113646   0.013134   0.00
  d9 10.018182   501290.060452  
0.079644   0.048013   0.00
  d1010.018182   00.00  
0.024688   0.390169   0.00
  d1110.018182   00.00  
0.024688   0.390169   0.00
members 10.20  5407650.652118   
   0.652118   0.104343   0.00
  members  my_user   10.015385   20.04  
0.050166   0.104328   0.00
master  10.20   632850.076317   
   0.076317   0.767595   0.00
  m1 10.10   00.00  
0.038158   0.767595   0.00
  m2 10.10   632850.076317  
0.076317   0.589202   0.00
tfg 10.20   00.00   
   0.00   1.00   0.00


Could anybody explain it? I thought the value in the sshare "FairShare" column 
must be the same as the one in the sacctmgr "Share" column... but no...

Thanks.



Re: [slurm-users] sbatch overallocation

2020-10-10 Thread Renfro, Michael
I think the answer depends on why you’re trying to prevent the observed 
behavior:


  *   Do you want to ensure that one job requesting 9 tasks (and 1 CPU per 
task) can’t overstep its reservation and take resources away from other jobs on 
those nodes? Cgroups [1] should be able to confine the job to its 9 CPUs (a 
minimal configuration sketch follows after the references below), and even if 
8 processes get started at once in the job, they'll only drive up the nodes' 
load average, and not affect others' performance.
  *   Are you trying to define a workflow where these 8 jobs can be run in 
parallel, and you want to wait until they’ve all completed before starting 
another job? Job dependencies using the --dependency flag to sbatch [2] should 
be able to handle that.

[1] https://slurm.schedmd.com/cgroups.html
[2] https://slurm.schedmd.com/sbatch.html
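
For the first bullet, a minimal cgroup confinement sketch (the parameter names 
come from the slurm.conf and cgroup.conf man pages; whether to constrain RAM 
as well is a site choice):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes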

From: slurm-users  on behalf of Max 
Quast 
Reply-To: Slurm User Community List 
Date: Saturday, October 10, 2020 at 6:06 AM
To: 
Subject: [slurm-users] sbatch overallocation

Dear slurm-users,

I built a slurm system consisting of two nodes (Ubuntu 20.04.1, slurm 20.02.5):

# COMPUTE NODES
GresTypes=gpu
NodeName=lsm[216-217] Gres=gpu:tesla:1 CPUs=64 
RealMemory=192073 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=admin Nodes=lsm[216-217] Default=YES 
MaxTime=INFINITE State=UP

The slurmctl is running on a separate Ubuntu system where no slurmd is 
installed.

If a user executes this script (sbatch srun2.bash)

#!/bin/bash
#SBATCH -N 2 -n9
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-10 
-parallel > /dev/null &
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-11 
-parallel > /dev/null &
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-12 
-parallel > /dev/null &
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-13 
-parallel > /dev/null &
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-14 
-parallel > /dev/null &
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-15 
-parallel > /dev/null &
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-16 
-parallel > /dev/null &
srun pimpleFoam -case /mnt/NFS/users/quast/channel395-17 
-parallel > /dev/null &
wait

8 jobs with 9 threads are launched and distributed on two nodes.

If more such scripts get started at the same time, all the srun commands will 
be executed even though no free cores are available. So the nodes are 
overallocated.
How can this be prevented?

Thx :)

Greetings
max



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Renfro, Michael
From any node you can run scontrol from, what does ‘scontrol show node 
GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and 
“CfgTRES=”.
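
For comparison, a gres.conf that names the GPU device files explicitly looks 
something like this (node name and device numbering here are placeholders):

NodeName=gpunode001 Name=gpu File=/dev/nvidia[0-1]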

From: slurm-users  on behalf of Sajesh 
Singh 
Reply-To: Slurm User Community List 
Date: Thursday, October 8, 2020 at 3:33 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] CUDA environment variable not being set



It seems as though the modules are loaded as when I run lsmod I get the 
following:

nvidia_drm 43714  0
nvidia_modeset   1109636  1 nvidia_drm
nvidia_uvm935322  0
nvidia  20390295  2 nvidia_modeset,nvidia_uvm

Also the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-+
| NVIDIA-SMI 440.64.00Driver Version: 440.64.00CUDA Version: 10.2 |
|---+--+--+
| GPU  NamePersistence-M| Bus-IdDisp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage | GPU-Util  Compute M. |
|===+==+==|
|   0  Quadro M5000Off  | :02:00.0 Off |  Off |
| 33%   21CP045W / 150W |  0MiB /  8126MiB |  0%  Default |
+---+--+--+
|   1  Quadro M5000Off  | :82:00.0 Off |  Off |
| 30%   17CP045W / 150W |  0MiB /  8126MiB |  0%  Default |
+---+--+--+

+-+
| Processes:   GPU Memory |
|  GPU   PID   Type   Process name Usage  |
|=|
|  No running processes found |
+-+

--

-SS-

From: slurm-users  On Behalf Of Relu 
Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set


That usually means you don't have the nvidia kernel module loaded, probably 
because there's no driver installed.

Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908

I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and 
gres.conf of the cluster, but if I launch a job requesting GPUs the environment 
variable CUDA_VISIBLE_DEVICES Is never set and I see the following messages in 
the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS


Re: [slurm-users] Simple free for all cluster

2020-10-02 Thread Renfro, Michael
Depending on the users who will be on this cluster, I'd probably adjust the 
partition to have a defined, non-infinite MaxTime, and maybe a lower 
DefaultTime. Otherwise, it would be very easy for someone to start a job that 
reserves all cores until the nodes get rebooted, since all they have to do is 
submit a job with no explicit time limit (which would then use DefaultTime, 
which itself has a default value of MaxTime). 
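
For example, something like this (illustrative values only) would cap any job 
at two days and default unspecified jobs to one hour:

PartitionName=sl Nodes=slnode[1-8] Default=YES DefaultTime=01:00:00 MaxTime=2-00:00:00 State=UP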

On 10/2/20, 7:37 AM, "slurm-users on behalf of John H" 
 wrote:

Hi All

Hope you are all keeping well in these difficult times.

I have setup a small Slurm cluster of 8 compute nodes (4 x 1-core CPUs, 
16GB RAM) without scheduling or accounting as it isn't really needed.

I'm just looking for confirmation it's configured correctly to allow the 
controller to 'see' all resources and allocate incoming jobs to the most 
readily available node in the cluster. I can see jobs are being delivered to 
different nodes, but I want to ensure I haven't inadvertently done anything to 
render it suboptimal (even in such a simple use case!)

Thanks very much for any assistance, here is my cfg:

#
# SLURM.CONF
ControlMachine=slnode1
BackupController=slnode2
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
MinJobAge=86400
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_MEMORY
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
NodeName=slnode[1-8] CPUs=4 Boards=1 SocketsPerBoard=4 CoresPerSocket=1 
ThreadsPerCore=1 RealMemory=16017
PartitionName=sl Nodes=slnode[1-8] Default=YES MaxTime=INFINITE State=UP

John



Re: [slurm-users] Running gpu and cpu jobs on the same node

2020-09-30 Thread Renfro, Michael
I could have missed a detail on my description, but we definitely don’t enable 
oversubscribe, or shared, or exclusiveuser. All three of those are set to “no” 
on all active queues.

Current subset of slurm.conf and squeue output:

=

# egrep '^PartitionName=(gpu|any-interactive) ' /etc/slurm/slurm.conf
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 
MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 
DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF 
ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO 
MaxCPUsPerNode=16 ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP 
TRESBillingWeights=CPU=3.00,Mem=1.024G,GRES/gpu=30.00 Nodes=gpunode[001-004]
PartitionName=any-interactive Default=NO MinNodes=1 MaxNodes=4 
DefaultTime=02:00:00 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 
PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 
PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL 
LLN=NO MaxCPUsPerNode=12 ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 
State=UP TRESBillingWeights=CPU=3.00,Mem=1.024G,GRES/gpu=30.00 
Nodes=node[001-040],gpunode[001-004]
# squeue -o "%6i %.15P %.10j %.5u %4C %5D %16R %6b" | grep gpunode002
778462 gpu CNN_GRU.sh miibr 11 gpunode002   gpu:1
778632 any-interactive   bash rnour 11 gpunode002   N/A

=

From: slurm-users  on behalf of Relu 
Patrascu 
Reply-To: Slurm User Community List 
Date: Wednesday, September 30, 2020 at 4:02 PM
To: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] Running gpu and cpu jobs on the same node

If you don't use OverSubscribe then resources are not shared. What resources a 
job gets allocated is not available to other jobs, regardless of partition.

Relu
On 2020-09-30 16:12, Ahmad Khalifa wrote:
I have a machine with 4 rtx2080ti and a core i9. I submit jobs to it through 
MPI PMI2 (from Relion).

If I use 5 MPI and 4 threads, then basically I'm using all 4 GPUs and 20 
threads of my cpu.

My question is: my current configuration allows submitting jobs to the same 
node but with a different partition, and I'm not sure whether, if I use #SBATCH 
--partition=cpu, the submitted jobs will only use the remaining 2 cores (4 
threads) or whether they are going to share resources with my GPU job?!

Thanks.




Re: [slurm-users] Limit a partition or host to jobs less than 4 cores?

2020-09-30 Thread Renfro, Michael
Untested, but a combination of a QOS with MaxTRESPerJob=cpu=X and a partition 
that allows or denies that QOS may work. A job_submit.lua should be able to 
adjust the QOS of a submitted job, too.
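
Something along these lines, equally untested (the QOS name, node list, and 
CPU count are placeholders):

# one-time setup: a QOS that caps any single job at 4 CPUs
sacctmgr add qos smalljobs set MaxTRESPerJob=cpu=4

# slurm.conf: apply that QOS's limits to every job in the partition
PartitionName=small Nodes=node0[01-04] QOS=smalljobs …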

On 9/30/20, 10:50 AM, "slurm-users on behalf of Paul Edmon" 
 
wrote:


Probably the best way to accomplish this is via a job_submit.lua
script.  That way you can reject at submission time.  There isn't a
feature in the partition configurations that I am aware that can
accomplish this but a custom job_submit script certainly can.

-Paul Edmon-

On 9/30/2020 11:44 AM, Jim Kilborn wrote:
> Does anyone know if there is a way to limit a partition (or a host in
> a partition) to only allow jobs with fewer than x cores? It would
> be preferable to not have to move the host to a separate partition,
> but we could if necessary. I just want to have a place where only small
> jobs can run. I can't find a parameter in slurm.conf that allows this,
> or I am overlooking something.
>
> Thanks in advance!
>




Re: [slurm-users] Mocking SLURM to debug job_submit.lua

2020-09-23 Thread Renfro, Michael
Not having a separate test environment, I put logic into my job_submit.lua to 
use either the production settings or the ones under development or testing, 
based off the UID of the user submitting the job:

=

function slurm_job_submit(job_desc, part_list, submit_uid)
   test_user_table = {}
   test_user_table[ENTER_USER_UID_HERE] = 'a_username'

   test_enabled = (test_user_table[submit_uid] ~= nil)
   -- test_enabled = false
   if (test_enabled) then -- use logic for testing
  slurm.log_info("testing mode enabled")
  -- call other functions as needed for testing
   else -- use default logic for production
  slurm.log_info("production mode enabled")
  -- call other functions as needed for production
   end -- detect if testing or production

   return slurm.SUCCESS
end

=

I can set test_enabled to false if I want to disable testing entirely, 
otherwise, I can make a table of my test user population. I don't think I've 
had anyone's production jobs fail when I'm making changes this way, but I'm not 
100% certain that a syntax error in functions used in the testing branch would 
be ignored in the production branch. And I've tried to modularize as much as I 
can into separate functions outside the slurm_job_submit function.

On 9/23/20, 11:09 AM, "slurm-users on behalf of SJTU" 
 
wrote:


Hi,

Modifying and testing  job_submit.lua on a production SLURM system may lead 
to temporary failure of job submission, which halts new scheduling strategies 
being applied. Is it possible to mock a SLURM system to debug job_submit.lua so 
that it can be updated to the production system confidently later?


Thank you!

Jianwen



Re: [slurm-users] Question/Clarification: Batch array multiple tasks on nodes

2020-09-01 Thread Renfro, Michael
We set DefMemPerCPU in each partition to approximately the amount of RAM in a 
node divided by the number of cores in the node. For heterogeneous partitions, 
we use a lower limit, and we always reserve a bit of RAM for the OS, too. So 
for a 64 GB node with 28 cores, we default to 2000 M per CPU, and set the 
node’s realmemory to around 62 GB.
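
As a concrete (illustrative) sketch of that arithmetic for a 64 GB, 28-core 
node:

# ~62 GB usable / 28 cores is roughly 2.2 GB per core; round down to 2000 MB
NodeName=node001 CPUs=28 RealMemory=63000 …
PartitionName=batch Nodes=node001 DefMemPerCPU=2000 …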

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Sep 1, 2020, at 5:01 PM, Dana, Jason T.  wrote:

Spencer,
Thank you for your response!
It does appear that the memory allocation was the issue. When I specify 
--mem=1, I am able to queue jobs on a single node.
That being said, I was under the impression that the DefMemPerCPU, 
DefMemPerNode (what sbatch claims to default to), etc. values defaulted to 0 
which was interpreted as unlimited. I understood this to mean that the 
job/task, when not explicitly defining a memory request, had unlimited access 
to the memory resource. I’m assuming that’s incorrect? Is this possibly related 
to the scheduler configuration I have defined (making cores AND memory 
consumable resources):
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
Thank you again for the help!
Jason Dana
JHUAPL
REDD/RA2
Lead Systems Administrator/Software Engineer
jason.d...@jhuapl.edu
240-564-1045 (w)

Need Support from REDD?  You can enter a ticket using the new REDD Help Desk 
Portal (https://help.rcs.jhuapl.edu) if you have 
an active account or e-mail 
redd-h...@outermail.jhuapl.edu.

From: slurm-users  on behalf of Spencer 
Bliven 
Reply-To: Slurm User Community List 
Date: Tuesday, September 1, 2020 at 5:11 PM
To: Slurm User Community List 
Subject: [EXT] Re: [slurm-users] Question/Clarification: Batch array multiple 
tasks on nodes


Jason,

The array jobs are designed to behave like independent jobs (but are stored 
more efficiently internally to avoid straining the controller). So in principle 
slurm could schedule them one per node or multiple per node. The --nodes and 
--ntasks parameters apply to individual jobs in the array; thus setting 
--nodes=1 would definitely force jobs to run on different nodes.

The fact that they queue when forced to a single node is suspicious. Maybe you 
set up the partition as --exclusive? Or maybe jobs are requesting some other 
limited resource (e.g. if DefMemPerCPU is set to all the memory) preventing 
slurm from scheduling them simultaneously. If you're struggling with the array 
syntax, try just submitting two jobs to the same node and checking that you can 
get them to run simultaneously.

Best of luck,
-Spencer



On 1 September 2020 at 18:50:30, Dana, Jason T. 
(jason.d...@jhuapl.edu) wrote:
Hello,

I am new to Slurm and I am working on setting up a cluster. I am testing out 
running a batch execution using an array and am seeing only one task executed 
in the array per node. Even if I specify in the sbatch command that only one 
node should be used, it executes a single task on each of the available nodes 
in the partition. I was under the impression that it would continue to execute 
tasks until the resources on the node or for the user were at their limit. Am I 
missing something or have I misinterpreted how sbatch and/or the job scheduling 
should work?

Here is one of the commands I have run:

sbatch --array=0-15 --partition=htc-amd --wrap 'python3 -c "import time; 
print(\"working\"); time.sleep(5)"'

The htc-amd partition has 8 nodes and the results of this command are a single 
task being run on each node while the others are queued waiting for them to 
finish. As I mentioned before, if I specify --nodes=1 it will still execute a 
single task on every node in the partition. The only way I have gotten it to 
use on a single node was to use --nodelist, which worked but only to execute a 
single task and queued the rest. I have also tried specifying --ntasks and 
--ntasks-per-node. It appears to reserve resources, as I can cause it to hit 
the QOS core/cpu limit, but it does not affect the number of tasks executed on 
each node.

Thank you for any help you can offer!

Jason


Re: [slurm-users] Jobs getting StartTime 3 days in the future?

2020-08-31 Thread Renfro, Michael
One pending job in this partition should have a reason of “Resources”. That job 
has the highest priority, and if your job below would delay the 
highest-priority job’s start, it’ll get pushed back like you see here.
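
One way to spot it (the partition name comes from the job below; the format 
fields are just a suggestion):

squeue -p medium --state=PENDING --sort=-p -o "%.18i %.9Q %.10u %.12r %.20S"

The first pending job listed with reason "Resources" is the one everything 
else is being scheduled around.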

On Aug 31, 2020, at 12:13 PM, Holtgrewe, Manuel  
wrote:

Dear all,

I'm seeing some user's jobs getting a StartTime 3 days in the future although 
there are plenty of resources available in the partition (and the user is 
well below MaxTRESPU of the partition).

Attached is our slurm.conf and the dump of "sacctmgr list qos -P". I'd be 
grateful for any insight and happy to provide more information.

The scontrol show job output is as follows:

JobId=2902252 JobName=X
   UserId=X(X GroupId=X(X MCS_label=N/A
   Priority=796 Nice=0 Account=hpc-ag-kehr QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2020-08-31T16:34:16 EligibleTime=2020-08-31T16:34:16
   AccrueTime=2020-08-31T16:34:16
   StartTime=2020-09-03T12:43:58 EndTime=2020-09-04T12:42:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-31T19:11:13
   Partition=medium AllocNode:Sid=med0107:7749
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=112000M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=7000M MinTmpDiskNode=0
   Features=skylake DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=X
   StdErr=X
   StdIn=/dev/null
   StdOut=X
   Power=
   MailUser=(null) MailType=NONE


Best wishes,
Manuel

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de




Re: [slurm-users] Adding Users to Slurm's Database

2020-08-18 Thread Renfro, Michael
The PowerShell script I use to provision new users adds them to an Active 
Directory group for HPC, ssh-es to the management node to do the sacctmgr 
changes, and emails the user. Never had it fail, and I've looped over entire 
class sections in PowerShell. Granted, there are some inherent delays due to 
the non-sacctmgr tasks, but it's on the order of seconds per user rather than 
minutes.
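
If you do end up calling sacctmgr from the provisioning script, a minimal 
guard like this (sketch only; "jdoe" is a placeholder, account layout as in 
your command) avoids re-adding existing users and avoids prompts:

user="jdoe"
if [ -z "$(sacctmgr -n show user "$user")" ]; then
  sacctmgr -i add user "$user" Account=root DefaultAccount=root
fi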


From: slurm-users  on behalf of Jason 
Simms 
Sent: Tuesday, August 18, 2020 10:36 AM
To: Slurm User Community List 
Subject: [slurm-users] Adding Users to Slurm's Database

Hello everyone! We have a script that queries our LDAP server for any users 
that have an entitlement to use the cluster, and if they don't already have an 
account on the cluster, one is created for them. In addition, they need to be 
added to the Slurm database (in order to track usage, FairShare, etc.).

I've been doing this manually with a command like this:

sacctmgr add user  Account=root DefaultAccount=root

I would like to add that command to the user creation script, but I'm warned 
off by the Slurm docs that say never to call sacctmgr in a script/loop. I 
understand the reasons why doing so multiple times in rapid succession can be a 
bad idea. In our case, however, it would be rare to have more than one new user 
at a time (our script runs in 15-min. intervals). Is there really a concern in 
a case like ours?

How do you all handle adding users to Slurm's DB? Manually? Or, if not by 
script or some automated means...??

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632


Re: [slurm-users] scheduling issue

2020-08-14 Thread Renfro, Michael
We’ve run a similar setup since I moved to Slurm 3 years ago, with no issues. 
Could you share partition definitions from your slurm.conf?

When you see a bunch of jobs pending, which ones have a reason of “Resources”? 
Those should be the next ones to run, and ones with a reason of “Priority” are 
waiting for higher priority jobs to start (including the ones marked 
“Resources”). The only time I’ve seen nodes sit idle is when there’s an MPI job 
pending with “Resources”, and if any smaller jobs started, it would delay that 
job’s start.

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Aug 14, 2020, at 4:20 AM, Erik Eisold  wrote:

Our node topology is a bit special: almost all our nodes are in one
common partition, a subset of those nodes is then in another
partition, and this repeats once more. The only difference between the
partitions, apart from the nodes in them, is the maximum run time. The
reason I originally set it up this way was to ensure that users with
shorter jobs had a quicker response time and the whole cluster wouldn't
be clogged up with long-running jobs for days on end; also, I was new
to the whole cluster setup and to Slurm itself. I have attached a rough
visualization of this setup to this mail. There are 2 more totally
separate partitions that are not in this image.

My idea for a solution would be to move all nodes to one common
partition and use partition QOS to implement time and resource
restrictions, because I think the scheduler is not really meant to
handle the type of setup we chose in the beginning.


Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Renfro, Michael
I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or 
COREs= settings. Currently, they’re:

  NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15

and I’ve got 2 jobs currently running on each node that’s available.

So maybe:

  NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-10,11-21,22-32,33-43

would work?

> On Aug 7, 2020, at 12:40 PM, Jodie H. Sprouse  wrote:
> 
> 
> HI Tina,
> Thank you so much for looking at this.
> slurm 18.08.8
> 
> nvidia-smi topo -m
> !sysGPU0GPU1GPU2GPU3mlx5_0  CPU Affinity
> GPU0 X  NV2 NV2 NV2 NODE
> 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
> GPU1NV2  X  NV2 NV2 NODE
> 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
> GPU2NV2 NV2  X  NV2 SYS 
> 1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
> GPU3NV2 NV2 NV2  X  SYS 
> 1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
> mlx5_0  NODENODESYS SYS  X
> 
> I have tried in the gres.conf (without success; only 2 gpu jobs run per node; 
> no cpu jobs are currently running):
> NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
> NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
> NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
> NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]
> 
> I also tried your suggetions of 0-13, 14-27, and a combo.
> I still only get 2 jobs to run on gpus at a time. If I take off the “CPUs=“, 
> I do get 4 jobs running per node.
> 
> Jodie
> 
> 
> On Aug 7, 2020, at 12:18 PM, Tina Friedrich  
> wrote:
> 
> Hi Jodie,
> 
> what version of SLURM are you using? I'm pretty sure newer versions pick the 
> topology up automatically (although I'm on 18.08 so I can't verify that).
> 
> Is what you're wanting to do - basically - forcefully feed a 'wrong' 
> gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think I've 
> ever tried that!).
> 
> I have no idea, unfortunately, what CPU SLURM assigns first - it will not (I 
> don't think) assign cores on the non-GPU CPU first (other people please 
> correct me if I'm wrong!).
> 
> My gres.conf files get written by my config management from the GPU topology, 
> I don't think I've ever written one of them manually. And I've never tried to 
> make them anything wrong, i.e. I've never tried to deliberately give a
> 
> The GRES conf would probably need to look something like
> 
> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13
> 
> or maybe
> 
> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27
> 
> to 'assign' all GPUs to the first 14 CPUs or second 14 CPUs (your config 
> makes me think there are two 14 core CPUs, so cores 0-13 would probably be 
> CPU1 etc?)
> 
> (What is the actual topology of the system (according to, say 'nvidia-smi 
> topo -m')?)
> 
> Tina
> 
> On 07/08/2020 16:31, Jodie H. Sprouse wrote:
>> Tina,
>> Thank you. Yes, jobs will run on all 4 gpus if I submit with: 
>> --gres-flags=disable-binding
>> Yet my goal is to have the gpus bind to a cpu in order to allow a cpu-only  
>> job to never run on that particular cpu (having it  bound to the gpu and 
>> always free for a gpu job) and give the cpu job the maxcpus minus the 4.
>> 
>> * Hyperthreading is turned on.
>> NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 
>> CoresPerSocket=14 ThreadsPerCore=2 RealMemory=19
>> 
>> PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 
>> MaxTime=168:00:00 State=UP OverSubscribe=NO 
>> TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
>> PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 
>> MaxTime=168:00:00 State=UP OverSubscribe=NO 
>> TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
>> 
>> I have played tried variations for gres.conf such as:
>> NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
>> NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
>> 
>> as well as trying CORES= (rather than CPUSs) with NO success.
>> 
>> 
>> I’ve battled this all week. Any suggestions would be greatly appreciated!
>> Thanks for any suggestions!
>> 

Re: [slurm-users] Correct way to give srun and sbatch different MaxTime values?

2020-08-04 Thread Renfro, Michael
Untested, but you should be able to use a job_submit.lua file to detect if the 
job was started with srun or sbatch:

  *   Check with (job_desc.script == nil or job_desc.script == '')
  *   Adjust job_desc.time_limit accordingly

Here, I just gave people a shell function "hpcshell", which automatically drops 
them in a time-limited partition. Easier for them, fewer idle resources for 
everyone:

hpcshell ()
{
srun --partition=interactive $@ --pty $SHELL -i
}

From: slurm-users  on behalf of Jaekyeom 
Kim 
Sent: Tuesday, August 4, 2020 5:35 AM
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Correct way to give srun and sbatch different MaxTime 
values?


Hi,

I'd like to prevent my Slurm users from taking up resources with dummy shell 
process jobs left behind unintentionally or intentionally.
To that end, I simply want to put a tougher maximum time limit for srun only.
One possible way might be to wrap the srun binary.
But could someone tell me if there is any proper way to do it, please?

Best,
Jaekyeom



Re: [slurm-users] Internet connection loss with srun to a node

2020-08-02 Thread Renfro, Michael
Probably unrelated to slurm entirely, and most likely has to do with 
lower-level network diagnostics. I can guarantee that it’s possible to access 
Internet resources from a compute node. Notes and things to check:

1. Both ping and http/https are IP protocols, but are very different (ping 
isn’t even TCP or UDP, it’s ICMP), so even if you needed proxy variables for 
http and https to work, they shouldn’t affect ping.

2. Do http or https transfers work from a compute node? A github clone, a test 
with curl or wget to a nearby web server? Do your proxy variables exist on the 
compute node, and most importantly, is there a proxy server listening and 
functional on the host and port that the variables point to?

3. What’s the default gateway for your compute nodes? Does that gateway provide 
network address translation (NAT) for the nodes, or does it work as a 
traditional router?
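
A quick way to check most of that in one go from inside an allocation 
(assuming curl and iproute2 are available on the node):

srun -p gpu_part --gres=gpu:titanv:1 bash -lc \
  'echo "proxy: $http_proxy"; ip route show default; curl -sI https://www.google.com | head -n1'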


From: slurm-users  on behalf of Mahmood 
Naderan 
Sent: Sunday, August 2, 2020 7:52:52 AM
To: Slurm User Community List 
Subject: [slurm-users] Internet connection loss with srun to a node
 Hi
A frontend machine is connected to the internet and from that machine, I use 
srun to get a bash on another node. But it seems that the node is unable to 
access the internet. The http_proxy and https_proxy are defined in ~/.bashrc

mahmood@main-proxy:~$ ping google.com
PING google.com (216.58.215.238) 56(84) bytes of data.
64 bytes from zrh11s02-in-f14.1e100.net 
(216.58.215.238): icmp_seq=1 ttl=114 time=1.38 ms
^C
--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.384/1.384/1.384/0.000 ms
mahmood@main-proxy:~$ srun -p gpu_part --gres=gpu:titanv:1  --pty /bin/bash
mahmood  @fry0:~$ ping google.com
PING google.com (216.58.215.238) 56(84) bytes of data.
^C
--- google.com ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2026ms



I guess that is related to slurm and srun.
Any idea for that?







Regards,
Mahmood




Re: [slurm-users] slurm array with non-numeric index values

2020-07-15 Thread Renfro, Michael
If the 500 parameters happened to be filenames, you could adapt something like 
this (appropriated from somewhere else, but I can't find the reference quickly):

=

#!/bin/bash
 # get count of files in this directory
NUMFILES=$(ls -1 *.inp | wc -l)
# subtract 1 as we have to use zero-based indexing (first element is 0)
ZBNUMFILES=$(($NUMFILES - 1))
# submit array of jobs to SLURM
if [ $ZBNUMFILES -ge 0 ]; then
  sbatch --array=0-$ZBNUMFILES array_job.sh
else
  echo "No jobs to submit, since no input files in this directory.”
fi

=

with:

=

#!/bin/bash
#SBATCH --nodes=1  --ntasks-per-node=1 --cpus-per-task=1
#SBATCH --time=00:01:00
#SBATCH --job-name array_demo_2
 
echo "All jobs in this array have:"
echo "- SLURM_ARRAY_JOB_ID=${SLURM_ARRAY_JOB_ID}"
echo "- SLURM_ARRAY_TASK_COUNT=${SLURM_ARRAY_TASK_COUNT}"
echo "- SLURM_ARRAY_TASK_MIN=${SLURM_ARRAY_TASK_MIN}"
echo "- SLURM_ARRAY_TASK_MAX=${SLURM_ARRAY_TASK_MAX}"
 
echo "This job in the array has:"
echo "- SLURM_JOB_ID=${SLURM_JOB_ID}"
echo "- SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID}"

# grab our filename from a directory listing
FILES=($(ls -1 *.inp))
FILENAME=${FILES[$SLURM_ARRAY_TASK_ID]}
echo "My input file is ${FILENAME}”

# make new directory, change into it, and run
mkdir ${FILENAME}_out
cd ${FILENAME}_out
echo "First 10 lines of ../${FILENAME} are:" > ${FILENAME}_results.out
head ../${FILENAME} >> ${FILENAME}_results.out

=

If the 500 parameters were lines in a file, the same logic would apply:

- subtract 1 from the number of lines in the file to determine the array limit
- add 1 to ${SLURM_ARRAY_TASK_ID} to get a line number for a specific parameter
- something like "sed -n ‘${TASK_ID_PLUS_ONE}p’ filename” to retrieve that 
parameter
- run the Python script with that value
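
A minimal sketch of that file-based variant (untested; assumes a file named 
params.txt with one parameter per line):

# submit side
NUMLINES=$(wc -l < params.txt)
sbatch --array=0-$((NUMLINES - 1)) array_job.sh

# inside array_job.sh
PARAM=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" params.txt)
python3 script.py "$PARAM"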

> On Jul 15, 2020, at 3:13 PM, c b  wrote:
> 
> I'm trying to run an embarrassingly parallel experiment, with 500+ tasks that 
> all differ in one parameter.  e.g.:
> 
> job 1 - script.py foo
> job 2 - script.py bar
> job 3 - script.py baz
> and so on.
> 
> This seems like a case where having a slurm array hold all of these jobs 
> would help, so I could just submit one job to my cluster instead of 500 
> individual jobs.  It seems like sarray is only set up for varying an integer 
> index parameter.  How would i do this for non-numeric values (say, if the 
> parameter I'm varying is a string in a given list) ?
> 
> 



Re: [slurm-users] CPU allocation for the GPU jobs.

2020-07-13 Thread Renfro, Michael
“The SchedulerType configuration parameter specifies the scheduler plugin to 
use. Options are sched/backfill, which performs backfill scheduling, and 
sched/builtin, which attempts to schedule jobs in a strict priority order 
within each partition/queue.”

https://slurm.schedmd.com/sched_config.html

If you’re using the builtin scheduler, lower priority jobs have no way to run 
ahead of higher priority jobs. If you’re using the backfill scheduler, your 
jobs will need specific wall times specified, since the idea with backfill is 
to run lower priority jobs ahead of time if and only if they can complete 
without delaying the estimated start time of higher priority jobs.
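
For reference, a backfill setup is just the scheduler type plus optional 
tuning knobs in slurm.conf, e.g. (illustrative values):

SchedulerType=sched/backfill
SchedulerParameters=bf_window=4320,bf_continue,bf_max_job_test=1000

bf_window (in minutes) should be at least as long as your longest allowed wall 
time for the start-time estimates to make sense.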

On Jul 13, 2020, at 4:18 AM, navin srivastava  wrote:

Hi Team,

We have separate partitions for the GPU nodes and only CPU nodes .

Scenario: the jobs submitted in our environment request 4 CPUs + 1 GPU, or 4 
CPUs only, in nodeGPUsmall and nodeGPUbig. When all the GPUs are exhausted, the 
remaining jobs sit in the queue waiting for GPU resources to become available. 
A job submitted with only CPUs does not go through even though plenty of CPU 
resources are available; the CPU-only job also stays pending because of these 
GPU-based jobs (the priority of the GPU jobs is higher than that of the CPU 
ones).

Is there any option here, so that when all GPU resources are exhausted the CPU 
jobs are still allowed to run? Is there a way to deal with this, or some custom 
solution we could think of? There is no issue with the CPU-only partitions.

Below is the my slurm configuration file


NodeName=node[1-12] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 
RealMemory=128833 State=UNKNOWN
NodeName=node[13-16] NodeAddr=node[13-16] Sockets=2 CoresPerSocket=10 
RealMemory=515954 Feature=HIGHMEM State=UNKNOWN
NodeName=node[28-32]  NodeAddr=node[28-32] Sockets=2 CoresPerSocket=28 
RealMemory=257389
NodeName=node[32-33]  NodeAddr=node[32-33] Sockets=2 CoresPerSocket=24 
RealMemory=773418
NodeName=node[17-27]  NodeAddr=node[17-27] Sockets=2 CoresPerSocket=18 
RealMemory=257687 Feature=K2200 Gres=gpu:2
NodeName=node[34]  NodeAddr=node34 Sockets=2 CoresPerSocket=24 
RealMemory=773410 Feature=RTX Gres=gpu:8


PartitionName=node Nodes=node[1-10,14-16,28-33,35]  Default=YES 
MaxTime=INFINITE State=UP Shared=YES
PartitionName=nodeGPUsmall Nodes=node[17-27]  Default=NO MaxTime=INFINITE 
State=UP Shared=YES
PartitionName=nodeGPUbig Nodes=node[34]  Default=NO MaxTime=INFINITE State=UP 
Shared=YES

Regards
Navin.




Re: [slurm-users] runtime priority

2020-06-30 Thread Renfro, Michael
There’s a --nice flag to sbatch and srun, at least. Documentation indicates it 
decreases priority by 100 by default.

And untested, but it may be possible to use a job_submit.lua [1] to adjust nice 
values automatically. At least I can see a nice property in [2], which I assume 
means it'd be accessible as job_desc.nice in the Lua script.

[1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
[2] https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c

> On Jun 30, 2020, at 9:52 AM, Lawrence Stewart  wrote:
> 
> How does one configure the runtime priority of a job?  That is, how do you 
> set the CPU scheduling “nice” value?
> 
> We’re using Slurm to share a large (16 core 768 GB) server among FPGA 
> compilation jobs.  Slurm handles core and memory reservations just fine, but 
> runs everything nice -19, which makes for huge load averages and terrible 
> interactive performance.
> 
> Manually setting the compilation processes with “renice 19 ” works fine, 
> but is tedious.
> 
> -Larry
> 
> 



Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-15 Thread Renfro, Michael
So if a GPU job is submitted to a partition containing only GPU nodes, and a 
non-GPU job is submitted to a partition containing at least some nodes without 
GPUs, both jobs should be able to run. Priorities should be evaluated on a 
per-partition basis. I can 100% guarantee that in our HPC, pending GPU jobs 
don't block non-GPU jobs, and vice versa.

I could see a problem if the GPU job was submitted to a partition containing 
both types of nodes: if that job was assigned the highest priority for whatever 
reason (fair share, age, etc.), other jobs in the same partition would have to 
wait until that job started.

A simple solution would be to make a GPU partition containing only GPU nodes, 
and a non-GPU partition containing only non-GPU nodes. Submit GPU jobs to the 
GPU partition, and non-GPU jobs to the non-GPU partition.

Once that works, you could make a partition that includes both types of nodes 
to reduce idle resources, but jobs submitted to that partition would have to 
(a) not require a GPU, (b) require a limited number of CPUs per node, so that 
you'd have some CPUs available for GPU jobs on the nodes containing GPUs.


From: slurm-users  on behalf of navin 
srivastava 
Sent: Saturday, June 13, 2020 10:47 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs


Yes, we have separate partitions. Some are specific to GPUs, having 2 nodes 
with 8 GPUs each, and other partitions are a mix of both: nodes with 2 GPUs 
and a very few nodes without any GPU.

Regards
Navin


On Sat, Jun 13, 2020, 21:11 navin srivastava 
mailto:navin.alt...@gmail.com>> wrote:
Thanks Renfro.

Yes, we have both types of nodes, with and without GPUs.
Also, some users' jobs require a GPU and some applications use only CPUs.

So the issue happens when a high-priority job is waiting for GPU resources 
that are not available, and a lower-priority job that needs only CPU resources 
keeps waiting even though enough CPUs are available.

When I hold the GPU jobs, the CPU jobs go through.

Regards
Navin

On Sat, Jun 13, 2020, 20:37 Renfro, Michael 
mailto:ren...@tntech.edu>> wrote:
Will probably need more information to find a solution.

To start, do you have separate partitions for GPU and non-GPU jobs? Do you have 
nodes without GPUs?

On Jun 13, 2020, at 12:28 AM, navin srivastava 
mailto:navin.alt...@gmail.com>> wrote:

Hi All,

In our environment we have GPUs. What I found is that if a user with high 
priority has a job in the queue waiting for GPU resources, which are almost 
full and not available, then a job submitted by another user that does not 
require GPU resources stays in the queue even though lots of CPU resources are 
available.

Our scheduling mechanism is FIFO with Fair Tree enabled. Is there any way we 
can make some changes so that the CPU-based jobs go through and the GPU-based 
jobs wait until the GPU resources are free?

Regards
Navin.






Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-13 Thread Renfro, Michael
Will probably need more information to find a solution.

To start, do you have separate partitions for GPU and non-GPU jobs? Do you have 
nodes without GPUs?

On Jun 13, 2020, at 12:28 AM, navin srivastava  wrote:

Hi All,

In our environment we have GPUs. What I found is that if a user with high 
priority has a job in the queue waiting for GPU resources, which are almost 
full and not available, then a job submitted by another user that does not 
require GPU resources stays in the queue even though lots of CPU resources are 
available.

Our scheduling mechanism is FIFO with Fair Tree enabled. Is there any way we 
can make some changes so that the CPU-based jobs go through and the GPU-based 
jobs wait until the GPU resources are free?

Regards
Navin.






Re: [slurm-users] Fairshare per-partition?

2020-06-12 Thread Renfro, Michael
I think that’s correct. From notes I’ve got for how we want to handle our 
fairshare in the future:

Setting up a funded account (which can be assigned a fairshare):

sacctmgr add account member1 Description="Member1 Description" FairShare=N

Adding/removing a user to/from the funded account:

sacctmgr add user renfro account=member1 # for all partitions
sacctmgr add user renfro account=member1 partition=gpu # for 
partition-specific fairshare

Modifying funded account fairshare:

sacctmgr modify account member1 set FairShare=N

Modifying funded account fairshare on specific partitions (e.g., if the entity 
funded GPU nodes)

sacctmgr modify user renfro set FairShare=N where account=member1 
partition=gpu

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Jun 12, 2020, at 3:52 AM, Diego Zuccato  wrote:
> 
> Hello all.
> 
> Is it possible to configure Slurm so that fairshare calc on a partition
> does not impact calc on a different one?
> 
> We'd need to have different "priorities" on the "postprocessing" nodes
> than the ones on "parallel" nodes, so that even if an user already used
> up all his "quota" on "parallel" nodes but have never used
> "postprocessing", he'll have max prio when submitting jobs on
> "postprocessing".
> 
> IIUC, it should be the case if the user have multiple associations
> (specifying a different partition for each one). Am I right?
> 
> TIA.
> 
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
> 



Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
Spare capacity is critical. At our scale, the few dozen cores that were 
typically left idle in our GPU nodes handle the vast majority of interactive 
work.

> On Jun 11, 2020, at 8:38 AM, Paul Edmon  wrote:
> 
> 
> That's pretty slick.  We just have a test, gpu_test, and remotedesktop
> partition set up for those purposes.
> 
> The real trick is making sure you have sufficient spare capacity
> that you can deliberately idle for these purposes.  If we were a smaller
> shop with less hardware I wouldn't be able to set aside as much hardware
> for this.  If that was the case I would likely go the route of a single
> server with oversubscribe.
> 
> You could try to do it with an active partition with no deliberately
> idle resources, but then you will want to make sure that your small jobs
> are really small and won't impact larger work.  I don't necessarily
> recommend that.  A single node with oversubscribe should be sufficient.
> If you can't spare a single node then a VM would do the job.
> 
> -Paul Edmon-
> 
> On 6/11/2020 9:28 AM, Renfro, Michael wrote:
>> That’s close to what we’re doing, but without dedicated nodes. We have three 
>> back-end partitions (interactive, any-interactive, and gpu-interactive), but 
>> the users typically don’t have to consider that, due to our job_submit.lua 
>> plugin.
>> 
>> All three partitions have a default of 2 hours, 1 core, 2 GB RAM, but users 
>> could request more cores and RAM (but not as much as a batch job — we used 
>> https://hpcbios.readthedocs.io/en/latest/HPCBIOS_05-05.html as a starting 
>> point).
>> 
>> If a GPU is requested, the job goes into the gpu-interactive partition and 
>> is limited to 16 cores per node (we have 28 cores per GPU node, but GPU jobs 
>> can’t keep them all busy)
>> 
>> If less than 12 cores per node is requested, the job goes into the 
>> any-interactive partition and could be handled on any of our GPU or non-GPU 
>> nodes.
>> 
>> If more than 12 cores per node is requested, the job goes into the 
>> interactive partition and is handled by only a non-GPU node.
>> 
>> I haven’t needed to QOS the interactive partitions, but that’s not a bad 
>> idea.
>> 
>>> On Jun 11, 2020, at 8:19 AM, Paul Edmon  wrote:
>>> 
>>> Generally the way we've solved this is to set aside a specific set of
>>> nodes in a partition for interactive sessions.  We deliberately scale
>>> the size of the resources so that users will always run immediately and
>>> we also set a QoS on the partition to make it so that no one user can
>>> dominate the partition.
>>> 
>>> -Paul Edmon-
>>> 
>>> On 6/11/2020 8:49 AM, Loris Bennett wrote:
>>>> Hi Manual,
>>>> 
>>>> "Holtgrewe, Manuel"  writes:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> is there a way to make interactive logins where users will use almost no 
>>>>> resources "always succeed"?
>>>>> 
>>>>> In most of these interactive sessions, users will have mostly idle shells 
>>>>> running and do some batch job submissions. Is there a way to allocate 
>>>>> "infinite virtual cpus" on each node that can only be allocated to
>>>>> interactive jobs?
>>>> I have never done this but setting "OverSubscribe" in the appropriate
>>>> place might be what you are looking for.
>>>> 
>>>>   https://slurm.schedmd.com/cons_res_share.html
>>>> 
>>>> Personally, however, I would be a bit wary of doing this.  What if
>>>> someone does start a multithreaded process on purpose or by accident?
>>>> 
>>>> Wouldn't just using cgroups on your login node achieve what you want?
>>>> 
>>>> Cheers,
>>>> 
>>>> Loris
>>>> 
> 



Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
That’s close to what we’re doing, but without dedicated nodes. We have three 
back-end partitions (interactive, any-interactive, and gpu-interactive), but 
the users typically don’t have to consider that, due to our job_submit.lua 
plugin.

All three partitions have a default of 2 hours, 1 core, 2 GB RAM, but users 
could request more cores and RAM (but not as much as a batch job — we used 
https://hpcbios.readthedocs.io/en/latest/HPCBIOS_05-05.html as a starting 
point).

If a GPU is requested, the job goes into the gpu-interactive partition and is 
limited to 16 cores per node (we have 28 cores per GPU node, but GPU jobs can’t 
keep them all busy)

If less than 12 cores per node is requested, the job goes into the 
any-interactive partition and could be handled on any of our GPU or non-GPU 
nodes.

If more than 12 cores per node is requested, the job goes into the interactive 
partition and is handled by only a non-GPU node.

I haven’t needed to QOS the interactive partitions, but that’s not a bad idea.
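
If we ever did add one, a partition QOS would probably look something like this 
(the QOS name and limits here are made up for illustration, not something we 
actually run):

  sacctmgr add qos interactive
  sacctmgr modify qos interactive set MaxJobsPerUser=2 MaxTRESPerUser=cpu=12

and each interactive partition line in slurm.conf would get a QOS=interactive 
parameter so the limits are enforced per partition.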

> On Jun 11, 2020, at 8:19 AM, Paul Edmon  wrote:
> 
> Generally the way we've solved this is to set aside a specific set of
> nodes in a partition for interactive sessions.  We deliberately scale
> the size of the resources so that users will always run immediately and
> we also set a QoS on the partition to make it so that no one user can
> dominate the partition.
> 
> -Paul Edmon-
> 
> On 6/11/2020 8:49 AM, Loris Bennett wrote:
>> Hi Manual,
>> 
>> "Holtgrewe, Manuel"  writes:
>> 
>>> Hi,
>>> 
>>> is there a way to make interactive logins where users will use almost no 
>>> resources "always succeed"?
>>> 
>>> In most of these interactive sessions, users will have mostly idle shells 
>>> running and do some batch job submissions. Is there a way to allocate 
>>> "infinite virtual cpus" on each node that can only be allocated to
>>> interactive jobs?
>> I have never done this but setting "OverSubscribe" in the appropriate
>> place might be what you are looking for.
>> 
>>   https://slurm.schedmd.com/cons_res_share.html
>> 
>> Personally, however, I would be a bit wary of doing this.  What if
>> someone does start a multithreaded process on purpose or by accident?
>> 
>> Wouldn't just using cgroups on your login node achieve what you want?
>> 
>> Cheers,
>> 
>> Loris
>> 
> 



Re: [slurm-users] Slurm Job Count Credit system

2020-06-01 Thread Renfro, Michael
Even without the slurm-bank system, you can enforce a limit on resources with a 
QOS applied to those users. Something like:

=

sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
sacctmgr modify qos bank1 set grptresmins=cpu=1000

sacctmgr add account bank1
sacctmgr modify account name=bank1 set qos+=bank1

sacctmgr add user someuser account=bank1
sacctmgr modify user someuser set qos+=bank1

=

You can do lots with a QOS, including limiting the number of simultaneous 
running jobs, simultaneous running/queued jobs, etc. Unfortunately, the NoDecay 
flag is only documented to work on GrpTRESMins, GrpWall, and UsageRaw, not on 
the job count.
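
For the simultaneous-job limits, the relevant QOS options would be along these 
lines (the numbers are just examples):

  sacctmgr modify qos bank1 set MaxJobsPerUser=5 MaxSubmitJobsPerUser=20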

So if you can live with limiting the number of simultaneous jobs instead of a 
total number of jobs per time period, that’s possible with QOS. Otherwise, 
maybe someone else will have an idea.

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On May 31, 2020, at 11:35 AM, Songpon Srisawai  
> wrote:
> 
> Hello all,
> 
> I’m Slurm beginner who try to implement our cluster. I would like to know 
> whether there are any Slurm credit/token system plugin such as the number of 
> job count.
> 
> I found Slurm-bank that deposit hour to an account. But, I would like to 
> deposit the jobs token instead of hours.
> 
> Thanks for any recommendation
> Songpon 



Re: [slurm-users] Ubuntu Cluster with Slurm

2020-05-13 Thread Renfro, Michael
I’d compare the RealMemory part of ’scontrol show node 
abhi-HP-EliteBook-840-G2’ to the RealMemory part of your slurm.conf:

> Nodes which register to the system with less than the configured resources 
> (e.g. too little memory), will be placed in the "DOWN" state to avoid 
> scheduling jobs on them.

— https://slurm.schedmd.com/slurm.conf.html
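
Concretely, something like this would show what the node actually registered, 
and return it to service once slurm.conf is corrected (node name taken from 
your message below):

  scontrol show node abhi-HP-EliteBook-840-G2 | grep -o 'RealMemory=[0-9]*'
  scontrol update NodeName=abhi-HP-EliteBook-840-G2 State=RESUME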

As far as GPUs go, it looks like you have Intel graphics on the Lenovo and a 
Radeon R7 on the HP? If so, then nothing is CUDA-compatible, but you might be 
able to make something work with OpenCL. No idea if that would give performance 
improvements over the CPUs, though.

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On May 13, 2020, at 8:42 AM, Abhinandan Patil 
>  wrote:
> 
> Dear All,
> 
> Preamble
> --
> I want to form simple cluster with three laptops:
> abhi-Latitude-E6430  //This serves as the controller
> abhi-Lenovo-ideapad-330-15IKB //Compute Node
> abhi-HP-EliteBook-840-G2 //Compute Node
> 
> 
> Aim
> -
> I want to make use of CPU+GPU+RAM on all the machines when I execute JAVA 
> programs or Python programs.
> 
> 
> Implementation
> 
> Now let us look at the slurm.conf
> 
> On Machine abhi-Latitude-E6430
> 
> ClusterName=linux
> ControlMachine=abhi-Latitude-E6430
> SlurmUser=abhi
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> SwitchType=switch/none
> StateSaveLocation=/tmp
> MpiDefault=none
> ProctrackType=proctrack/pgid
> NodeName=abhi-Lenovo-ideapad-330-15IKB RealMemory=12000 CPUs=2
> NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
> 
> Same slurm.conf is copied to all the Machines.
> 
> 
> Observations
> --
> Now when I do
> abhi@abhi-HP-EliteBook-840-G2:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>  Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor 
> preset: enabled)
>  Active: active (running) since Wed 2020-05-13 18:50:01 IST; 1min 49s ago
>Docs: man:slurmd(8)
> Process: 98235 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, 
> status=0/SUCCESS)
>Main PID: 98253 (slurmd)
>   Tasks: 2
>  Memory: 2.2M
>  CGroup: /system.slice/slurmd.service
>  └─98253 /usr/sbin/slurmd
> 
> abhi@abhi-Lenovo-ideapad-330-15IKB:~$ service slurmd status
> ● slurmd.service - Slurm node daemon
>  Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor 
> preset: enabled)
>  Active: active (running) since Wed 2020-05-13 18:50:20 IST; 8s ago
>Docs: man:slurmd(8)
> Process: 71709 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, 
> status=0/SUCCESS)
>Main PID: 71734 (slurmd)
>   Tasks: 2
>  Memory: 2.0M
>  CGroup: /system.slice/slurmd.service
>  └─71734 /usr/sbin/slurmd
> 
> abhi@abhi-Latitude-E6430:~$ service slurmctld status 
> ● slurmctld.service - Slurm controller daemon
>  Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor 
> preset: enabled)
>  Active: active (running) since Wed 2020-05-13 18:48:58 IST; 4min 56s ago
>Docs: man:slurmctld(8)
> Process: 97114 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS 
> (code=exited, status=0/SUCCESS)
>Main PID: 97116 (slurmctld)
>   Tasks: 7
>  Memory: 2.6M
>  CGroup: /system.slice/slurmctld.service
>  └─97116 /usr/sbin/slurmctld
> 
>  
> However  abhi@abhi-Latitude-E6430:~$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  1  down* abhi-Lenovo-ideapad-330-15IKB
> 
> 
> Advice needed
> 
> Please let me know Why I am seeing only one node. 
> Further how the total memory is calculated? Can Slurm make use of GPU 
> processing power as well
> Please let me know if I have missed something in configuration or explanation.
> 
> Thank you all
> 
> Best Regards,
> Abhinandan H. Patil, +919886406214
> https://www.AbhinandanHPatil.info
> 
> 



Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-09 Thread Renfro, Michael
Still observing, but it looks like clearing out the runaway jobs followed by 
restarting slurmdbd got my user up to 986 CPU-days remaining out of their 
allowed 1000. Not certain the runaways were related, but it definitely started 
behaving better after a late afternoon/early evening slurmdbd restart.
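
For anyone who hits this later, the cleanup was roughly:

  sacctmgr show runawayjobs   # lists orphaned jobs and offers to fix them
  systemctl restart slurmdbd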

Thanks.

> On May 8, 2020, at 11:47 AM, Renfro, Michael  wrote:
> 
> Working on something like that now. From an SQL export, I see 16 jobs from 
> my user that have a state of 7. Both states 3 and 7 show up as COMPLETED in 
> sacct, and may also have some duplicate job entries found via sacct 
> --duplicates.
> 
>> On May 8, 2020, at 11:34 AM, Ole Holm Nielsen  
>> wrote:
>> 
>> Hi Michael,
>> 
>> You can inquire the database for a job summary of a particular user and
>> time period using the slurmacct command:
>> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct
>> 
>> You can also call "sacct --user=USER" directly like in slurmacct:
>> 
>> # Request job data
>> export
>> FORMAT="JobID,User${ulen},Group${glen},Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,CPUTimeRAW,State"
>> # Request job states: Cancelled, Completed, Failed, Timeout, Preempted
>> export STATE="ca,cd,f,to,pr"
>> # Get Slurm individual job accounting records using the "sacct" command
>> sacct $partitionselect -n -X -a -S $start_time -E $end_time -o $FORMAT
>> -s $STATE
>> 
>> There are numerous output fields which you can inquire, see "sacct -e".
>> 
>> /Ole
>> 
>> 
>>>> On 08-05-2020 16:54, Renfro, Michael wrote:
>>> Slurm 19.05.3 (packaged by Bright). For the three running jobs, the
>>> total GrpTRESRunMins requested is 564480 CPU-minutes as shown by
>>> 'showjob', and their remaining usage that the limit would check against
>>> is less than that.
>>> 
>>> My download of your scripts dated to August 21, 2019, and I've just now
>>> done a clone of your repository to see if there were any differences.
>>> None that I see -- 'showuserlimits -u USER -A ACCOUNT -s cpu' returns
>>> "Limit = 144, current value = 1399895".
>>> 
>>> So I assume there's something lingering in the database from some jobs
>>> that already completed, but still get counted against the user's current
>>> requests.
>>> 
>>> 
>>> *From:* Ole Holm Nielsen 
>>> *Sent:* Friday, May 8, 2020 9:27 AM
>>> *To:* slurm-users@lists.schedmd.com 
>>> *Cc:* Renfro, Michael 
>>> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more
>>> resources in use than squeue
>>> Hi Michael,
>>> 
>>> Yes, my Slurm tools use and trust the output of Slurm commands such as
>>> sacct, and any discrepancy would have to come from the Slurm database.
>>> Which version of Slurm are you running on the database server and the
>>> node where you run sacct?
>>> 
>>> Did you add up the GrpTRESRunMins values of all the user's running jobs?
>>>  They had better add up to current value = 1402415.  The "showjob"
>>> command prints #CPUs and time limit in minutes, so you need to multiply
>>> these numbers together.  Example:
>>> 
>>> This job requests 160 CPUs and has a time limit of 2-00:00:00
>>> (days-hh:mm:ss) = 2880 min.
>>> 
>>> Did you download the latest versions of my Slurm tools from Github?  I
>>> make improvements of them from time to time.
>>> 
>>> /Ole
>>> 
>>> 
>>>> On 08-05-2020 16:12, Renfro, Michael wrote:
>>>> Thanks, Ole. Your showuserlimits script is actually where I got started
>>>> today, and where I found the sacct command I sent earlier.
>>>> 
>>>> Your script gives the same output for that user: the only line that's
>>>> not a "Limit = None" is for the user's GrpTRESRunMins value, which is
>>>> at "Limit = 144, current value = 1402415".
>>>> 
>>>> The limit value is correct, but the current value is not (due to the
>>>> incorrect sacct output).
>>>> 
>>>> I've also gone through sacctmgr show runaway to clean up any runaway
>>>> jobs. I had lots, but they were all from a different user, and had no
>>>> effect on this particular user's values.
>>>> 
>>>> ------------

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Working on something like that now. From an SQL export, I see 16 jobs from my 
user that have a state of 7. Both states 3 and 7 show up as COMPLETED in sacct, 
and may also have some duplicate job entries found via sacct --duplicates.
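
For reference, that check was roughly (USER and the start date are 
placeholders):

  sacct -u USER -D -S 2020-01-01 -o JobID,State,Start,End,CPUTimeRAW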

> On May 8, 2020, at 11:34 AM, Ole Holm Nielsen  
> wrote:
> 
> Hi Michael,
> 
> You can inquire the database for a job summary of a particular user and
> time period using the slurmacct command:
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct
> 
> You can also call "sacct --user=USER" directly like in slurmacct:
> 
> # Request job data
> export
> FORMAT="JobID,User${ulen},Group${glen},Partition,AllocNodes,AllocCPUS,Submit,Eligible,Start,End,CPUTimeRAW,State"
> # Request job states: Cancelled, Completed, Failed, Timeout, Preempted
> export STATE="ca,cd,f,to,pr"
> # Get Slurm individual job accounting records using the "sacct" command
> sacct $partitionselect -n -X -a -S $start_time -E $end_time -o $FORMAT
> -s $STATE
> 
> There are numerous output fields which you can inquire, see "sacct -e".
> 
> /Ole
> 
> 
>> On 08-05-2020 16:54, Renfro, Michael wrote:
>> Slurm 19.05.3 (packaged by Bright). For the three running jobs, the
>> total GrpTRESRunMins requested is 564480 CPU-minutes as shown by
>> 'showjob', and their remaining usage that the limit would check against
>> is less than that.
>> 
>> My download of your scripts dated to August 21, 2019, and I've just now
>> done a clone of your repository to see if there were any differences.
>> None that I see -- 'showuserlimits -u USER -A ACCOUNT -s cpu' returns
>> "Limit = 144, current value = 1399895".
>> 
>> So I assume there's something lingering in the database from some jobs
>> that already completed, but still get counted against the user's current
>> requests.
>> 
>> 
>> *From:* Ole Holm Nielsen 
>> *Sent:* Friday, May 8, 2020 9:27 AM
>> *To:* slurm-users@lists.schedmd.com 
>> *Cc:* Renfro, Michael 
>> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more
>> resources in use than squeue
>> Hi Michael,
>> 
>> Yes, my Slurm tools use and trust the output of Slurm commands such as
>> sacct, and any discrepancy would have to come from the Slurm database.
>> Which version of Slurm are you running on the database server and the
>> node where you run sacct?
>> 
>> Did you add up the GrpTRESRunMins values of all the user's running jobs?
>>   They had better add up to current value = 1402415.  The "showjob"
>> command prints #CPUs and time limit in minutes, so you need to multiply
>> these numbers together.  Example:
>> 
>> This job requests 160 CPUs and has a time limit of 2-00:00:00
>> (days-hh:mm:ss) = 2880 min.
>> 
>> Did you download the latest versions of my Slurm tools from Github?  I
>> make improvements of them from time to time.
>> 
>> /Ole
>> 
>> 
>>> On 08-05-2020 16:12, Renfro, Michael wrote:
>>> Thanks, Ole. Your showuserlimits script is actually where I got started
>>> today, and where I found the sacct command I sent earlier.
>>> 
>>> Your script gives the same output for that user: the only line that's
>>> not a "Limit = None" is for the user's GrpTRESRunMins value, which is
>>> at "Limit = 144, current value = 1402415".
>>> 
>>> The limit value is correct, but the current value is not (due to the
>>> incorrect sacct output).
>>> 
>>> I've also gone through sacctmgr show runaway to clean up any runaway
>>> jobs. I had lots, but they were all from a different user, and had no
>>> effect on this particular user's values.
>>> 
>>> 
>>> *From:* slurm-users  on behalf of
>>> Ole Holm Nielsen 
>>> *Sent:* Friday, May 8, 2020 8:54 AM
>>> *To:* slurm-users@lists.schedmd.com 
>>> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more
>>> resources in use than squeue
>>> 
>>> Hi Michael,
>>> 
>>> Maybe you will find a couple of my Slurm tools useful for displaying
>>> data from the Slurm database in a more user-friendly format:
>>> 
>>> showjob: Show status of Slurm job(s). Both queue information and
>>> accounting information is printed.
>>> 
>>> showuserlimits: Print Slurm resource user limits and usage
>>> 

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Slurm 19.05.3 (packaged by Bright). For the three running jobs, the total 
GrpTRESRunMins requested is 564480 CPU-minutes as shown by 'showjob', and their 
remaining usage that the limit would check against is less than that.

My download of your scripts dated to August 21, 2019, and I've just now done a 
clone of your repository to see if there were any differences. None that I see 
-- 'showuserlimits -u USER -A ACCOUNT -s cpu' returns "Limit = 144, current 
value = 1399895".

So I assume there's something lingering in the database from some jobs that 
already completed, but still get counted against the user's current requests.


From: Ole Holm Nielsen 
Sent: Friday, May 8, 2020 9:27 AM
To: slurm-users@lists.schedmd.com 
Cc: Renfro, Michael 
Subject: Re: [slurm-users] scontrol show assoc_mgr showing more resources in 
use than squeue

Hi Michael,

Yes, my Slurm tools use and trust the output of Slurm commands such as
sacct, and any discrepancy would have to come from the Slurm database.
Which version of Slurm are you running on the database server and the
node where you run sacct?

Did you add up the GrpTRESRunMins values of all the user's running jobs?
  They had better add up to current value = 1402415.  The "showjob"
command prints #CPUs and time limit in minutes, so you need to multiply
these numbers together.  Example:

This job requests 160 CPUs and has a time limit of 2-00:00:00
(days-hh:mm:ss) = 2880 min.

Did you download the latest versions of my Slurm tools from Github?  I
make improvements of them from time to time.

/Ole


On 08-05-2020 16:12, Renfro, Michael wrote:
> Thanks, Ole. Your showuserlimits script is actually where I got started
> today, and where I found the sacct command I sent earlier.
>
> Your script gives the same output for that user: the only line that's
> not a "Limit = None" is for the user's GrpTRESRunMins value, which is
> at "Limit = 144, current value = 1402415".
>
> The limit value is correct, but the current value is not (due to the
> incorrect sacct output).
>
> I've also gone through sacctmgr show runaway to clean up any runaway
> jobs. I had lots, but they were all from a different user, and had no
> effect on this particular user's values.
>
> 
> *From:* slurm-users  on behalf of
> Ole Holm Nielsen 
> *Sent:* Friday, May 8, 2020 8:54 AM
> *To:* slurm-users@lists.schedmd.com 
> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more
> resources in use than squeue
>
> Hi Michael,
>
> Maybe you will find a couple of my Slurm tools useful for displaying
> data from the Slurm database in a more user-friendly format:
>
> showjob: Show status of Slurm job(s). Both queue information and
> accounting information is printed.
>
> showuserlimits: Print Slurm resource user limits and usage
>
> The user's limits are printed in detail by showuserlimits.
>
> These tools are available from https://github.com/OleHolmNielsen/Slurm_tools
>
> /Ole
>
> On 08-05-2020 15:34, Renfro, Michael wrote:
>> Hey, folks. I've had a 1000 CPU-day (144 CPU-minutes) GrpTRESMins
>> limit applied to each user for years. It generally works as intended,
>> but I have one user I've noticed whose usage is highly inflated from
>> reality, causing the GrpTRESMins limit to be enforced much earlier than
>> necessary:
>>
>> squeue output, showing roughly 340 CPU-days in running jobs, and all
>> other jobs blocked:
>>
>> # squeue -u USER
>> JOBID  PARTI   NAME USER ST TIME CPUS NODES
>> NODELIST(REASON) PRIORITY TRES_P START_TIME   TIME_LEFT
>> 747436 batchjob USER PD 0:00 28   1
>> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
>> 747437 batchjob USER PD 0:00 28   1
>> (AssocGrpCPURunM 4784 N/AN/A  4-04:00:00
>> 747438 batchjob USER PD 0:00 28   1
>> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
>> 747439 batchjob USER PD 0:00 28   1
>> (AssocGrpCPURunM 4784 N/AN/A  4-04:00:00
>> 747440 batchjob USER PD 0:00 28   1
>> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
>> 747441 batchjob USER PD 0:00 28   1
>> (AssocGrpCPURunM 4784 N/AN/A  4-14:00:00
>> 747442 batchjob USER PD 0:00 28   1
>> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
>> 747446 batchjob USER PD 0:00 14   1
>> (AssocGrpCPURunM 4778 N/AN/A  4-00:00:00

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Thanks, Ole. Your showuserlimits script is actually where I got started today, 
and where I found the sacct command I sent earlier.

Your script gives the same output for that user: the only line that's not a 
"Limit = None" is for the user's GrpTRESRunMins value, which is at "Limit = 
144, current value = 1402415".

The limit value is correct, but the current value is not (due to the incorrect 
sacct output).

I've also gone through sacctmgr show runaway to clean up any runaway jobs. I 
had lots, but they were all from a different user, and had no effect on this 
particular user's values.


From: slurm-users  on behalf of Ole Holm 
Nielsen 
Sent: Friday, May 8, 2020 8:54 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] scontrol show assoc_mgr showing more resources in 
use than squeue

Hi Michael,

Maybe you will find a couple of my Slurm tools useful for displaying
data from the Slurm database in a more user-friendly format:

showjob: Show status of Slurm job(s). Both queue information and
accounting information is printed.

showuserlimits: Print Slurm resource user limits and usage

The user's limits are printed in detail by showuserlimits.

These tools are available from https://github.com/OleHolmNielsen/Slurm_tools

/Ole

On 08-05-2020 15:34, Renfro, Michael wrote:
> Hey, folks. I've had a 1000 CPU-day (144 CPU-minutes) GrpTRESMins
> limit applied to each user for years. It generally works as intended,
> but I have one user I've noticed whose usage is highly inflated from
> reality, causing the GrpTRESMins limit to be enforced much earlier than
> necessary:
>
> squeue output, showing roughly 340 CPU-days in running jobs, and all
> other jobs blocked:
>
> # squeue -u USER
> JOBID  PARTI   NAME USER ST TIME CPUS NODES
> NODELIST(REASON) PRIORITY TRES_P START_TIME   TIME_LEFT
> 747436 batchjob USER PD 0:00 28   1
> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
> 747437 batchjob USER PD 0:00 28   1
> (AssocGrpCPURunM 4784 N/AN/A  4-04:00:00
> 747438 batchjob USER PD 0:00 28   1
> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
> 747439 batchjob USER PD 0:00 28   1
> (AssocGrpCPURunM 4784 N/AN/A  4-04:00:00
> 747440 batchjob USER PD 0:00 28   1
> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
> 747441 batchjob USER PD 0:00 28   1
> (AssocGrpCPURunM 4784 N/AN/A  4-14:00:00
> 747442 batchjob USER PD 0:00 28   1
> (AssocGrpCPURunM 4784 N/AN/A  10-00:00:00
> 747446 batchjob USER PD 0:00 14   1
> (AssocGrpCPURunM 4778 N/AN/A  4-00:00:00
> 747447 batchjob USER PD 0:00 14   1
> (AssocGrpCPURunM 4778 N/AN/A  4-00:00:00
> 747448 batchjob USER PD 0:00 14   1
> (AssocGrpCPURunM 4778 N/AN/A  4-00:00:00
> 747445 batchjob USER  R  8:39:17 14   1 node002
>   4778 N/A2020-05-07T23:02:19  3-15:20:43
> 747444 batchjob USER  R 16:03:13 14   1 node003
>   4515 N/A2020-05-07T15:38:23  3-07:56:47
> 747435 batchjob USER  R   1-10:07:42 28   1 node005
>   3784 N/A2020-05-06T21:33:54  8-13:52:18
>
> scontrol output, showing roughly 980 CPU-days in use on the second line,
> and thus blocking additional jobs:
>
> # scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc
> ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21
> SharesRaw/Norm/Level/Factor=1/0.03/35/0.00
> UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9)
> Lft=1197 DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10)
> GrpSubmitJobs=N(14) GrpWall=N(616142.94)
> GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=
> MaxTRESMinsPJ= MinPrioThresh=
> ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0
> ID=56 SharesRaw/Norm/Level/Factor=1/0.08/13/0.00
> UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218
> DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13)
> GrpWall=N(227625.69)
> GrpT

[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Hey, folks. I’ve had a 1000 CPU-day (1,440,000 CPU-minute) GrpTRESRunMins limit 
applied to each user for years. It generally works as intended, but I have one 
user I’ve noticed whose usage is highly inflated from reality, causing the 
GrpTRESRunMins limit to be enforced much earlier than necessary:

squeue output, showing roughly 340 CPU-days in running jobs, and all other jobs 
blocked:

# squeue -u USER
JOBID  PARTI   NAME USER ST TIME CPUS NODES NODELIST(REASON) 
PRIORITY TRES_P START_TIME   TIME_LEFT
747436 batchjob USER PD 0:00 28   1 (AssocGrpCPURunM 
4784 N/AN/A  10-00:00:00
747437 batchjob USER PD 0:00 28   1 (AssocGrpCPURunM 
4784 N/AN/A  4-04:00:00
747438 batchjob USER PD 0:00 28   1 (AssocGrpCPURunM 
4784 N/AN/A  10-00:00:00
747439 batchjob USER PD 0:00 28   1 (AssocGrpCPURunM 
4784 N/AN/A  4-04:00:00
747440 batchjob USER PD 0:00 28   1 (AssocGrpCPURunM 
4784 N/AN/A  10-00:00:00
747441 batchjob USER PD 0:00 28   1 (AssocGrpCPURunM 
4784 N/AN/A  4-14:00:00
747442 batchjob USER PD 0:00 28   1 (AssocGrpCPURunM 
4784 N/AN/A  10-00:00:00
747446 batchjob USER PD 0:00 14   1 (AssocGrpCPURunM 
4778 N/AN/A  4-00:00:00
747447 batchjob USER PD 0:00 14   1 (AssocGrpCPURunM 
4778 N/AN/A  4-00:00:00
747448 batchjob USER PD 0:00 14   1 (AssocGrpCPURunM 
4778 N/AN/A  4-00:00:00
747445 batchjob USER  R  8:39:17 14   1 node002  
4778 N/A2020-05-07T23:02:19  3-15:20:43
747444 batchjob USER  R 16:03:13 14   1 node003  
4515 N/A2020-05-07T15:38:23  3-07:56:47
747435 batchjob USER  R   1-10:07:42 28   1 node005  
3784 N/A2020-05-06T21:33:54  8-13:52:18

scontrol output, showing roughly 980 CPU-days in use on the second line, and 
thus blocking additional jobs:

# scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc
ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21 
SharesRaw/Norm/Level/Factor=1/0.03/35/0.00 
UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9) Lft=1197 
DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10) GrpSubmitJobs=N(14) 
GrpWall=N(616142.94) 
GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
 
GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
 
GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
 MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= 
MaxTRESMinsPJ= MinPrioThresh=
ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0 ID=56 
SharesRaw/Norm/Level/Factor=1/0.08/13/0.00 
UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218 DefAssoc=Yes 
GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13) GrpWall=N(227625.69) 
GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0)
 
GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
 
GrpTRESRunMins=cpu=144(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
 MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= 
MaxTRESMinsPJ= MinPrioThresh=

Where can I investigate to find the cause of this difference? Thanks.


--

Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services

931 372-3601  / Tennessee Tech University


Re: [slurm-users] Defining a default --nodes=1

2020-05-08 Thread Renfro, Michael
There are MinNodes and MaxNodes settings that can be defined for each partition 
listed in slurm.conf [1]. Set both to 1 and you should end up with the non-MPI 
partitions you want.

[1] https://slurm.schedmd.com/slurm.conf.html
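
As a sketch (the partition and node names here are placeholders):

  PartitionName=smp Nodes=node[001-032] MinNodes=1 MaxNodes=1 Default=YES
  PartitionName=mpi Nodes=node[001-032] MaxNodes=32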



From: slurm-users  on behalf of 
Holtgrewe, Manuel 
Sent: Friday, May 8, 2020 4:26 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Defining a default --nodes=1

Dear all,

we're running a cluster where the large majority of jobs will use 
multi-threading and no message passing. Sometimes CPU>1 jobs are scheduled to 
run on more than one node (which would be fine for MPI jobs of course...)

Is it possible to automatically set "--nodes=1" for all jobs outside of the 
"mpi" partition (that we setup for message passing jobs)?

Thank you,
Manuel

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de


Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
Ok, then regular license accounting won’t work.

Somewhat tested, but should work or at least be a starting point. Given a job 
number JOBID that’s already running with this license on one or more nodes:

  sbatch -w $(scontrol show job JOBID | grep ' NodeList=' | cut -d= -f2) -N 1

should start a one-node job on an available node being used by JOBID. Add other 
parameters as required for cpus-per-task, time limits, or whatever else is 
needed. If you start the larger jobs first, and let the later jobs fill in on 
idle CPUs on those nodes, it should work.

> On May 6, 2020, at 9:46 AM, navin srivastava  wrote:
> 
> To explain with more details.
> 
> job will be submitted based on core at any time but it will go to any random 
> nodes but limited to 4 Nodes only.(license having some intelligence that it 
> calculate the nodes and if it reached to 4 then it will not allow any more 
> nodes. yes it didn't depend on the no of core available on nodes.
> 
> Case-1 if 4 jobs running with 4 cores each on 4 nodes [node1, node2, node3 
> and node4]
>  Again Fifth job assigned by SLURM with 4 cores on any one node 
> of node1, node2, node3 and node4 then license will be allowed.
>  
> Case-2 if 4 jobs running with 4 cores each on 4 nodes [node1, node2, node3 
> and node4]
>  Again Fifth job assigned by SLURM on node5 with 4 cores  then 
> license will not allowed [ license not found error came in this case]
> 
> Regards
> Navin.
> 
> 
> On Wed, May 6, 2020 at 7:47 PM Renfro, Michael  wrote:
> To make sure I’m reading this correctly, you have a software license that 
> lets you run jobs on up to 4 nodes at once, regardless of how many CPUs you 
> use? That is, you could run any one of the following sets of jobs:
> 
> - four 1-node jobs,
> - two 2-node jobs,
> - one 1-node and one 3-node job,
> - two 1-node and one 2-node jobs,
> - one 4-node job,
> 
> simultaneously? And the license isn’t node-locked to specific nodes by MAC 
> address or anything similar? But if you try to run jobs beyond what I’ve 
> listed above, you run out of licenses, and you want those later jobs to be 
> held until licenses are freed up?
> 
> If all of those questions have an answer of ‘yes’, I think you want the 
> remote license part of the https://slurm.schedmd.com/licenses.html, something 
> like:
> 
>   sacctmgr add resource name=software_name count=4 percentallowed=100 
> server=flex_host servertype=flexlm type=license
> 
> and submit jobs with a '-L software_name:N’ flag where N is the number of 
> nodes you want to run on.
> 
> > On May 6, 2020, at 5:33 AM, navin srivastava  wrote:
> > 
> > Thanks Micheal.
> > 
> > Actually one application license are based on node and we have 4 Node 
> > license( not a fix node). we have several nodes but when job lands on any 4 
> > random nodes it runs on those nodes only. After that it fails if it goes to 
> > other nodes.
> > 
> > can we define a custom variable and set it on the node level and when user 
> > submit it will pass that variable and then job will and onto those specific 
> > nodes?
> > i do not want to create a separate partition. 
> > 
> > is there any way to achieve this by any other method?
> > 
> > Regards
> > Navin.
> > 
> > 
> > Regards
> > Navin.
> > 
> > On Tue, May 5, 2020 at 7:46 PM Renfro, Michael  wrote:
> > Haven’t done it yet myself, but it’s on my todo list.
> > 
> > But I’d assume that if you use the FlexLM or RLM parts of that 
> > documentation, that Slurm would query the remote license server 
> > periodically and hold the job until the necessary licenses were available.
> > 
> > > On May 5, 2020, at 8:37 AM, navin srivastava  
> > > wrote:
> > > 
> > > External Email Warning
> > > This email originated from outside the university. Please use caution 
> > > when opening attachments, clicking links, or responding to requests.
> > > Thanks Michael,
> > > 
> > > yes i have gone through but the licenses are remote license and it will 
> > > be used by outside as well not only in slurm.
> > > so basically i am interested to know how we can update the database 
> > > dynamically to get the exact value at that point of time.
> > > i mean query the license server and update the database accordingly. does 
> > > slurm automatically updated the value based on usage?
> > > 
> > > 
> > > Regards
> > > Navin.
> > > 
> > > 
> > > On Tue, May 5, 2020 at 7:00 PM Renfro, Michael  wrote:
> > > Have you seen https://slurm

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
To make sure I’m reading this correctly, you have a software license that lets 
you run jobs on up to 4 nodes at once, regardless of how many CPUs you use? 
That is, you could run any one of the following sets of jobs:

- four 1-node jobs,
- two 2-node jobs,
- one 1-node and one 3-node job,
- two 1-node and one 2-node jobs,
- one 4-node job,

simultaneously? And the license isn’t node-locked to specific nodes by MAC 
address or anything similar? But if you try to run jobs beyond what I’ve listed 
above, you run out of licenses, and you want those later jobs to be held until 
licenses are freed up?

If all of those questions have an answer of ‘yes’, I think you want the remote 
license part of https://slurm.schedmd.com/licenses.html, something like:

  sacctmgr add resource name=software_name count=4 percentallowed=100 
server=flex_host servertype=flexlm type=license

and submit jobs with a '-L software_name:N’ flag where N is the number of nodes 
you want to run on.

> On May 6, 2020, at 5:33 AM, navin srivastava  wrote:
> 
> Thanks Micheal.
> 
> Actually one application license are based on node and we have 4 Node 
> license( not a fix node). we have several nodes but when job lands on any 4 
> random nodes it runs on those nodes only. After that it fails if it goes to 
> other nodes.
> 
> can we define a custom variable and set it on the node level and when user 
> submit it will pass that variable and then job will and onto those specific 
> nodes?
> i do not want to create a separate partition. 
> 
> is there any way to achieve this by any other method?
> 
> Regards
> Navin.
> 
> 
> Regards
> Navin.
> 
> On Tue, May 5, 2020 at 7:46 PM Renfro, Michael  wrote:
> Haven’t done it yet myself, but it’s on my todo list.
> 
> But I’d assume that if you use the FlexLM or RLM parts of that documentation, 
> that Slurm would query the remote license server periodically and hold the 
> job until the necessary licenses were available.
> 
> > On May 5, 2020, at 8:37 AM, navin srivastava  wrote:
> > 
> > External Email Warning
> > This email originated from outside the university. Please use caution when 
> > opening attachments, clicking links, or responding to requests.
> > Thanks Michael,
> > 
> > yes i have gone through but the licenses are remote license and it will be 
> > used by outside as well not only in slurm.
> > so basically i am interested to know how we can update the database 
> > dynamically to get the exact value at that point of time.
> > i mean query the license server and update the database accordingly. does 
> > slurm automatically updated the value based on usage?
> > 
> > 
> > Regards
> > Navin.
> > 
> > 
> > On Tue, May 5, 2020 at 7:00 PM Renfro, Michael  wrote:
> > Have you seen https://slurm.schedmd.com/licenses.html already? If the 
> > software is just for use inside the cluster, one Licenses= line in 
> > slurm.conf plus users submitting with the -L flag should suffice. Should be 
> > able to set that license value is 4 if it’s licensed per node and you can 
> > run up to 4 jobs simultaneously, or 4*NCPUS if it’s licensed per CPU, or 1 
> > if it’s a single license good for one run from 1-4 nodes.
> > 
> > There are also options to query a FlexLM or RLM server for license 
> > management.
> > 
> > -- 
> > Mike Renfro, PhD / HPC Systems Administrator, Information Technology 
> > Services
> > 931 372-3601 / Tennessee Tech University
> > 
> > > On May 5, 2020, at 7:54 AM, navin srivastava  
> > > wrote:
> > > 
> > > Hi Team,
> > > 
> > > we have an application whose licenses is limited .it scales upto 4 
> > > nodes(~80 cores).
> > > so if 4 nodes are full, in 5th node job used to get fail.
> > > we want to put a restriction so that the application can't go for the 
> > > execution beyond the 4 nodes and fail it should be in queue state.
> > > i do not want to keep a separate partition to achieve this config.is 
> > > there a way to achieve this scenario using some dynamic resource which 
> > > can call the license variable on the fly and if it is reached it should 
> > > keep the job in queue.
> > > 
> > > Regards
> > > Navin.
> > > 
> > > 
> > > 
> > 
> 



Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Renfro, Michael
Aside from any Slurm configuration, I’d recommend setting up a modules [1 or 2] 
folder structure for CUDA and other third-party software. That handles 
LD_LIBRARY_PATH and other similar variables, reduces the chances for library 
conflicts, and lets users decide their environment on a per-job basis. Ours 
includes a basic Miniconda installation, and the users can make their own 
environments from there [3]. I very rarely install a system-wide Python module.

[1] http://modules.sourceforge.net
[2] https://lmod.readthedocs.io/
[3] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+Jupyter+Notebook
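
In practice a CUDA modulefile mostly just does the equivalent of the following 
(paths correspond to the cuda-10.2 install mentioned below, and may differ on 
your system):

  export PATH=/usr/local/cuda-10.2/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda-10.2/targets/x86_64-linux/lib:$LD_LIBRARY_PATH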

> On May 5, 2020, at 9:37 AM, Lisa Kay Weihl  wrote:
> 
> Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my 
> home directory.  That enabled me to find out that it could not find the path 
> for batchspawner-singleuser. 
> 
> 
> So I added this to jupyter_config.py
> export PATH=/opt/rh/rh-python36/root/bin:$PATH
> 
> 
> That seemed to now allow the server to launch for my user that I use for all 
> the configuration work. I get errors (see below) but the notebook loads. The 
> problem is I'm not sure how to kill the job in the Slurm queue or the 
> notebook server if I finish before the job times out and kills it. Logout 
> doesn't seem to do it.
> 
> It still doesn't work for a regular user (see below)
> 
> I think my problems all have to do with Slurm/jupyterhub finding python. So I 
> have some questions about the best way to set it up for multiple users and 
> make it work for this.
> 
> I use CentOS distribution so that if the university admins will ever have to 
> take over it will match their RedHat setups they use. I know on all Linux 
> distros you need to leave the python 2 system install alone. It looks like as 
> of CentOS 7.7 there is now a python3 in the repository. I didn't go that 
> route because in the past I installed the python from RedHat Software 
> Collection which is what I did this time.
> I don't know if that's the best route for this use case. They also say don't 
> sudo pip3 to try to install global packages but does that mean sudo to root 
> and then using pip3 is okay?
> 
> When I test and faculty don't give me code I go to the web and try to find 
> examples. I know I also wanted to try to test the GPUs from within the 
> notebook. I have 2 examples:
> 
> Example 1 uses these modules:
> import numpy as np
> import xgboost as xgb
> from sklearn import datasets
> from sklearn.model_selection import train_test_split
> from sklearn.datasets import dump_svmlight_file
> from sklearn.externals import joblib
> from sklearn.metrics import precision_score
> 
> It gives error: cannot load library 
> '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': 
> libcudart.so.9.2: cannot open shared object file: No such file or directory
> 
> libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib
> 
> Does this mean I need LD_LIBRARY_PATH  set also? Cuda was installed with 
> typical NVIDIA instructions using their repo.
> 
> Example 2 uses these modules:
> import numpy as np
> from numba import vectorize
> 
> And gives error:  NvvmSupportError: libNVVM cannot be found. Do `conda 
> install cudatoolkit`:
> library nvvm not found
> 
> I don't have conda installed. Will that interfere with pip3?
> 
> Part II - using jupyterhub with regular user gives different error
> 
> I'm assuming this is a python path issue?
> 
>  File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in 
> 
> __import__('pkg_resources').require('batchspawner==1.0.0rc0')
> and later
> pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution 
> was not found and is required by the application
> 
> Thanks again for any help especially if you can help clear up python 
> configuration.
> 
> 
> ***
> Lisa Weihl Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116   |Fax: (419) 372-8061
> lwe...@bgsu.edu
> www.bgsu.edu​
> 
> From: slurm-users  on behalf of 
> slurm-users-requ...@lists.schedmd.com 
> Sent: Tuesday, May 5, 2020 4:59 AM
> To: slurm-users@lists.schedmd.com 
> Subject: [EXTERNAL] slurm-users Digest, Vol 31, Issue 8
>  

Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
Haven’t done it yet myself, but it’s on my todo list.

But I’d assume that if you use the FlexLM or RLM parts of that documentation, 
that Slurm would query the remote license server periodically and hold the job 
until the necessary licenses were available.

> On May 5, 2020, at 8:37 AM, navin srivastava  wrote:
> 
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Thanks Michael,
> 
> yes i have gone through but the licenses are remote license and it will be 
> used by outside as well not only in slurm.
> so basically i am interested to know how we can update the database 
> dynamically to get the exact value at that point of time.
> i mean query the license server and update the database accordingly. does 
> slurm automatically updated the value based on usage?
> 
> 
> Regards
> Navin.
> 
> 
> On Tue, May 5, 2020 at 7:00 PM Renfro, Michael  wrote:
> Have you seen https://slurm.schedmd.com/licenses.html already? If the 
> software is just for use inside the cluster, one Licenses= line in slurm.conf 
> plus users submitting with the -L flag should suffice. Should be able to set 
> that license value is 4 if it’s licensed per node and you can run up to 4 
> jobs simultaneously, or 4*NCPUS if it’s licensed per CPU, or 1 if it’s a 
> single license good for one run from 1-4 nodes.
> 
> There are also options to query a FlexLM or RLM server for license management.
> 
> -- 
> Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
> 931 372-3601 / Tennessee Tech University
> 
> > On May 5, 2020, at 7:54 AM, navin srivastava  wrote:
> > 
> > Hi Team,
> > 
> > we have an application whose licenses is limited .it scales upto 4 
> > nodes(~80 cores).
> > so if 4 nodes are full, in 5th node job used to get fail.
> > we want to put a restriction so that the application can't go for the 
> > execution beyond the 4 nodes and fail it should be in queue state.
> > i do not want to keep a separate partition to achieve this config.is there 
> > a way to achieve this scenario using some dynamic resource which can call 
> > the license variable on the fly and if it is reached it should keep the job 
> > in queue.
> > 
> > Regards
> > Navin.
> > 
> > 
> > 
> 



Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
Have you seen https://slurm.schedmd.com/licenses.html already? If the software 
is just for use inside the cluster, one Licenses= line in slurm.conf plus users 
submitting with the -L flag should suffice. You should be able to set that 
license value to 4 if it’s licensed per node and you can run up to 4 jobs 
simultaneously, to 4*NCPUS if it’s licensed per CPU, or to 1 if it’s a single 
license good for one run on 1-4 nodes.
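
As a minimal sketch (the license name is a placeholder):

  # in slurm.conf
  Licenses=software_name:4

  # at submission time
  sbatch -L software_name:1 job.sh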

There are also options to query a FlexLM or RLM server for license management.

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On May 5, 2020, at 7:54 AM, navin srivastava  wrote:
> 
> Hi Team,
> 
> we have an application whose licenses is limited .it scales upto 4 nodes(~80 
> cores).
> so if 4 nodes are full, in 5th node job used to get fail.
> we want to put a restriction so that the application can't go for the 
> execution beyond the 4 nodes and fail it should be in queue state.
> i do not want to keep a separate partition to achieve this config.is there a 
> way to achieve this scenario using some dynamic resource which can call the 
> license variable on the fly and if it is reached it should keep the job in 
> queue.
> 
> Regards
> Navin.
> 
> 
> 



Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-04 Thread Renfro, Michael
Assuming you need a scheduler for whatever size your user population is: do 
they need literal JupyterHub, or would they all be satisfied running regular 
Jupyter notebooks?

On May 4, 2020, at 7:25 PM, Lisa Kay Weihl  wrote:



External Email Warning

This email originated from outside the university. Please use caution when 
opening attachments, clicking links, or responding to requests.


I have a single server with 2 CPUs, 384 GB of memory, and 4 GPUs (GeForce RTX 2080 Ti).

Use is to be for GPU ML computing and python based data science.

One faculty wants jupyter notebooks, other faculty member is used to using CUDA 
for GPU but has only done it on a workstation in his lab with a GUI.  New 
faculty member coming in has used nvidia-docker container for GPU (I think on a 
large cluster, we are just getting started)

I'm charged with making all this work and hopefully all at once. Right now I'll 
take one thing working.

So I managed to get Slurm-20.02.1 installed with CUDA-10.2 on CentOS 7 (SE 
Linux enabled). I posted once before about having trouble getting that 
combination correct and I finally worked that out. Most of the tests in the 
test suite seem to run okay. I'm trying to start with very basic Slurm 
configuration so I haven't enabled accounting.

For reference here is my slurm.conf


# slurm.conf file generated by configurator easy.html.

# Put this file on all nodes of your cluster.

# See the slurm.conf man page for more information.

#

SlurmctldHost=cs-host


#authentication

AuthType=auth/munge

CacheGroups = 0

CryptoType=crypto/munge


#Add GPU support

GresTypes=gpu


#

#MailProg=/bin/mail

MpiDefault=none

#MpiParams=ports=#-#


#service

ProctrackType=proctrack/cgroup

ReturnToService=1

SlurmctldPidFile=/var/run/slurmctld.pid

#SlurmctldPort=6817

SlurmdPidFile=/var/run/slurmd.pid

#SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurmd

SlurmUser=slurm

#SlurmdUser=root

StateSaveLocation=/var/spool/slurmctld

SwitchType=switch/none

TaskPlugin=task/affinity

#

#

# TIMERS

#KillWait=30

#MinJobAge=300

#SlurmctldTimeout=120

SlurmdTimeout=1800

#

#

# SCHEDULING

SchedulerType=sched/backfill

SelectType=select/cons_tres

SelectTypeParameters=CR_Core_Memory

PriorityType=priority/multifactor

PriorityDecayHalfLife=3-0

PriorityMaxAge=7-0

PriorityFavorSmall=YES

PriorityWeightAge=1000

PriorityWeightFairshare=0

PriorityWeightJobSize=125

PriorityWeightPartition=1000

PriorityWeightQOS=0

#

#

# LOGGING AND ACCOUNTING

AccountingStorageType=accounting_storage/none

ClusterName=cs-host

#JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurmctld.log

#SlurmdDebug=info

SlurmdLogFile=/var/log/slurmd.log

#

#

# COMPUTE NODES

NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 
ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4


#PARTITIONS

PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE 
State=UP

PartitionName=faculty  Priority=10 Default=YES


I have jupyterhub running as part of RedHat SCL. It works fine with no 
integration with Slurm. Now I'm trying to use batchspawner to start a server 
for the user.  Right now I'm just trying one configuration from within the 
jupyterhub_config.py and trying to keep it simple (see below).

When I connect I get this error:
500: Internal Server Error
Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has 
disappeared while pending in the queue or died immediately after starting.

In the jupyterhub.log:


[I 2020-05-04 19:47:58.604 JupyterHub base:707] User logged in: csadmin

[I 2020-05-04 19:47:58.606 JupyterHub log:174] 302 POST /hub/login?next= -> 
/hub/spawn (csadmin@127.0.0.1) 227.13ms

[I 2020-05-04 19:47:58.748 JupyterHub batchspawner:248] Spawner submitting job 
using sudo -E -u csadmin sbatch --parsable

[I 2020-05-04 19:47:58.749 JupyterHub batchspawner:249] Spawner submitted 
script:

#!/bin/bash

#SBATCH --partition=faculty

#SBATCH --time=8:00:00

#SBATCH --output=/home/csadmin/jupyterhub_slurmspawner_%j.log

#SBATCH --job-name=jupyterhub-spawner

#SBATCH --cpus-per-task=1

#SBATCH --chdir=/home/csadmin

#SBATCH --uid=csadmin



env

which jupyterhub-singleuser

batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0



[I 2020-05-04 19:47:58.831 JupyterHub batchspawner:252] Job submitted. cmd: 
sudo -E -u csadmin sbatch --parsable output: 7117

[W 2020-05-04 19:47:59.481 JupyterHub batchspawner:377] Job  neither pending 
nor running.



[E 2020-05-04 19:47:59.482 JupyterHub user:640] Unhandled error starting 
csadmin's server: The Jupyter batch job has disappeared while pending in the 
queue or died immediately after starting.

[W 2020-05-04 19:47:59.518 JupyterHub web:1782] 500 GET /hub/spawn (127.0.0.1): 
Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has 
disappeared while pending in the queue or 

Re: [slurm-users] one job at a time - how to set?

2020-04-30 Thread Renfro, Michael
You can adjust or enforce almost anything about a job with a job_submit.lua [1 
(search for “JobSubmitPlugins”), 2].

Assuming you want this node in a single partition, you can set ExclusiveUser in 
a partition definition in slurm.conf. That would at least keep other users off 
the node, but wouldn’t prevent a single user from running multiple jobs on the 
node.

Past that, you can force a QOS on the partition [3], and use that to set limits 
on how many jobs a user can have running [4]. That might be just a MaxJobs=1 
for the QOS.

[1] https://slurm.schedmd.com/archive/slurm-15.08.13/slurm.conf.html
[2] 
https://github.com/SchedMD/slurm/blob/slurm-15-08-13-1/contribs/lua/job_submit.lua
[3] https://slurm.schedmd.com/archive/slurm-15.08.13/qos.html
[4] https://slurm.schedmd.com/archive/slurm-15.08.13/resource_limits.html
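
As a rough sketch of the QOS route (the QOS name is made up, and you should 
double-check which of these options exist on a release as old as 15.08):

  sacctmgr add qos onejob
  sacctmgr modify qos onejob set GrpJobs=1   # one running job across the whole QOS

MaxJobsPerUser=1 would cap each user instead; either way, attach the QOS to the 
partition as described in [3] and [4].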

> On Apr 29, 2020, at 3:19 PM, Rutger Vos  wrote:
> 
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Hi Michael,
> 
> thanks very much for your swift reply. So here we would have to convince the 
> users they'd have to specify this when submitting, right? I.e. 'sbatch 
> --exclusive myjob.sh', if I understand correctly. Would there be a way to 
> simply enforce this, i.e. at the slurm.conf level or something?
> 
> Thanks again!
> 
> Rutger
> 
> On Wed, Apr 29, 2020 at 10:06 PM Renfro, Michael  wrote:
> That’s a *really* old version, but 
> https://slurm.schedmd.com/archive/slurm-15.08.13/sbatch.html indicates 
> there’s an exclusive flag you can set.
> 
>> On Apr 29, 2020, at 1:54 PM, Rutger Vos  wrote:
>> .
>> Hi,
>> 
>> for a smallish machine that has been having degraded performance we want to 
>> implement a policy where only one job (submitted with sbatch) is allowed to 
>> run and any others submitted after it are supposed to wait in line.
>> 
>> I assumed this was straightforward but I can't seem to figure it out. Can I 
>> set that up in slurm.conf or in some other way? Thank you very much for your 
>> help. BTW we are running slurm 15.08.7 if that is at all relevant.
>> 
>> Best wishes,
>> 
>> Dr. Rutger A. Vos
>> Researcher / Bioinformatician
>> 
>> +31717519600 - +31627085806
>> rutger@naturalis.nl - www.naturalis.nl
>> Darwinweg 2, 2333 CR Leiden
>> Postbus 9517, 2300 RA Leiden
>> 
> 
> 
> -- 
> 
> Met vriendelijke groet,
> 
> Dr. Rutger A. Vos
> Researcher / Bioinformatician
> 
> +31717519600 - +31627085806
> rutger@naturalis.nl - www.naturalis.nl
> Darwinweg 2, 2333 CR Leiden
> Postbus 9517, 2300 RA Leiden
> 



Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Renfro, Michael
That’s a *really* old version, but 
https://slurm.schedmd.com/archive/slurm-15.08.13/sbatch.html indicates there’s 
an exclusive flag you can set.

On Apr 29, 2020, at 1:54 PM, Rutger Vos  wrote:

.

Hi,

for a smallish machine that has been having degraded performance we want to 
implement a policy where only one job (submitted with sbatch) is allowed to run 
and any others submitted after it are supposed to wait in line.

I assumed this was straightforward but I can't seem to figure it out. Can I set 
that up in slurm.conf or in some other way? Thank you very much for your help. 
BTW we are running slurm 15.08.7 if that is at all relevant.

Best wishes,

Dr. Rutger A. Vos
Researcher / Bioinformatician
+31717519600 - +31627085806
rutger@naturalis.nl - 
www.naturalis.nl
Darwinweg 2, 2333 CR Leiden
Postbus 9517, 2300 RA Leiden



Re: [slurm-users] One node is not used by slurm

2020-04-19 Thread Renfro, Michael
Someone else might see more than I do, but from what you’ve posted, it’s clear 
that compute-0-0 will be used only after other lower-weighted nodes are too 
full to accept a particular job.

I assume you’ve already submitted a set of jobs requesting enough resources to 
fill up all the nodes, and that some jobs stay in a pending state instead of 
using compute-0-0, which sits idle?
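For reference, Slurm prefers nodes with lower Weight values, so compute-0-1 (Weight=20511899) will be chosen ahead of compute-0-0 (Weight=20511900) whenever it still has room. If you want compute-0-0 to be considered equally, one option (a sketch only, reusing the addresses from the config quoted below) is to give both nodes the same weight and then re-read the configuration:

  NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs
  NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs

  # after editing, push the change out (or restart slurmctld):
  scontrol reconfigure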

> On Apr 19, 2020, at 1:10 PM, Mahmood Naderan  wrote:
> 
> Hi,
> Although compute-0-0 is included in a partition, I have noticed that
> no job is offloaded there automatically. If someone intentionally
> write --nodelist=compute-0-0 it will be fine.
> 
> # grep -r compute-0-0 .
> ./nodenames.conf.new:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32
> Weight=20511900 Feature=rack-0,32CPUs
> ./node.conf:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32
> Weight=20511900 Feature=rack-0,32CPUs
> ./nodenames.conf.new4:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32
> Weight=20511900 Feature=rack-0,32CPUs
> # grep -r compute-0-1 .
> ./nodenames.conf.new:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32
> Weight=20511899 Feature=rack-0,32CPUs
> ./node.conf:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32
> Weight=20511899 Feature=rack-0,32CPUs
> ./nodenames.conf.new4:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32
> Weight=20511899 Feature=rack-0,32CPUs
> # cat parts
> PartitionName=WHEEL RootOnly=yes Priority=1000 Nodes=ALL
> PartitionName=SEA AllowAccounts=fish Nodes=ALL
> # scontrol show node compute-0-0
> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
>   CPUAlloc=0 CPUTot=32 CPULoad=0.01
>   AvailableFeatures=rack-0,32CPUs
>   ActiveFeatures=rack-0,32CPUs
>   Gres=(null)
>   NodeAddr=10.1.1.254 NodeHostName=compute-0-0
>   OS=Linux 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019
>   RealMemory=64259 AllocMem=0 FreeMem=63421 Sockets=32 Boards=1
>   State=IDLE ThreadsPerCore=1 TmpDisk=444124 Weight=20511900
> Owner=N/A MCS_label=N/A
>   Partitions=CLUSTER,WHEEL,SEA
>   BootTime=2020-04-18T10:30:07 SlurmdStartTime=2020-04-19T22:32:12
>   CfgTRES=cpu=32,mem=64259M,billing=47
>   AllocTRES=
>   CapWatts=n/a
>   CurrentWatts=0 AveWatts=0
>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> # squeue
> JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
>   436   SEA  relax13   raz  R   21:44:22  3
> compute-0-[1-2],hpc
>   435   SEA 261660mo abb  R 1-05:19:31  3
> compute-0-[1-2],hpc
> 
> Compute-0-0 is idle. So, why slurm decided to put those jobs on other nodes?
> Any idea for debugging?
> 
> 
> Regards,
> Mahmood
> 



Re: [slurm-users] [EXTERNAL] Follow-up-slurm-users Digest, Vol 30, Issue 32

2020-04-17 Thread Renfro, Michael
Can’t speak for everyone, but I went to Slurm 19.05 some months back, and 
haven't had any problems with CUDA 10.0 or 10.1 (or 8.0, 9.0, or 9.1).

> On Apr 17, 2020, at 8:46 AM, Lisa Kay Weihl  wrote:
> 
> Wow. I did not catch that version issue. I saw that there were issues with 
> the newest Slurm and how CUDA 10+ installs so I avoided that even though we 
> have CUDA 8. I did have Slurm 19 downloaded so I'm thinking I ran into an 
> issue with that and went back to 18 but now that I have more experience 
> setting it up I'll wipe the 18 install and start over. Fingers crossed for 
> success!
> 
> Thanks for your help!
> 
> --
> Lisa Weihl
> Systems Administrator, Computer Science
> Bowling Green State University
> Tel: (419) 372-0116   |Fax: (419) 372-8061
> lwe...@bgsu.edu
> www.bgsu.edu
> 
> -Original Message-
> From: slurm-users  On Behalf Of 
> slurm-users-requ...@lists.schedmd.com
> Sent: Thursday, April 16, 2020 6:39 PM
> To: slurm-users@lists.schedmd.com
> Subject: [EXTERNAL] slurm-users Digest, Vol 30, Issue 32
> 
> Send slurm-users mailing list submissions to
>slurm-users@lists.schedmd.com
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>
> https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
> or, via email, send a message with subject or body 'help' to
>slurm-users-requ...@lists.schedmd.com
> 
> You can reach the person managing the list at
>slurm-users-ow...@lists.schedmd.com
> 
> When replying, please edit your Subject line so it is more specific than "Re: 
> Contents of slurm-users digest..."
> 
> 
> Today's Topics:
> 
>   1. CentOS 7 CUDA 8.0 can't find plugin cons_tres (Lisa Kay Weihl)
>   2. Re: [EXTERNAL] CentOS 7 CUDA 8.0 can't find plugin cons_tres
>  (Sean Crosby)
> 
> 
> --
> 
> Message: 1
> Date: Thu, 16 Apr 2020 19:00:03 +
> From: Lisa Kay Weihl 
> To: "slurm-users@lists.schedmd.com" 
> Subject: [slurm-users] CentOS 7 CUDA 8.0 can't find plugin cons_tres
> 
> I have a standalone server with 4 GeForce RTX 2080 Ti. The purpose is to 
> serve as a compute server for data science jobs. My department chair wants a 
> job scheduler on it. I have installed SLURM (18.08.9). That works just fine 
> in a basic configuration when I attempt to add Gres_Types gpu and then add 
> Gres:gpu:4 to the end of the node description:
> 
> 
> NodeName=cs-datasci CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 
> ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
> 
> and then try to restart slurmd I get an error that it cannot find the plugin
> 
> slurmd: error: Couldn't find the specified plugin name for select/cons_tres 
> looking at all files
> 
> slurmd: error: cannot find select plugin for select/cons_tres
> 
> slurmd: fatal: Can't find plugin for select/cons_tres
> 
> The system was prebuilt by AdvancedHPC with CentOS 7 and CUDA 8.0
> 
> I usually keep notes when I'm installing things but in this case I wasn't 
> jotting things down as I went. I think I started with the instructions on 
> this page: 
> https://slurm.schedmd.com/quickstart_admin.html
>  and went with the usual ./configure, make, make install.
> 
> I have a feeling maybe something did not work and I switched to the rpm 
> packages based on some other web pages I saw because if I do a yum list 
> installed | grep slurm I see a lot of pacakages. The problem is I was 
> interrupted with other tasks and my memory was somewhat rusty when I came 
> back to this.
> 
> When I went looking for this error I saw there were some issues with the 
> newest SLURM and CUDA 10.2 but I didn't think that should be an issue because 
> I was at CUDA 8.0.  Just in case I backed down to SLURM 18.
> 
> I'm willing to start all over if anyone thinks cleaning up and rebuilding 
> will help that. I do see libraries in /etc/lib64/slurm but I also see 2 files 
> in /usr/local/lib/slurm/src so I'm not sure if that's left over from trying 
> to install from source.  All the daemons are in /usr/sbin and user commands 
> in /usr/bin
> 
> I'm a newbie at this and very frustrated. Can anyone help?
> 
> 

Re: [slurm-users] Need to calculate total runtime/walltime for one year

2020-04-11 Thread Renfro, Michael
Unless I’m misreading it, you have a wall time limit of 2 days, and jobs that 
use up to 32 CPUs. So a total CPU time of up to 64 CPU-days would be possible 
for a single job.

So if you want total wall time for jobs instead of CPU time, then you’ll want 
to use the Elapsed attribute, not CPUTime.
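For example, keeping the same filters as the command quoted below but adding the Elapsed field (a sketch only, not run against your accounting database):

  sacct --format=user,ncpus,state,elapsed,cputime --starttime=04/01/19 --endtime=03/31/20 | grep mithunr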

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Apr 11, 2020, at 10:05 AM, Sudeep Narayan Banerjee  
wrote:

Hi,

I want to calculate the total walltime or runtime for all jobs submitted by 
each user in a year. I am using the syntax as below and it is also generating 
some output.

We have walltime set for the queues (main & main_new) as 48 hrs only, but the below 
is giving me hours ranging from 15 hrs to 56 hours or even more. Am I missing 
something from a logical/analytical point of view, or is the syntax not correct 
with respect to the desired information? Many thanks for any suggestion.

[root@hpc ~]# sacct  --format=user,ncpus,state,CPUTime --starttime=04/01/19 
--endtime=03/31/20 |  grep mithunr
  mithunr 32  COMPLETED 15-10:34:40
  mithunr 16  COMPLETED   00:02:56
  mithunr 16  COMPLETED   02:22:40
  mithunr 16  COMPLETED   00:00:48
  mithunr 16  COMPLETED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:48
  mithunr 16 FAILED   00:00:32
  mithunr 16 FAILED   00:00:32
  mithunr  0 CANCELLED+   00:00:00
  mithunr 16 FAILED   00:00:32
  mithunr 32  COMPLETED   00:02:08
  mithunr  0 CANCELLED+   00:00:00
  mithunr 32  COMPLETED   00:01:36
  mithunr 16 FAILED   00:00:48
  mithunr 32  COMPLETED 33-02:58:08
  mithunr 32  COMPLETED 56-01:23:12
...
..
..


--
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA


Re: [slurm-users] Job are pending when plenty of resources available

2020-03-30 Thread Renfro, Michael
All of this is subject to scheduler configuration, but: what has job 409978 
requested, in terms of resources and time? It looks like it's the highest 
priority pending job in the interactive partition, and I’d expect the 
interactive partition has a higher priority than the regress partition.

As for job 40, it’s requesting 8 cores and 32 GB of RAM for an infinite 
amount of time, not 1 core and 1 GB of RAM.

*If* job 409978 has requested an large amount of time on the entire cluster, 
*and* you don’t have backfill running, I could see this situation happening.
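For example (using the job ID from the squeue output below), something like the following should show exactly what the pending job has asked for:

  scontrol show job 409978
  # or a compact view: job ID, partition, user, CPUs, nodes, memory, time limit, reason
  squeue -j 409978 -o "%i %P %u %C %D %m %l %R"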

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Mar 29, 2020, at 10:17 PM, Carter, Allan  wrote:
> 
> 
> I’m perplexed. My cluster has been churning along and tonight it has decided 
> to start pending jobs even though there are plenty of nodes available.
>  
> An example job from squeue:
>  
> JOBID PARTITION NAME USER ST   TIME  NODES 
> NODELIST(REASON)
> 409978 interactiverdi amirinen PD   0:00  1 
> (Resources)
> 409989   regress update_r  jenkins PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 409985   regress update_r amirinen PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 409982   regress update_r akshabal PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 409994   regress SYN__tpb kumarbck PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 40 interacti sbatch_w akshabal PD   0:00  1 (Priority)
> 41   regress ICC2__tp  gadikon PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 410005   regress update_r amirinen PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 410003   regress update_r bachchuk PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 410006   regress update_r saurahuj PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 410009   regress xterm_fi  gadikon PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 410010   regress ICC2__tp  gadikon PD   0:00  1 (Nodes 
> required for job are DOWN, DRAINED or reserved for jobs in higher priority 
> partitions)
> 410001   regress ICC2__tp  gadikon PD   0:00  1 
> (Dependency)
> 410002   regress ICC2__tp  gadikon PD   0:00  1 
> (Dependency)
> 410004   regress ICC2__tp  gadikon PD   0:00  1 
> (Dependency)
> 410011   regress ICC2__tp  gadikon PD   0:00  1 
> (Dependency)
> 410014   regress ICC2__tp  gadikon PD   0:00  1 
> (Dependency)
> 410015   regress ICC2__tp  gadikon PD   0:00  1 
> (Dependency)
> 409937 interactiverdi   nsamra  R5:51:10  1 
> c7-c5n-18xl-3
>  
> The output of sinfo shows plenty of nodes available for the scheduler.
>  
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> all  up   infinite  31954  idle~ 
> al2-t3-2xl-[0-999],al2-t3-l-[0-999],c7-c5-24xl-[0-5,7-10,14,16-17,19,21-46,48-151,153-155,157-164,167,169-485,487-999],c7-c5d-24xl-[0,2-999],c7-c5n-18xl-[0-2,4-14,16-26,28-44,46,48-51,53-54,56-63,65-67,69-72,74-75,77-82,84,86-99,101-999],c7-m5-24xl-[0-325,327-999],c7-m5d-24xl-[0-191,193-999],c7-m5dn-24xl-[0-3,5-97,99-999],c7-m5n-24xl-[0-24,26-999],c7-r5d-16xl-[0-3,5-999],c7-r5d-24xl-[1-16,18-999],c7-r5dn-24xl-[0-1,3-999],c7-t3-2xl-[0-8,10-970,973-999],c7-t3-l-[0-999],c7-x1-32xl-[0-6,8-999],c7-x1e-32xl-[0-999],c7-z1d-12xl-[0,2-5,7,9-10,12-999],rh7-c5-24xl-[0-999],rh7-c5d-24xl-[0-999],rh7-c5n-18xl-[0-999],rh7-m5-24xl-[0-999],rh7-m5d-24xl-[0-999],rh7-m5dn-24xl-[0-999],rh7-m5n-24xl-[0-999],rh7-r5d-16xl-[0-999],rh7-r5d-24xl-[0-999],rh7-r5dn-24xl-[0-999],rh7-t3-2xl-[0-999],rh7-t3-l-[0-999],rh7-x1-32xl-[0-999],rh7-x1e-32xl-[0-999],rh7-z1d-12xl-[0-999]
> all  up   infinite  2  drain c7-t3-l-s-0,rh7-t3-l-s-0
> all  up   infinite 46mix 
> c7-c5-24xl-[6,11-13,15,18,20,47,152,156,165-166,168,486],c7-c5d-24xl-1,c7-c5n-18xl-[3,15,27,45,47,52,55,64,68,73,76,83,85,100],c7-m5-24xl-326,c7-m5d-24xl-192,c7-m5dn-24xl-[4,98],c7-m5n-24xl-25,c7-r5d-16xl-4,c7-r5d-24xl-[0,17],c7-r5dn-24xl-2,c7-t3-2xl-[9,971-972],c7-x1-32xl-7,c7-z1d-12xl-[1,6,8,11]
> all  up   infinite  1   

Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread Renfro, Michael
Others might have more ideas, but anything I can think of would require a lot 
of manual steps to avoid mutual interference with jobs in the other partitions 
(allocating resources for a dummy job in the other partition, modifying the MPI 
host list to include nodes in the other partition, etc.).

So why not make another partition encompassing both sets of nodes?
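A minimal sketch of that (the node ranges here are placeholders for whatever the two existing partitions contain, and the job script name is made up):

  # slurm.conf
  PartitionName=alpha    Nodes=node[001-016] State=UP
  PartitionName=beta     Nodes=node[017-032] State=UP
  PartitionName=combined Nodes=node[001-032] State=UP

  # then submit the MPI job to the combined partition, e.g.:
  sbatch --partition=combined --nodes=20 my_mpi_job.sh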

> On Mar 23, 2020, at 10:58 AM, CB  wrote:
> 
> Hi Andy,
> 
> Yes, they are on the same network fabric.
> 
> Sure, creating another partition that encompasses all of the nodes of the two or 
> more partitions would solve the problem.
> I am wondering if there are any other ways instead of creating a new 
> partition?
> 
> Thanks,
> Chansup
> 
> 
> On Mon, Mar 23, 2020 at 11:51 AM Riebs, Andy  wrote:
> When you say “distinct compute nodes,” are they at least on the same network 
> fabric?
> 
>  
> 
> If so, the first thing I’d try would be to create a new partition that 
> encompasses all of the nodes of the other two partitions.
> 
>  
> 
> Andy
> 
>  
> 
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
> CB
> Sent: Monday, March 23, 2020 11:32 AM
> To: Slurm User Community List 
> Subject: [slurm-users] Running an MPI job across two partitions
> 
>  
> 
> Hi,
> 
>  
> 
> I'm running Slurm 19.05 version.
> 
>  
> 
> Is there any way to launch an MPI job on a group of distributed  nodes from 
> two or more partitions, where each partition has distinct compute nodes?
> 
>  
> 
> I've looked at the heterogeneous job support but it creates two-separate jobs.
> 
>  
> 
> If there is no such capability with the current Slurm, I'd like to hear any 
> recommendations or suggestions.
> 
>  
> 
> Thanks,
> 
> Chansup
> 



Re: [slurm-users] Can slurm be configured to only run one job at a time?

2020-03-23 Thread Renfro, Michael
Rather than configure it to only run one job at a time, you can use job 
dependencies to make sure only one job of a particular type runs at a time. A 
singleton dependency [1, 2] should work for this. From [1]:

  #SBATCH --dependency=singleton --job-name=big-youtube-upload

in any job script would ensure that only one job with that job name should run 
at a time.

[1] https://slurm.schedmd.com/sbatch.html
[2] https://hpc.nih.gov/docs/job_dependencies.html
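A minimal script using it might look like the sketch below (the job name is arbitrary as long as every job in the set shares it, and the upload command is a placeholder):

  #!/bin/bash
  #SBATCH --job-name=big-youtube-upload
  #SBATCH --dependency=singleton
  #SBATCH --time=30:00
  ./upload_video.sh   # placeholder for the actual upload command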

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Mar 23, 2020, at 10:00 AM, Faraz Hussain  wrote:
> 
> I have a five node cluster of raspberry pis. Every hour they all have to 
> upload a local 1 GB file to YouTube. I want it so only one pi can upload at a 
> time so that the network doesn't get bogged down.
> 
> Can slurm be configured to only run one job at a time? Or perhaps some other 
> way to accomplish what I want?
> 
> Thanks!
> 




Re: [slurm-users] Limit Number of Jobs per User in Queue?

2020-03-18 Thread Renfro, Michael
In addition to Sean’s recommendation, your user might want to use job arrays 
[1]. That’s less stress on the scheduler, and throughput should be equivalent 
to independent jobs.

[1] https://slurm.schedmd.com/job_array.html
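For example, instead of 15,000 separate sbatch calls, a single array job along these lines keeps the queue to one record (the script name and array size are placeholders, the %50 suffix caps how many array tasks run at once, and the maximum index is still subject to the cluster's MaxArraySize setting):

  #!/bin/bash
  #SBATCH --array=1-15000%50
  #SBATCH --time=1:00:00
  ./process_case.sh ${SLURM_ARRAY_TASK_ID}   # placeholder per-task command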

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Mar 18, 2020, at 12:10 PM, Hanby, Mike  wrote:



Howdy,

We are running Slurm 18.08. We have a user who has, twice, submitted over 15 
thousand jobs to the cluster (the queue normally has a couple thousand jobs at 
any given time).

This results in Slurm being unresponsive to user requests / job submits. I 
suspect the scheduler is getting bogged down doing backfill processing.

Is there any way to limit the maximum number of jobs a single user can have in 
the queue at any given time?


Mike Hanby
mhanby @ uab.edu
Systems Analyst III - Enterprise
IT Research Computing Services
The University of Alabama at Birmingham


Re: [slurm-users] Upgrade paths

2020-03-11 Thread Renfro, Michael
The release notes at https://slurm.schedmd.com/archive/slurm-19.05.5/news.html 
indicate you can upgrade from 17.11 or 18.08 to 19.05. I didn’t find equivalent 
release notes for 17.11.7, but upgrades that skip a single intermediate major 
release should work.

> On Mar 11, 2020, at 2:01 PM, Will Dennis  wrote:
> 
> Hi all,
>  
> I have one cluster running v16.05.4 that I would like to upgrade if possible 
> to 19.05.5; it was installed via a .deb package I created back in 2016. I 
> have located a 17.11.7 Ubuntu PPA 
> (https://launchpad.net/~jonathonf/+archive/ubuntu/slurm) and have myself 
> recently put up one for 19.05.5 
> (https://launchpad.net/~wdennis/+archive/ubuntu/dhpc-backports). 
> Theoretically, I believe I should be able to upgrade from the 16.05 release 
> to 17.11, then from 17.11 to 19.05, correct? (going under the assumption that 
> can only go forward at most 2 Slurm releases, which went 16.05 -> 17.02 -> 
> 17.11 -> 18.08 -> 19.05, if I am correct.)
>  
> Thanks,
> Will



Re: [slurm-users] Issue with "hetjob" directive with heterogeneous job submission script

2020-03-05 Thread Renfro, Michael
I’m going to guess the job directive changed between earlier releases and 
20.02. An version of the page from last year [1] has no mention of hetjob, and 
uses packjob instead.
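So on 19.05, the script from the message quoted below would presumably need packjob in place of hetjob:

  #!/bin/bash
  #SBATCH --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1
  #SBATCH packjob
  #SBATCH --cpus-per-task=2 --mem-per-cpu=1g  --ntasks=8
  srun exec_myapp.bash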

On a related note, is there a canonical location for older versions of Slurm 
documentation? My local man pages are always consistent with the installed 
version, but lots of people Google part of their solution, and are always 
pointed to documentation for the latest stable release.

[1] 
https://web.archive.org/web/20191227221359/https://slurm.schedmd.com/heterogeneous_jobs.html
-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Mar 4, 2020, at 2:05 PM, CB  wrote:
> 
> Hi,
> 
> I'm running Slurm 19.05.5.
> 
> I've tried to write a job submission script for a heterogeneous job following 
> the example at https://slurm.schedmd.com/heterogeneous_jobs.html
> 
> But it failed with the following error message:
> 
> $ sbatch new.bash
> sbatch: error: Invalid directive found in batch script: hetjob
> 
> Below is the new.bash job script:
> $ cat new.bash
> #!/bin/bash
> #SBATCH --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1
> #SBATCH hetjob
> #SBATCH --cpus-per-task=2 --mem-per-cpu=1g  --ntasks=8
> srun exec_myapp.bash
> 
> Has anyone tried this?
> 
> I've tried the following command at the command line and it worked fine. 
> $ sbatch --cpus-per-task=4 --mem-per-cpu=16g --ntasks=1 : --cpus-per-task=2 
> --mem-per-cpu=1g  --ntasks=8 exec_myapp.bash
> 
> Thanks,
> Chansup
> 



Re: [slurm-users] Should there be a different gres.conf for each node?

2020-03-05 Thread Renfro, Michael
We have a shared gres.conf that includes node names, which should have the 
flexibility to specify node-specific settings for GPUs:

=

NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia0 COREs=0-7
NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia1 COREs=8-15

=

See the third example configuration at https://slurm.schedmd.com/gres.conf.html 
for a reference.

> On Mar 5, 2020, at 9:24 AM, Durai Arasan  wrote:
> 
> When configuring a slurm cluster you need to have a copy of the configuration 
> file slurm.conf on all nodes. These copies are identical. In the situation 
> where you need to use GPUs in your cluster you have an additional 
> configuration file that you need to have on all nodes. This is the gres.conf. 
> My question is - will this file be different on each node depending on the 
> configuration on that node or will it be identical on all nodes (like 
> slurm.conf?). Assume that the slave nodes have different configurations of 
> gpus in them and are not identical.
> 
> 
> Thank you,
> Durai




Re: [slurm-users] Problem with configuration CPU/GPU partitions

2020-02-28 Thread Renfro, Michael
When I made similar queues, and only wanted my GPU jobs to use up to 8 cores 
per GPU, I set Cores=0-7 and 8-15 for each of the two GPU devices in gres.conf. 
Have you tried reducing those values to Cores=0 and Cores=20?
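In other words, a sketch of the change (matching the gres.conf quoted below) would be:

  Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
  Name=gpu Type=v100 File=/dev/nvidia1 Cores=20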

> On Feb 27, 2020, at 9:51 PM, Pavel Vashchenkov  wrote:
> 
> Hello,
> 
> I have a hybrid cluster with 2 GPUs and 2 20-cores CPUs on each node.
> 
> I created two partitions: - "cpu" for CPU-only jobs which are allowed to
> allocate up to 38 cores per node - "gpu" for GPU-only jobs which are
> allowed to allocate up to 2 GPUs and 2 CPU cores.
> 
> Respective sections in slurm.conf:
> 
> # NODES
> NodeName=node[01-06] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1
> Gres=gpu:2(S:0-1) RealMemory=257433
> 
> # PARTITIONS
> PartitionName=cpu Default=YES Nodes=node[01-06] MaxNodes=6 MinNodes=0
> DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=38
> PartitionName=gpu Nodes=node[01-06] MaxNodes=6 MinNodes=0
> DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=2
> 
> and in gres.conf:
> Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
> Name=gpu Type=v100 File=/dev/nvidia1 Cores=20-39
> 
> However, it does not seem to be working properly. If I first submit a GPU job
> using all of the resources available in the "gpu" partition and then a CPU job
> allocating the rest of the CPU cores (i.e. 38 cores per node) in the "cpu"
> partition, it works perfectly fine. Both jobs start running. But if I
> change the submission order and start the CPU job before the GPU job, the "cpu"
> job starts running while the "gpu" job stays in the queue with PENDING
> status and RESOURCES reason.
> 
> My first guess was that the "cpu" job allocates cores assigned to the respective
> GPUs in gres.conf and prevents the GPU devices from running. However, that
> seems not to be the case, because a 37-core job per node instead of 38
> solves the problem.
> 
> Another thought was it has something to do with the specialized cores
> reservation, but I tried to change CoreSpecCount option without success.
> 
> So, any ideas how to fix this behavior and where should look?
> 
> Thanks!
> 




Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Renfro, Michael
If that 32 GB is main system RAM, and not GPU RAM, then yes. Since our GPU 
nodes are over-provisioned in terms of both RAM and CPU, we end up using the 
excess resources for non-GPU jobs.

If that 32 GB is GPU RAM, then I have no experience with that, but I suspect 
MPS would be required.

> On Feb 27, 2020, at 11:14 AM, Robert Kudyba  wrote:
> 
> So looking at the new cons_tres option at 
> https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf, would we 
> be able to use, e.g., --mem-per-gpu= (memory per allocated GPU), and if a user 
> allocated --mem-per-gpu=8, and the V100 we have is 32 GB, will subsequent 
> jobs be able to use the remaining 24 GB?




Re: [slurm-users] Using "Nodes" on script - file ????

2020-02-12 Thread Renfro, Michael
Hey, Matthias. I’m having to translate a bit, so if I get a meaning wrong, 
please correct me.

You should be able to set the minimum and maximum number of nodes used for jobs 
on a per-partition basis, or to set a default for all partitions. My most 
commonly used partition has:

  PartitionName=batch MinNodes=1 MaxNodes=40 …

and each job runs on one node by default, without anyone having to specify a 
node count.

If your users are running purely OpenMP jobs, with no MPI at all, there’s no 
reason for them to request more than one node per job, as you probably already 
know, and you could potentially set MaxNodes=1 for one or more partitions. If 
they’re using MPI, they’ll typically need the ability to use more than one node.

You could also use maximum job times, QoS settings, or trackable resource 
(TRES) limits on a per-user, per-account, or per-partition basis to keep users 
from consuming all your resources for an extended period of time.
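As one hedged example (the QOS name and the numbers are only placeholders), a per-user cap on cores and running jobs could be attached to a partition with something like:

  sacctmgr add qos capped
  sacctmgr modify qos capped set MaxTRESPerUser=cpu=80 MaxJobsPerUser=10

  # slurm.conf: attach the QOS to the partition
  PartitionName=batch MinNodes=1 MaxNodes=40 QOS=capped State=UP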

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Feb 12, 2020, at 6:27 AM, Matthias Krawutschke 
>  wrote:
> 
> Hello together,
> I have a special question regarding the variables:
>  
> #SBATCH -nodes = 2 or SRUN -N….
>  
> Some users of the HPC set this value very high and allocate compute nodes 
> that they do not actually need.
>  
> My question is now the following:
> Is it really necessary for this value to be set in the script or on the 
> command line, or can it be left out?
> In which cases is it necessary to limit it? Is that possible with OpenMPI?
>  
> Best regards….
>  
>  
>  
> Matthias Krawutschke, Dipl. Inf.
>  
> Universität Potsdam
> ZIM - Zentrum für Informationstechnologie und Medienmanagement
> Team High-Performance-Computing on Cluster - Environment
>  
> Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
> Tel: +49 331 977-, Fax: +49 331 977-1750
>  
> Internet: https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html



Re: [slurm-users] Limits to partitions for users groups

2020-02-05 Thread Renfro, Michael
If you want to rigidly define which 20 nodes are available to the one group of 
users, you could define a 20-node partition for them, and a 35-node partition 
for the priority group, and restrict access by Unix group membership:

PartitionName=restricted Nodes=node0[01-20] AllowGroups=ALL
PartitionName=priority Nodes=node0[01-35] AllowGroups=prioritygroup

If you don’t care which of the 35 nodes get used by the first group, but want 
to restrict them to using at most 20 nodes of the 35, you could define a single 
partition and a QOS for each group:

PartitionName=restricted Nodes=node0[01-35] AllowGroups=ALL QoS=restricted
PartitionName=priority Nodes=node0[01-35] AllowGroups=prioritygroup QoS=priority

sacctmgr add qos restricted
sacctmgr modify qos restricted set grptres=cpu=N # where N=20*(cores per node)
sacctmgr add qos priority
sacctmgr modify qos priority set grptres=cpu=-1 # might not be strictly required


> On Feb 5, 2020, at 8:07 AM, Рачко Антон Сергеевич  wrote:
> 
> I have a partition with 35 nodes. Many users use it, but one group of them has 
> higher priority than the others. I want to set a limit of at most 20 nodes for 
> ordinary users, and allow users in the priority group to use all the nodes.
> I could split this partition into two: a 20-node partition for everyone and a 
> 15-node partition for the priority group. Can I do it another way (sacctmgr, QOS, etc.)?



Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
Slurm 19.05 now, though all these settings were in effect on 17.02 until quite 
recently. If I get some detail wrong below, I hope someone will correct me. But 
this is our current working state. We’ve been able to schedule 10-20k jobs per 
month since late 2017, and we successfully scheduled 320k jobs over December 
and January (largely due to one user using some form of automated submission 
for very short jobs).

Basic scheduler setup:

As I’d said previously, we prioritize on fairshare almost exclusively. Most of 
our jobs (molecular dynamics, CFD) end up in a single batch partition, since 
GPU and big-memory jobs have other partitions.

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10
PriorityWeightAge=1000
PriorityWeightPartition=1
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

TRES limits:

We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
grptresrunmin=cpu=1440000 — there might be a way of doing this at a higher 
accounting level, but it works as is.

We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set 
MaxJobsPerUser equal to our total GPU count. That helps prevent users from 
queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES 
limit above.

Backfill:

  SchedulerType=sched/backfill
  
SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

Can’t remember where I found the backfill guidance, but:

- bf_window is set to our maximum job length (30 days) and bf_resolution is set 
to 1.5 days. Most of our users’ jobs are well over 1 day.
- We have had users who didn’t use job arrays, and submitted a ton of small 
jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 
80 jobs per user each cycle. This also prompted us to increase 
default_queue_depth, so the backfill scheduler would examine more jobs each 
cycle.
- bf_continue should let the backfill scheduler continue where it left off if 
it gets interrupted, instead of having to start from scratch each time.

I can guarantee you that our backfilling was sub-par until we tuned these 
parameters (or at least a few users could find a way to submit so many jobs 
that the backfill couldn’t keep up, even when we had idle resources for their 
very short jobs).

> On Jan 31, 2020, at 3:01 PM, David Baker  wrote:
> 
> Hello,
> 
> Thank you for your detailed reply. That’s all very useful. I managed to 
> mistype our cluster size, since there are actually 450 standard 40-core 
> compute nodes. What you say is interesting, and so it concerns me that 
> things are so bad at the moment.
> 
> I wondered if you could please give me some more details of how you use TRES 
> to throttle user activity. We have applied some limits to throttle users, 
> however perhaps not enough or not well enough. So the details of what you do 
> would be really appreciated, please.
> 
> In addition, we do use backfill, however we rarely see nodes being freed up 
> in the cluster to make way for high priority work which again concerns me. If 
> you could please share your backfill configuration then that would be 
> appreciated, please.
> 
> Finally, which version of Slurm are you running? We are using an early 
> release of v18.
> 
> Best regards,
> David
> 
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 January 2020 17:23:05
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>  
> I missed reading what size your cluster was at first, but found it on a 
> second read. Our cluster and typical maximum job size scales about the same 
> way, though (our users’ typical job size is anywhere from a few cores up to 
> 10% of our core count).
> 
> There are several recommendations to separate your priority weights by an 
> order of magnitude or so. Our weights are dominated by fairshare, and we 
> effectively ignore all other factors.
> 
> We also put TRES limits on by default, so that users can’t queue-stuff beyond 
> a certain limit (any jobs totaling under around 1 cluster-day can be in a 
> running or queued state, and anything past that is ignored until their 
> running jobs burn off some of their time). This allows other users’ jobs to 
> have a chance to run if resources are available, even if they were submitted 
> well after the heavy users’ blocked jobs.
> 
> We also make extensive use of the backfill scheduler to run small, short jobs 
> earlier than their queue time might allow, if and only if they don’t delay 
> other jo

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
I missed reading what size your cluster was at first, but found it on a second 
read. Our cluster and typical maximum job size scales about the same way, 
though (our users’ typical job size is anywhere from a few cores up to 10% of 
our core count).

There are several recommendations to separate your priority weights by an order 
of magnitude or so. Our weights are dominated by fairshare, and we effectively 
ignore all other factors.

We also put TRES limits on by default, so that users can’t queue-stuff beyond a 
certain limit (any jobs totaling under around 1 cluster-day can be in a running 
or queued state, and anything past that is ignored until their running jobs 
burn off some of their time). This allows other users’ jobs to have a chance to 
run if resources are available, even if they were submitted well after the 
heavy users’ blocked jobs.

We also make extensive use of the backfill scheduler to run small, short jobs 
earlier than their queue time might allow, if and only if they don’t delay 
other jobs. If a particularly large job is about to run, we can see the nodes 
gradually empty out, which opens up lots of capacity for very short jobs.

Overall, our average wait times since September 2017 haven’t exceeded 90 hours 
for any job size, and I’m pretty sure a *lot* of that wait is due to a few 
heavy users submitting large numbers of jobs far beyond the TRES limit. Even 
our jobs of 5-10% cluster size have average start times of 60 hours or less 
(and we've managed under 48 hours for those size jobs for all but 2 months of 
that period), but those larger jobs tend to be run by our lighter users, and 
they get a major improvement to their queue time due to being far below their 
fairshare target.

We’ve been running at >50% capacity since May 2018, and >60% capacity since 
December 2018, and >80% capacity since February 2019. So our wait times aren’t 
due to having a ton of spare capacity for extended periods of time.

Not sure how much of that will help immediately, but it may give you some ideas.

> On Jan 31, 2020, at 10:14 AM, David Baker  wrote:
> 
> Hello,
> 
> Thank you for your reply. in answer to Mike's questions...
> 
> Our serial partition nodes are partially shared by the high memory partition. 
> That is, the partitions overlap partially -- shared nodes move one way or 
> another depending upon demand. Jobs requesting up to and including 20 cores 
> are routed to the serial queue. The serial nodes are shared resources. In 
> other words, jobs from different users can share the nodes. The maximum time 
> for serial jobs is 60 hours. 
> 
> Overtime there hasn't been any particular change in the time that users are 
> requesting. Likewise I'm convinced that the overall job size spread is the 
> same over time. What has changed is the increase in the number of smaller 
> jobs. That is, one node jobs that are exclusive (can't be routed to the 
> serial queue) or that require more than 20 cores, and also jobs requesting up 
> to 10/15 nodes (let's say). The user base has increased dramatically over the 
> last 6 months or so. 
> 
> This over population is leading to the delay in scheduling the larger jobs. 
> Given the size of the cluster we may need to make decisions regarding which 
> types of jobs we allow to "dominate" the system. The larger jobs at the 
> expense of the small fry for example, however that is a difficult decision 
> that means that someone has got to wait longer for results..
> 
> Best regards,
> David
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 January 2020 13:27
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>  
> Greetings, fellow general university resource administrator.
> 
> Couple things come to mind from my experience:
> 
> 1) does your serial partition share nodes with the other non-serial 
> partitions?
> 
> 2) what’s your maximum job time allowed, for serial (if the previous answer 
> was “yes”) and non-serial partitions? Are your users submitting particularly 
> longer jobs compared to earlier?
> 
> 3) are you using the backfill scheduler at all?
> 
> --
> Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
> 931 372-3601  / Tennessee Tech University
> 
>> On Jan 31, 2020, at 6:23 AM, David Baker  wrote:
>> 
>> Hello,
>> 
>> Our SLURM cluster is relatively small. We have 350 standard compute nodes 
>> each with 40 cores. The largest job that users  can run on the partition is 
>> one requesting 32 nodes. Our cluster is a general university research 
>>
