Re: [slurm-users] Partition "exclude"

2018-05-21 Thread Brian Andrus
Unless you specify a partition, it should go to the partition defined as default. Do you mean not to run on particular nodes? In that case, you can use the --exclude option: -x, --exclude=<node name list> Explicitly exclude certain nodes from the resources granted to the job. Brian Andrus On 5/21
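To illustrate the --exclude option quoted above, here is a minimal sketch that builds the option from a list of node names. The helper name exclude_arg, the node names and job.sh are made up for the example; only the option itself and its comma-separated node list come from Slurm.

```shell
# Hypothetical helper: join node names into a single --exclude option.
exclude_arg() {
    local IFS=,
    printf -- '--exclude=%s\n' "$*"
}

# Example use (needs a working Slurm install, so left as a comment):
#   sbatch "$(exclude_arg node01 node02)" job.sh
```

The job can still land on any other node of the selected (or default) partition.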

Re: [slurm-users] Getting nodes in a partition

2018-05-18 Thread Brian Andrus
I saw you got some good answers, but a quick note on MPI. For some of them, if you are compiling it yourself, they can be "slurm-aware" (e.g. openmpi). Then when you do 'mpirun' it automatically knows your inherited hostlist and you need do nothing extra when running. Brian Andrus On

Re: [slurm-users] run bash script in spank plugin

2018-06-04 Thread Brian Andrus
Seems like there are better approaches. In this situation, I would use an epilogue script and give sudo access to the script. Check out https://slurm.schedmd.com/prolog_epilog.html That would likely be much easier and fit into the methodology slurm uses. Brian Andrus Firstspot, Inc. On 6/4

Re: [slurm-users] Is It Possible to change the node order for different partition

2018-06-26 Thread Brian Andrus
for scheduling individually. The default value is 1. Brian Andrus On 6/26/2018 3:06 PM, Bill wrote: Hi Everyone, For example, I have two partitions, high,low each has same nodes node[1-10], When we submit job to high partition the nodes order is node1,node2..node10, when we submit job to low

Re: [slurm-users] Fwd: srun: error: Unable to allocate resources: Invalid partition name specified

2018-07-27 Thread Brian Andrus
You show you still have more than one partition with Default=YES. There should be one and only one that is set to YES. That is the one partition that is used if it is not specified. Brian Andrus On 7/27/2018 6:34 AM, valeri...@cbpf.br wrote: Hi Merlin Do you accidentally have more than one
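A rough way to check for this is sketched below: it lists every partition in slurm.conf-style text that carries Default=YES. The function name is invented for the example and the config path in the usage comment is an assumption; the sketch also ignores included files, so treat it as a quick sanity check only.

```shell
# Hypothetical check: read slurm.conf-style text on stdin and print the
# name of every partition marked Default=YES. More than one line of
# output means more than one partition claims to be the default.
default_partitions() {
    awk 'tolower($1) ~ /^partitionname=/ {
             name = $1
             sub(/^[^=]*=/, "", name)
             for (i = 2; i <= NF; i++)
                 if (tolower($i) == "default=yes") print name
         }'
}

# Example use (the path is an assumption; adjust to your install):
#   default_partitions < /etc/slurm/slurm.conf
```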

[slurm-users] Memory requirement in percentage

2018-08-11 Thread Brian Andrus
All, Is it possible to submit a job such that the memory limit is a percentage of that on the node? For instance a cluster with nodes in the same partition with varying memory installed. If it lands on a node with more memory, go ahead and use it. Brian Andrus

Re: [slurm-users] Accounting - running with 'wrong' account on cluster

2018-11-06 Thread Brian Andrus
" - Guy Fleegman (GalaxyQuest) Brian Andrus On Tue, Nov 6, 2018 at 4:39 PM Christopher Samuel wrote: > On 7/11/18 7:35 am, Brian Andrus wrote: > > > I am able to submit using account=projectB on cluster3. ??? > > Since 'projectB' is a child of account ' DevOps', which is only

Re: [slurm-users] Accounting - running with 'wrong' account on cluster

2018-11-06 Thread Brian Andrus
Ah just scontrol reconfigure doesn't actually make it take effect. Restarting slurmctld did it. On Tue, Nov 6, 2018 at 7:07 PM Christopher Samuel wrote: > On 7/11/18 1:57 pm, Brian Andrus wrote: > > > Ah. I thought I had set that. > > So I did and now it is: > >

Re: [slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Brian Andrus
We use sssd with realmd; enumeration is off. Brian Andrus On 11/8/2018 11:26 AM, Marcin Stolarek wrote: I have very similar issue for quite a time and I was unable to find its root cause. Are you using sssd and AD as a data source with only a subtree of entries searched - this is my case

[slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Brian Andrus
? This is an issue in a production environment. We don't want to have to restart all the slurmctld daemons anytime there is a change to any associations. That could get painful Brian Andrus

Re: [slurm-users] Accounting - running with 'wrong' account on cluster

2018-11-06 Thread Brian Andrus
to allocate resources: Invalid account or account/partition combination specified So now I don't seem to be able to run anything... On Tue, Nov 6, 2018 at 7:53 PM Christopher Samuel wrote: > On 7/11/18 2:44 pm, Brian Andrus wrote: > > > Ah just scontrol reconfigure doesn't actually

Re: [slurm-users] bug 2119 with slurm 18.08.2

2018-11-09 Thread Brian Andrus
not ideal. Brian Andrus On 11/8/2018 1:31 PM, Chris Samuel wrote: On Friday, 9 November 2018 5:38:22 AM AEDT Brian Andrus wrote: Where, slurmctld is not picking up new accounts unless it is restarted. This is usually because slurmdbd cannot connect back to the slurmctld on the management

[slurm-users] srun from slurmdb system

2018-12-17 Thread Brian Andrus
slurmctld[54739]: _job_complete: JobId=6 done Is this something that cannot be done from a system that is outside a federated cluster? Brian Andrus

Re: [slurm-users] sacct end time for failed jobs

2019-03-06 Thread Brian Andrus
9 10:34 PM, Chris Samuel wrote: On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote: Does anyone have a process they use to handle empty (aka "Unknown") end times for jobs that are not running? What does: sacctmgr list runawayjobs say?

Re: [slurm-users] sacct end time for failed jobs

2019-03-06 Thread Brian Andrus
ed for me. I don't know if this is your problem or not. If you choose this route, be careful and good luck! On 3/6/19 10:15 AM, Brian Andrus wrote: It shows several jobs that all have "Unknown" for end_time. Some are PENDING and some are RUNNING (none are truly in either state).

Re: [slurm-users] sacct end time for failed jobs

2019-03-05 Thread Brian Andrus
time keeps growing. Does anyone have a process they use to handle empty (aka "Unknown") end times for jobs that are not running? Brian Andrus On Wed, Feb 27, 2019 at 10:43 PM Chris Samuel wrote: > On Tuesday, 26 February 2019 10:03:34 AM PST Brian Andrus wrote: > > > On

[slurm-users] sacct end time for failed jobs

2019-02-26 Thread Brian Andrus
It seems to me that END should be filled with the time the job failed, no? Is there a setting or something that can be done to do this? Or a schema so I could update the table(s) myself for any job with a state of "FAILED"? All the Best, Brian Andrus

Re: [slurm-users] Counting total number of cores specified in the sbatch file

2019-06-08 Thread Brian Andrus
If you are using mpi, it should be aware automatically if everything was compiled with support (eg mpirun). If you are looking to just get the total tasks, $SLURM_NTASKS is probably what you are looking for Brian Andrus On 6/8/2019 2:46 AM, Mahmood Naderan wrote: Hi, A genetic program
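A minimal sketch of reading the task count inside a batch script follows. The fallback to 1 when the variable is unset is a choice made for this example, not Slurm behaviour; as far as I know the variable mirrors the -n/--ntasks request, so it may be missing if no task count was requested.

```shell
# Total tasks for the job; falls back to 1 when SLURM_NTASKS is unset,
# e.g. when the script is run outside of Slurm (the fallback is our choice).
total_tasks() {
    echo "${SLURM_NTASKS:-1}"
}

echo "Running with $(total_tasks) tasks"
```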

[slurm-users] Pending (Resources) when nodes are available

2019-06-14 Thread Brian Andrus
is passed to bring up at once? ResumeRate is the default 300. Brian Andrus

[slurm-users] MinJobAge not honored

2019-06-19 Thread Brian Andrus
Using slurm 19.05.0-1 MinJobAge is set to 300 MaxJobCount is set to 1 There are only about 30 jobs running. However, when a job completes, it vanishes immediately from the output of 'squeue' Shouldn't it be staying there for 5 minutes? Brian Andrus

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Brian Andrus
Can you give the exact command/output you have from this? I suspect a typo in your slurm.conf for nodenames or what you are typing. Brian Andrus On 6/18/2019 11:29 PM, nathan norton wrote: Hi, It just shows "Node $NODE not found" Whereas others all work as expected (ie, they a

[slurm-users] Slurm accounting/slurmdbd way slow

2019-05-22 Thread Brian Andrus
All, So I am experiencing great frustrations with the associations and performance of slurmdbd with a mariadb backend. A simple example is where I have a user with access to 4 partitions each with the same 1200 account codes. I want to retire two of the partitions, but there is no simple

[slurm-users] options for ResumeProgram

2019-05-20 Thread Brian Andrus
All, I know the argument passed to ResumeProgram is the node to be started, but is there any way to access job info from within that script? In particular, the number of nodes and cores actually requested. Brian Andrus

Re: [slurm-users] Forward arrows to stdin through srun?

2019-06-27 Thread Brian Andrus
I think you need a pty instead of just running bash... try: srun --pty bash Or get specific on what resources you need, eg: srun --nodes=1 --exclusive --pty bash Brian Andrus On 6/27/2019 2:11 PM, Micael Carvalho wrote: Hello there, I am having trouble with arrow keys in srun. Example

Re: [slurm-users] Host not being a valid controller

2019-06-28 Thread Brian Andrus
That is because your configuration only lists node0 as the host. You can only have one slurmctld running at a time, so you can either define node1 as a backuphost or not bother trying to start slurmctld on it. Brian Andrus On 6/28/2019 6:31 AM, Pär Lundö wrote: Hi all slurm-experts

Re: [slurm-users] Hide Filesystem From Slurm

2019-07-11 Thread Brian Andrus
I don't think that is possible. At least not easily. I just symlink /tmp to /scratch on systems I use. That way folks can get used to /scratch, but if anything has hard-coded /tmp, it will still work. Brian Andrus On 7/11/2019 8:19 AM, Douglas Duckworth wrote: Hello I am wondering

Re: [slurm-users] dual slurmctld and slurmdbd

2019-07-02 Thread Brian Andrus
slurm.conf your BackupController and your AccountingStorageBackupHost slurmctld and slurmdbd will run on each of those respectively. Brian Andrus On 7/2/2019 1:48 PM, Tina Fora wrote: Hi all, We run mysql on a dedicated machine with slurmctld and slurmdbd running on another machine. Now I want to add

Re: [slurm-users] dual slurmctld and slurmdbd

2019-07-03 Thread Brian Andrus
. May not exceed 65533. Brian Andrus On 7/3/2019 2:45 PM, Tina Fora wrote: Thanks Brian Andrus and Chris Samuel. I was able to get it to work on our dev setup as primary/backup. Already had the shared state directory. If I take primary down it takes about two minutes for slurm commands to work

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Brian Andrus
in megabytes (e.g. "2048"). The default value is 1. I would suggest RealMemory=191879 , where I suspect you have RealMemory=196489092 Brian Andrus On 7/8/2019 11:59 AM, Robert Kudyba wrote: I’m new to Slurm and we have a 3 node + head node cluster running Centos 7 and Bright C
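Since RealMemory is given in megabytes, a value copied from a kilobyte figure will be far too large. The sketch below converts kilobytes to whole megabytes; reading 196489092 as a kilobyte value is an assumption on my part, and the helper name is made up.

```shell
# Convert a memory size in kilobytes (as reported in /proc/meminfo) to the
# whole megabytes that RealMemory expects, rounding down.
kb_to_realmemory_mb() {
    echo $(( $1 / 1024 ))
}

# The oversized value from the quoted config, read as kilobytes
# (an assumption, see above):
kb_to_realmemory_mb 196489092
```

This prints 191883, close to the 191879 suggested above. Running 'slurmd -C' on the node should report the values slurmd itself detects, which is probably the safer number to put in slurm.conf.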

Re: [slurm-users] Substituions for "see META file" in slurm.spec file of 15.08.11-1 release

2019-07-08 Thread Brian Andrus
upgrade.. for so many reasons. Brian Andrus On 7/8/2019 12:49 PM, Pariksheet Nanda wrote: Hi SLURM devs, TL;DR: What magic incantations are needed to preprocess the slurm.spec file in SLURM 15? Our cluster is currently running SLURM version 15.08.11.  We are planning some downtime to upgrade

Re: [slurm-users] Installation troubles

2019-07-01 Thread Brian Andrus
for both. I tend to build the rpms in a very simple method: 1) yum install munge munge-devel 2) rpmbuild -ta If there are any special functions you need, ensure you have the -devel packages for them (eg: openmpi-devel) and slurm will detect that and include it in the build. Brian Andrus On 7

Re: [slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Brian Andrus
. Clusters are meant to be something that does all the work for you while you are away (hence the batch concept). You likely want to look at getting your code to run without human interference and send it off to do so. Brian Andrus On 6/29/2019 7:48 AM, Valerio Bellizzomi wrote: On Sat, 2019-06-29

Re: [slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Brian Andrus
trying to do and we may be able to advise the best way to accomplish it. Brian Andrus On 6/29/2019 12:53 AM, Valerio Bellizzomi wrote: How it gets done normally ?

Re: [slurm-users] Job error when using --job-name=`basename $PWD`

2019-07-29 Thread Brian Andrus
Yeah, you can't do that in that fashion. If you want to do that, I'd suggest you put the option in the sbatch command you use to submit the script so: sbatch --job-name=`basename $PWD` /path/to/script.sh Brian Andrus On 7/28/2019 10:51 PM, Verzelloni Fabio wrote: Hi Everyone, I'm
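The reason is that #SBATCH lines are shell comments that sbatch parses itself, so command substitution inside them is never evaluated; on the command line the shell expands it before sbatch sees it. A small sketch, with a helper name invented for the example:

```shell
# Hypothetical helper: derive a job name from a directory path.
job_name_from_dir() {
    basename "$1"
}

# Example use (needs Slurm, so left as a comment); $(...) is the modern
# spelling of the backticks used above:
#   sbatch --job-name="$(job_name_from_dir "$PWD")" /path/to/script.sh
```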

Re: [slurm-users] Trouble installing slurm-19.05.1-2.el7.centos.x86_64

2019-08-16 Thread Brian Andrus
lease... I have installed 18.08.0, .3, .4 and .8 on the same server and nodes since Sep of 2018 using the same procedures and never had any issues... Currently running 18.08.8 Thanks. Lou On Thu, Aug 15, 2019 at 3:07 PM Brian Andrus wrote: Lou,

Re: [slurm-users] Trouble installing slurm-19.05.1-2.el7.centos.x86_64

2019-08-15 Thread Brian Andrus
Lou, Are you installing on the same machine you built? Are the nvidia libraries installed by RPM or a 'make install' on the box you compiled it on? Brian Andrus On 8/15/2019 7:53 AM, Lou Nicotra wrote: I have tried running ldconfig manually as suggested with slurm-19.05.1-2 and it fails

Re: [slurm-users] Dependencies with singleton and after

2019-08-21 Thread Brian Andrus
Have you tried adding the dependency at submit time? sbatch --dependency=singleton fakejob.sh Brian Andrus On 8/21/2019 1:51 PM, Jarno van der Kolk wrote: Hi, I am helping a researcher who encountered an unexpected behaviour with dependencies. He uses both "singleton"

Re: [slurm-users] Dependencies with singleton and after

2019-08-22 Thread Brian Andrus
-H --depend=afterany:${waitjob} fakejob.sh|sed 's/Submitted batch job //') done Of course, if you are actually running the exact same script, I would recommend using arrays as well. Brian Andrus On 8/22/2019 6:23 AM, Jarno van der Kolk wrote: Hi Brian, Thanks for the suggestion
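Because the loop above is cut off in this listing, here is a self-contained sketch of the same idea: submit a script several times, each submission waiting on the previous one. It is not the original loop. The function names are invented, and submit_chain is only defined, not called, since it needs a working Slurm install.

```shell
# Extract the numeric job ID from sbatch's default "Submitted batch job N"
# message, using the same sed expression as in the loop above.
job_id_from_output() {
    printf '%s\n' "$1" | sed 's/Submitted batch job //'
}

# Hypothetical chain of n dependent submissions of the same script.
# Prints the ID of the last job in the chain.
submit_chain() {
    local n="$1" script="$2" waitjob out i
    out=$(sbatch "$script") || return 1
    waitjob=$(job_id_from_output "$out")
    i=1
    while [ "$i" -lt "$n" ]; do
        out=$(sbatch --dependency=afterany:"$waitjob" "$script") || return 1
        waitjob=$(job_id_from_output "$out")
        i=$(( i + 1 ))
    done
    echo "$waitjob"
}

# Example use:
#   submit_chain 5 fakejob.sh
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which would make the sed step unnecessary.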

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Brian Andrus
After you restart slurmctld do "scontrol reconfigure" Brian Andrus On 8/30/2019 6:57 AM, Robert Kudyba wrote: I had set RealMemory to a really high number as I mis-interpreted the recommendation. NodeName=node[001-003]  CoresPerSocket=12 RealMemory= 196489092  Sockets=2 Gres=gpu:1

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Brian Andrus
up a proper input file for a script, a single submission is all it takes. Then you can control how many are currently running (MaxArrayTask) and can change that to scale up/down. Brian Andrus On 8/25/2019 11:12 PM, Guillaume Perrault Archambault wrote: Hello, I wrote a regression-testing
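As a sketch of the single-submission approach, the helper below builds an --array option with a cap on how many tasks run at once (the %N suffix). The helper name and run_case.sh are made up. The scontrol field I know for changing the cap later is ArrayTaskThrottle, which I assume is the limit meant above.

```shell
# Hypothetical helper: build an --array option for tasks first..last with
# at most `limit` tasks running at once (the %limit suffix).
array_arg() {
    printf -- '--array=%s-%s%%%s\n' "$1" "$2" "$3"
}

# Example use (needs Slurm, so left as comments):
#   sbatch "$(array_arg 1 500 20)" run_case.sh
#   scontrol update JobId=<jobid> ArrayTaskThrottle=50   # change the cap later
```

Inside the script, SLURM_ARRAY_TASK_ID identifies the task, so it can be used to pick the matching line from the input file.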

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Brian Andrus
Here is where you may want to look into slurmdbd and sacct Then you can create a qos that has MaxJobsPerUser to limit the total number running on a per-user basis: https://slurm.schedmd.com/resource_limits.html Brian Andrus On 8/27/2019 9:38 AM, Guillaume Perrault Archambault wrote: Hi

Re: [slurm-users] slurm node weights

2019-09-05 Thread Brian Andrus
with each installation. Brian Andrus On 9/5/2019 8:48 AM, Douglas Duckworth wrote: Hello We added some newer Epyc nodes, with NVMe scratch, to our cluster and so want jobs to run on these over others.  So we added "Weight=100" to the older nodes and left the new ones blank. So indee

Re: [slurm-users] Different Memory Nodes

2019-09-04 Thread Brian Andrus
are in use, you can add weights to the node definitions. This would mean users could request >192GB memory, so it has to go to one of the updated nodes, which will only be taken if the other nodes are used up, or a job needing > 192GB is running on them. Brian Andrus On 9/4/2019 9:53 AM

Re: [slurm-users] SLURM in Virtual Machine

2019-09-12 Thread Brian Andrus
. However, there are definite use cases that make it worthwhile. So long as you allocate enough resources for the node (be it the controller or other) you will be fine. Brian Andrus On 9/12/2019 7:23 AM, Jose A wrote: Dear all, In the expansion of our Cluster we are considering to install SLURM

[slurm-users] MaxRSS not showing up in sacct

2019-09-14 Thread Brian Andrus
Quick question? When I use sacct to show job stats, it always has a blank entry for the MaxRSS field. Is there something that needs enabled to get that in? I do see it if I use sstat while the job is running. Brian Andrus

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-16 Thread Brian Andrus
is used to collect accounting information. Supported values are > jobacct_gather/linux (recommended), jobacct_gather/cgroup and > jobacct_gather/none (no information collected). > > Antony > > > On Mon, 16 Sep 2019, 14:07 Brian Andrus, wrote: > >> Yep, the ma

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-15 Thread Brian Andrus
Hmm. We are only using allocations and have slurm.conf configured with: AccountingStorageEnforce=associations,nosteps Are steps required to capture Max RSS? Brian On 9/15/2019 1:48 PM, Mark Hahn wrote: When I use sacct to show job stats, it always has a blank entry for the MaxRSS field. Is

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-16 Thread Brian Andrus
=18446744073709551614,4=1,5=4 Brian Andrus On Mon, Sep 16, 2019 at 2:58 PM Brian Andrus wrote: > I have > JobAcctGatherType = jobacct_gather/linux > > Brian > > On Mon, Sep 16, 2019 at 12:40 PM Antony Cleave

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-15 Thread Brian Andrus
The jobs have definitely completed when I try to gather the info. Brian On 9/15/2019 4:01 PM, Steven Dick wrote: I don't think it shows up until the job completes. On Sat, Sep 14, 2019 at 2:25 AM Brian Andrus wrote: Quick question? When I use sacct to show job stats, it always has a blank

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-16 Thread Brian Andrus
, Christopher Samuel wrote: On 9/15/19 4:17 PM, Brian Andrus wrote: Are steps required to capture Max RSS? No, you should see a MaxRSS reported for the batch step, for instance: $ sacct -j $JOBID -o jobid,jobname,maxrss All the best, Chris

Re: [slurm-users] Unexpected MPI process distribution with the --exclusive flag

2019-07-30 Thread Brian Andrus
s includes PPR, where the pattern would be terminated by another colon to separate it from the modifiers. so adding "--map-by node" would give you what you are looking for. Of course, this syntax is for Openmpi's mpirun command, so YMMV Brian Andrus On 7/30/2019 5:14 AM, CB

[slurm-users] Errors after removing partition

2019-07-26 Thread Brian Andrus
]) for JobId=52545 I suspect this is in the saved state directory and if I were to down the entire cluster and delete those files, it would clear it up, but I prefer to not have to down the cluster... Is there a way to clean up "phantom" nodes and partitions that were deleted? Brian Andrus

Re: [slurm-users] Errors after removing partition

2019-07-27 Thread Brian Andrus
The jobs themselves no longer exist. They had completed before I deleted the partition, which is odd to me. I may have done 'reconfigure' before restarting slurmctld, it was a while ago, so I don't recall. Brian Andrus On 7/26/2019 8:10 PM, Chris Samuel wrote: On 26/7/19 8:28 am, Jeffrey

Re: [slurm-users] sacct command to show time for node to start

2019-09-21 Thread Brian Andrus
Lyn, That was it, thanks! sacct -o reserved Brian On 9/21/2019 9:26 AM, Lyn Gerner wrote: Hey Brian, I think the discussion was in the context of suspend/resume, and it was the Reserved value that effectively represents that time. Regards, Lyn On Sat, Sep 21, 2019 at 9:15 AM Brian Andrus

[slurm-users] sacct command to show time for node to start

2019-09-21 Thread Brian Andrus
There was a command shared at the SLUG that showed how long it took a node to go from a power_down (idle~) state to up and having a job running on it, but I cannot remember what it was. Does anyone recall that? Brian Andrus

Re: [slurm-users] Store sstat information permanently on job completion?

2019-10-30 Thread Brian Andrus
Except sstat can give you the MaxRSS without having cgroups and it will give you a simple MaxRSS, whereas sacct provides a MaxRSS for every step... have to play with that data to get the high water mark grrr. I had tried to use sstat in an epilogue but apparently that is too late... Brian

Re: [slurm-users] RHEL8 support

2019-10-30 Thread Brian Andrus
ckages except pmix-devel. Haven't figured that one yet. Brian Andrus On 10/30/2019 11:18 AM, Christopher Benjamin Coffey wrote: Yes, I'd be interested too. Best, Chris

Re: [slurm-users] How to use a pyhon virtualenv with srun?

2019-11-17 Thread Brian Andrus
t actually sharing homes could be the cause. Brian Andrus On 11/17/2019 11:24 AM, Yann Bouteiller wrote: Hello, I am trying to do this on computecanada, which is managed by slurm: https://ray.readthedocs.io/en/latest/deploying-on-slurm.html However, on computecanada, you cannot inst

[slurm-users] nss_slurm not passing groups

2019-11-22 Thread Brian Andrus
, I get back 41 groups I am in. Bug? Brian Andrus

[slurm-users] Timeout and Epilogue

2019-12-04 Thread Brian Andrus
Quick question: Is the epilogue script run if a job exceeds its time limits and is being canceled? What about just cancelled? I need to be able to clean up some job-specific files regardless of how the job ends and I'm not sure epilogue is sufficient. Brian Andrus

Re: [slurm-users] Slurm 19-05-4-1 and Centos8

2019-12-08 Thread Brian Andrus
s have had the same issue and even add to comments in the bugs, but no responses/resolution for this have been posted. FWIW, I also see the issue with the latest slurm 20.05 pre1 code. Brian Andrus On 12/5/2019 11:46 PM, von St. Vieth, Benedikt wrote: Hi again, I answered this question on Oct 2

Re: [slurm-users] Slurm 19-05-4-1 and Centos8

2019-12-05 Thread Brian Andrus
Tim claims it works... I have compiled it, but when you try to run slurmd, it throws some errors and will not start. From a previous thread: While I can successfully build/run slurmctld, slurmd is failing because ALL of the SelectType libraries are missing symbols. Example from

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Brian Andrus
crickets.  I think in our case we were not able to ensure that the epilog always ran for different types of job failures, so we just had the users add some more cleanup code to the end of their jobs _and_ also run separate cleanup jobs. Regards, Alex On Wed, Dec 4, 2019 at 7:29 PM Brian Andrus

Re: [slurm-users] Partition question

2019-12-16 Thread Brian Andrus
depends on what best suits the specific needs. Brian Andrus On 12/16/2019 2:29 PM, Ransom, Geoffrey M. wrote: Hello    I am looking into switching from Univa (sge) to slurm and am figuring out how to implement some of our usage policy in slurm. We have a Univa queue which uses job classes

[slurm-users] cleanup script after timeout

2019-12-11 Thread Brian Andrus
a cleanup script run on jobs that have timed out? Brian Andrus

Re: [slurm-users] cleanup script after timeout

2019-12-11 Thread Brian Andrus
You prompted me to dig even deeper into my epilog. I was trying to access a semaphore file in the user's home directory. It seems that when the epilogue is run the ~ is not expanded in any way. So I can't even use ~${SLURM_JOB_USER} to access their semaphore file. Potentially problematic for
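One likely cause is that the shell performs tilde expansion before parameter expansion, so ~${SLURM_JOB_USER} is never resolved as a user name. A possible workaround, sketched here, is to look the home directory up in the passwd database instead. The helper name and the semaphore file name are invented for the example.

```shell
# Pull the home directory (6th field) out of a passwd-format line on stdin.
home_from_passwd_line() {
    cut -d: -f6
}

# Example use inside an epilog, where SLURM_JOB_USER is set.
# The semaphore file name is made up for the example:
#   home=$(getent passwd "$SLURM_JOB_USER" | home_from_passwd_line)
#   rm -f "$home/.job_semaphore"
```

Looking up a single user with getent should also work when enumeration is disabled in the directory service, since it does not need to list all accounts.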

Re: [slurm-users] RHEL8 support

2019-10-28 Thread Brian Andrus
-1.el8.x86_64.rpm slurm-slurmdbd-19.05.3-1.el8.x86_64.rpm slurm-torque-19.05.3-1.el8.x86_64.rpm Brian Andrus On 10/28/2019 2:32 AM, Benjamin Redling wrote: On 28/10/2019 08.26, Bjørn-Helge Mevik wrote: Taras Shapovalov writes: Do I understand correctly that Slurm19 is not compatible

[slurm-users] Sacct selecting jobs outside range

2019-10-16 Thread Brian Andrus
:34 2019-10-01T00:00:44 00:00:10 Brian Andrus

Re: [slurm-users] Execute scripts on suspend and cancel

2019-10-15 Thread Brian Andrus
handling until they have it as part of their app. Brian Andrus On 10/14/2019 4:40 AM, Oytun Peksel wrote: It is quite weird if slurm has no mechanism as described. I have been digging more into it and someone suggested a workaround using mail notifications. You use a script instead of the mail

Re: [slurm-users] Execute scripts on suspend and cancel

2019-10-16 Thread Brian Andrus
tun Peksel oytun.pek...@semcon.com +46739205917 From: slurm-users On Behalf Of Brian Andrus Sent: 15 October 2019 20:58 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Execute scripts on su

Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-24 Thread Brian Andrus
IIRC, the big difference is if you want to use cgroups on the nodes. You must use the cgroup plugin. Brian Andrus On 10/24/2019 3:54 PM, Christopher Benjamin Coffey wrote: Hi Juergen, From what I see so far, there is nothing missing from the jobacct_gather/linux plugin vs the cgroup

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-10-29 Thread Brian Andrus
I prefer building packages. I did have to extract and change the .spec file to accommodate some of the changes as well as set up the environment to complete. Brian On 10/29/2019 8:11 AM, Christopher Benjamin Coffey wrote: Brian, I've actually just started attempting to build slurm 19 on

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-10-28 Thread Brian Andrus
/libslurmfull.so|grep powercap_ 0010f7b8 T slurm_free_powercap_info_msg 00060060 T slurm_print_powercap_info_msg So, sure enough powercap_get_cluster_current_cap is not in there. Methinks the linking needs examined. Brian Andrus On 10/28/2019 2:32 AM, Benjamin Redling

Re: [slurm-users] How to create a partition where only one job can run concurrently?

2019-10-18 Thread Brian Andrus
. Brian Andrus On 10/18/2019 1:03 PM, bbenede...@goodyear.com wrote: Greetings! I am trying to set up a partition that will only allow one job at a time to run, regardless of who submits it. So multiple jobs from multiple users can be in the queue. But I only want the partition to run one

Re: [slurm-users] Environment modules

2019-11-24 Thread Brian Andrus
/openmpi), which forces only one version to be able to be loaded. I also set paths so specific versions of libraries become available depending on what environment you select (gcc vs intel for example). Is there something besides versioning that lmod shines at? Brian Andrus On 11/24/2019 12:48 AM

Re: [slurm-users] Filter slurm e-mail notification

2019-11-26 Thread Brian Andrus
server you use. The best solution, of course, is to educate the users. You could create a job_submit plugin that removes mail options for arrays, but you may negatively impact users that do need that. Brian Andrus On 11/25/2019 10:55 PM, ichebo...@univ.haifa.ac.il wrote: I meant on the admin

Re: [slurm-users] Filter slurm e-mail notification

2019-11-25 Thread Brian Andrus
FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array. Brian Andrus On 11/25/2019 1:48 AM, ichebo...@univ.haifa.ac.il wrote: Hi, I would like to ask if there is some options to configure the e-mail notification of slurm job

Re: [slurm-users] job priority keeping resources from being used?

2019-11-01 Thread Brian Andrus
Are you specifying memory for each of the jobs? Can't run a small job if there isn't enough memory available for it. Brian Andrus On 11/1/2019 7:42 AM, c b wrote: I have: SelectType=select/cons_res SelectTypeParameters=CR_CPU_Memory On Fri, Nov 1, 2019 at 10:39 AM Mark Hahn <mailt

Re: [slurm-users] job priority keeping resources from being used?

2019-11-01 Thread Brian Andrus
Brian Andrus wrote: Are you specifying memory for each of the jobs? Can't run a small job if there isn't enough memory available for it. Brian Andrus On 11/1/2019 7:42 AM, c b wrote: I have: SelectType=select/cons_res SelectTypeP

Re: [slurm-users] Limiting the number of CPU

2019-11-11 Thread Brian Andrus
You are trying to specifically run on node cn110, so you may want to check that out with sinfo. A quick "sinfo -R" can list any down machines and the reasons. Brian Andrus On 11/10/2019 11:23 PM, Sukman wrote: Hi Brian, I see. Thank you for your suggestion. I definitely will try i

[slurm-users] ResumeProgram not running

2019-10-10 Thread Brian Andrus
that are idle~ but no calls to the script. If I restart slurmctld, the backlog starts running and things work. Any ideas what could cause this? Brian Andrus

[slurm-users] nss_slurm and sudo

2019-12-09 Thread Brian Andrus
So it seems nss_slurm does not play well with sudo. If I connect to a box that uses it and try to use sudo, I get: sudo: PAM account management error: Authentication service cannot retrieve authentication info Has anyone else seen this? Is there a workaround? Brian Andrus

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Brian Andrus
Bright is not needed... for much of anything... On 2/25/2020 12:48 PM, Robert Kudyba wrote: I suppose I can ask Bright Computing but does anyone know what version of Bright is needed? I would guess 8.2 or 9.0. Definitely want to dive into this.

[slurm-users] Hybrid compiling options

2020-02-28 Thread Brian Andrus
on that are. Brian Andrus

Re: [slurm-users] Setup for backup slurmctld

2020-02-26 Thread Brian Andrus
I would say so. Certainly, if you have many nodes and/or many jobs being submitted, you will see an impact, but in my experience comparing Slurm to SGE, Slurm has much less overhead to cause as much impact. Brian Andrus On 2/26/2020 1:05 PM, Joshua Baker-LePain wrote: On Wed, 26 Feb 2020

Re: [slurm-users] Setup for backup slurmctld

2020-02-26 Thread Brian Andrus
easy to do. Just add the lines to your slurm.conf for the backup controller, start it up and reconfigure for all running nodes to be aware of it. Brian Andrus On 2/26/2020 12:48 PM, Joshua Baker-LePain wrote: We're planning the migration of our moderately sized cluster (~400 nodes, 40K jobs

Re: [slurm-users] problem running slurm

2020-02-07 Thread Brian Andrus
You're trying to run bash, which, without special configuration, needs a pty. Try: srun -v -p debug --pty bash Brian Andrus On 2/6/2020 10:28 PM, Hector Yuen wrote: Hello, I am setting up a very simple configuration: one node running slurmd and another one running slurmctld. In the slurmctld

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-10 Thread Brian Andrus
Usually means you updated the slurm.conf but have not done "scontrol reconfigure" yet. Brian Andrus On 2/10/2020 8:55 AM, Robert Kudyba wrote: We are using Bright Cluster 8.1 with and just upgraded to slurm-17.11.12. We're getting the below errors when I restart the slurmct

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Brian Andrus
ster generically, so their configs are not getting matched to the specific info in your main config Brian Andrus On 1/20/2020 10:37 AM, Robert Kudyba wrote: I've posted about this previously here <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Brian Andrus
Check the slurmd log file on the node. Ensure slurmd is still running. Sounds possible that OOM Killer or such may be killing slurmd Brian Andrus On 1/20/2020 1:12 PM, Dean Schulze wrote: If I restart slurmd the asterisk goes away.  Then I can run the job once and the asterisk is back

Re: [slurm-users] slurm elastic compute / power saving

2020-01-07 Thread Brian Andrus
I think we would need to see your SuspendScript to get a better idea of what is happening. That error indicates the nodes are likely not running slurmd and the control daemon thinks they are still up. What is the output of 'sinfo -R'? Brian Andrus On 1/7/2020 3:42 AM, Steve Brasier wrote

Re: [slurm-users] srun --reboot option is not working

2020-03-10 Thread Brian Andrus
. It could probably be worked around, but not in a simple way. Easier to upgrade to the newest release :) Brian Andrus On 3/9/2020 10:14 AM, MrBr @ GMail wrote: Hi Brian The nodes work with slurm without any issues till I try the "--reboot" option. I can successfully allocate the no

Re: [slurm-users] Munge decode failing on new node

2020-04-19 Thread Brian Andrus
the next uid on any node. The error below looks like you may have a different uid for the slurm user on the node. What uid is slurmd running as on the bad node vs a good node? Brian Andrus On 4/17/2020 2:38 PM, Dean Schulze wrote: Just noticed this.  On the problem node the munged.log file

Re: [slurm-users] Alternative to munge for use with slurm?

2020-04-20 Thread Brian Andrus
For CentOS/RHEL, it is in the OpenFusion repo: http://repo.openfusion.net/centos7-x86_64/ just: yum install http://repo.openfusion.net/centos7-x86_64/openfusion-release-0.7-1.of.el7.noarch.rpm then: yum install libjwt-devel Brian Andrus On 4/18/2020 2:27 PM, Daniel Letai wrote

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Brian Andrus
Maybe too obvious, but have you checked your .bashrc, .bash_profile and such? Brian Andrus On 5/12/2020 10:27 AM, Ellestad, Erik wrote: Which SLURM prolog specifically? I’m not finding that to work for me in either task-prolog or prolog. SLURM_TMPDIR and TMPDIR are still both set to /tmp

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread Brian Andrus
' from the node and verify it is able to talk to slurmctld from the node and verify slurmd started successfully. Brian Andrus On 3/9/2020 4:38 AM, MrBr @ GMail wrote: Hi all I'm trying to use the --reboot option of srun to reboot the nodes before allocation. However the nodes not been

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread Brian Andrus
normal users cannot use "--reboot" Brian Andrus On 3/9/2020 10:14 AM, MrBr @ GMail wrote: Hi Brian The nodes work with slurm without any issues till I try the "--reboot" option. I can successfully allocate the nodes or any other slurm related operation > You may want to dou

Re: [slurm-users] Slurmctld and log file

2020-09-08 Thread Brian Andrus
both. I do high debug to the journal and info to the log file. Brian Andrus On 9/8/2020 2:41 AM, Gestió Servidors wrote: Hello, I don’t know why, but my SLURM server (that is running fine) has its slurmdctl.log file with size 0 bytes... so... where is it writing logs? It seems that log file has

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Brian Andrus
Do you have your gres.conf on the nodes also? Brian Andrus On 10/8/2020 11:57 AM, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment

Re: [slurm-users] Limit nodes of a partition without managing users

2020-08-18 Thread Brian Andrus
they will wait a relatively shorter amount of time. There are numerous other factors you can use. If you have accounting and associations configured, you can manipulate it all the way to the association and qos. Brian Andrus On 8/17/2020 11:23 PM, Gerhard Strangar wrote: Brian Andrus wrote: Most likely, b
