Re: [slurm-users] SLURM PAM support?

2018-06-18 Thread Yair Yarom
Hi, We encountered this issue some time ago (see: https://www.mail-archive.com/slurm-dev@schedmd.com/msg06628.html). You need to add pam_systemd to the slurm pam file, but pam_systemd will try to take over slurm's cgroups. Our current solution is to add pam_systemd to the slurm pam file, but i

Re: [slurm-users] Slurmd could not set up environment for batch job

2018-06-20 Thread Yair Yarom
Hi, I haven't encountered this specific error (it probably means some permissions issues somewhere), but my first try will be to look at the slurmd log file. You can also update the SlurmdDebug (and maybe SlurmdLogFile) config option to get more information. Yair. On Tue, Jun 19, 2018 at 6:1

Re: [slurm-users] Jobs blocking scheduling progress

2018-07-04 Thread Yair Yarom
Hi, As Paul mentioned, we once encountered a starvation issue with the backfill algorithm and since set up the bf_window to match the maximum running time of all the partitions. This could be the case here. Also make sure that indeed the jobs can run on the non-gpu nodes (we constantly encounter
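
A minimal sketch of the bf_window idea above, assuming the partition time limits are already known as Slurm time strings. The partition names and limits below are hypothetical, and the helper names are made up for illustration; bf_window itself is given in minutes inside SchedulerParameters. Limits such as UNLIMITED are not handled here.

```python
def slurm_time_to_minutes(limit: str) -> int:
    """Convert a Slurm time string to whole minutes, rounding seconds up.

    Accepted forms: minutes, minutes:seconds, hours:minutes:seconds,
    days-hours, days-hours:minutes, days-hours:minutes:seconds.
    """
    days = 0
    if "-" in limit:
        day_part, rest = limit.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        # With a day prefix the fields are hours[:minutes[:seconds]].
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in limit.split(":")]
        if len(parts) == 1:        # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:                      # hours:minutes:seconds
            hours, minutes, seconds = parts
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return -(-total_seconds // 60)  # ceiling division


def bf_window_for(partition_limits: dict) -> int:
    """Return the longest partition time limit, in minutes."""
    return max(slurm_time_to_minutes(v) for v in partition_limits.values())


# Hypothetical partitions; real values would come from slurm.conf.
limits = {"short": "02:00:00", "gpu": "2-12", "long": "7-00:00:00"}
print(f"SchedulerParameters=bf_window={bf_window_for(limits)}")
```

The printed line is only a candidate for slurm.conf; other SchedulerParameters values already in use would need to be kept alongside it.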

Re: [slurm-users] Transparently assign different walltime limit to a group of nodes ?

2018-08-13 Thread Yair Yarom
Hi, We have a short partition to give a reasonable waiting time for shorter jobs. We use the job_submit/all_partitions plugin so if a user doesn't specify a partition, it will add all the partitions. The downside of the plugin is that if a job is too long for the short partition (or the job can't

Re: [slurm-users] Setting up a separate timeout for interactive jobs

2018-09-20 Thread Yair Yarom
Hi, We also have multiple partitions, but in addition we use a job submit plugin to distinguish between srun/salloc and sbatch submissions. This plugin forces a specific partition for interactive jobs (and the timelimit with it) and using the license system it limits the number of simultaneous int

Re: [slurm-users] Accounting: set default account with no access

2018-11-06 Thread Yair Yarom
Hi, You can set the maxsubmitjob=0 on that default account. That should prevent anyone from using it, but it won't have a specific message like with the lua plugin. E.g. sacctmgr update account default set maxsubmitjob=0 Regards, Yair. On Tue, Nov 6, 2018 at 12:58 AM Renfro, Michael wrote:

Re: [slurm-users] strigger on CG, completing state

2019-05-29 Thread Yair Yarom
r nodes stuck in CG (completing) state? Some user jobs, mostly > Julia notebook can get hung in completing state if the user kills the > running job or cancels it with cntrl. When this happens we can have many > many nodes stuck in CG. Slurm 17.02.6. Thanks! > > > > -

Re: [slurm-users] Environment modules

2019-11-24 Thread Yair Yarom
software then it can take care >> of the modules too for you. I'd also echo the recommendation from >> others to use Lmod. >> >> Website: https://easybuilders.github.io/easybuild/ >> Documentation: https://easybuild.readthedocs.io/ >> >> All the best,

Re: [slurm-users] good practices

2019-11-25 Thread Yair Yarom
Hi, I'm not sure what queue time limit of 10 hours is. If you can't have jobs waiting for more than 10 hours, then it seems to be very small for 8-hour jobs. Generally, a few options: a. The --dependency option (either afterok or singleton) b. The --array option of sbatch with limit of 1 job at a
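
A rough sketch of options (a) and (b), assuming a long computation has been split into segments that must run one after another. The script name segment.sh, the job name, and the helper functions are hypothetical; only the sbatch flags themselves (--job-name, --dependency=singleton, --array with a %1 limit) come from Slurm. The afterok variant is not shown because it needs the previous job's ID, as in --dependency=afterok:<jobid>.

```python
import shlex


def chain_with_singleton(script: str, job_name: str, copies: int) -> list:
    """Option (a): identical job names plus --dependency=singleton,
    so only one job with that name (per user) runs at a time."""
    return [
        ["sbatch", f"--job-name={job_name}", "--dependency=singleton", script]
        for _ in range(copies)
    ]


def chain_with_array(script: str, copies: int) -> list:
    """Option (b): a job array limited to one running task at a time (%1)."""
    return ["sbatch", f"--array=0-{copies - 1}%1", script]


def render(cmd: list) -> str:
    """Quote a command list so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    for cmd in chain_with_singleton("segment.sh", "longrun", 3):
        print(render(cmd))
    print(render(chain_with_array("segment.sh", 3)))
```

The functions only build the command lines; whether to run them with subprocess or paste them into a shell is left open.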

Re: [slurm-users] Problem with squeue reporting of GPUs in use

2020-02-25 Thread Yair Yarom
Hi, I've also encountered this issue of the deprecated %b. I'm currently parsing the output of "scontrol show jobs -dd" to see what was requested (and which exact GPUs were allocated). Hope this helps, Yair. On Mon, Feb 24, 2020 at 11:56 PM Venable, Richard (NIH/NHLBI) [E] < venab...@nhlbi.n

Re: [slurm-users] Hybrid compiling options

2020-03-01 Thread Yair Yarom
Hi, We also have hybrid cluster(s). We use the same nfsroot for all nodes, so technically everything is installed everywhere. And we compile slurm once with everything needed. Users can run "module load cuda" and/or "module load nvidia" to have access to nvcc and nvidia's libraries (cuda and nvid

Re: [slurm-users] Slurm Perl API use and examples

2020-03-24 Thread Yair Yarom
I also haven't got along with the Perl API shipped with slurm. I got it to work, but there were things missing. Currently I have some wrapper functions for most of slurm commands, and a general parsing function to slurm's common outputs (of scontrol, sacctmgr, etc.). Not in CPAN, but you can see it

[slurm-users] Job with srun is still RUNNING after node reboot

2020-03-31 Thread Yair Yarom
Hi, We have an issue where, after running srun (with --pty zsh) and rebooting the node (from a different shell), srun reports: srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]: Zero Bytes were transmitted or received and hangs. After the node boots, slurm claims that jo

Re: [slurm-users] Job with srun is still RUNNING after node reboot

2020-04-01 Thread Yair Yarom
I've checked it now, it isn't listed as a runaway job. On Tue, Mar 31, 2020 at 5:24 PM David Rhey wrote: > Hi, Yair, > > Out of curiosity have you checked to see if this is a runaway job? > > David > > On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom wrote: > &

Re: [slurm-users] [External] How to detect Job submission by srun / interactive jobs

2020-05-19 Thread Yair Yarom
ecific partitions. > > Any recommendation or best practice on how to handle interactive jobs is > welcome. > > Thank you, > Stephan > > -- /| | \/ | Yair Yarom | Senior DevOps Architect [] | The Rachel and Selim Benin School [] /\| of Computer Science and Engineering []//\\/ | The Hebrew University of Jerusalem [// \\ | T +972-2-5494522 | F +972-2-5494522 //\ | ir...@cs.huji.ac.il //|

[slurm-users] Backfill fails to start jobs (when preemptable QOS is involved)

2020-11-15 Thread Yair Yarom
Hi list, We have GrpTRES limits on all accounts which causes a lot of higher priority jobs to stay in the queue due to limits. As such we rely heavily on the backfill scheduler. We also have a special lower priority preemptable QOS with no limits. We've noticed that when the cluster is loaded, se

[slurm-users] NoDecay on accounts (or on GrpTRESMins in general)

2020-11-17 Thread Yair Yarom
Hi all, We have around 50 accounts, each has its own GrpTRES limits. We want to add another set of accounts (probably another 50) with different priority which will have GrpTRESMins, such that users could "buy" TRES*minutes with higher priority. For that we require that the GrpTRESMins won't get

Re: [slurm-users] NoDecay on accounts (or on GrpTRESMins in general)

2020-11-23 Thread Yair Yarom
On Fri, Nov 20, 2020 at 12:11 AM Sebastian T Smith wrote: > Hi, > > We're setting GrpTRESMins on the account association and have NoDecay > QOS's for different user classes. All user associations with a > GrpTRESMins-limited account are assigned a NoDecay QOS. I'm not sure if > it's a better ap

Re: [slurm-users] how do slurm schedule health check when setting "HealthCheckNodeState=CYCLE"

2020-12-02 Thread Yair Yarom
Hi, We also noticed this. We eventually placed the max time on the HealthCheckInterval (65535), and created a systemd.timer which runs the scripts externally of slurm, with proper intervals and randomized delays. Yair. On Wed, Dec 2, 2020 at 9:03 AM wrote: > Hello, > > > > Our slurm cluste

Re: [slurm-users] Using "Environment Modules" in a SLURM script

2021-01-24 Thread Yair Yarom
several examples I have read about SLURM scripts, nobody comments > > that. So, have I forgotten a parameter in SLURM to "capture" > > environment variables into the script or is it a problem due to my > > distribution (CentOS-7)??? > > > > Thanks. > > > -

Re: [slurm-users] Job flexibility with cons_tres

2021-02-09 Thread Yair Yarom
-around? > > Thanks, > > A. > > -- > Ansgar Esztermann > Sysadmin Dep. Theoretical and Computational Biophysics > http://www.mpibpc.mpg.de/grubmueller/esztermann > -- /| | \/ | Yair Yarom | System Group (DevOps) [] | The Rachel and Sel

Re: [slurm-users] slurmd running on IBM Power9 systems

2021-06-27 Thread Yair Yarom
l available frequencies not scanned > > > Any idea how I can fix this problem? > > Regards, > Karl > > -- /| | \/ | Yair Yarom | System Group (DevOps) [] | The Rachel and Selim Benin School [] /\| of Computer Science and Engineering []//\\

[slurm-users] Long term archiving

2021-06-28 Thread Yair Yarom
don't load the archives into a secondary db. We now have a use-case which might require us to save job information for more than that, and we're considering how to do that. Thanks in advance, -- /| | \/ | Yair Yarom | System Group (DevOps) [] | The Rachel an

Re: [slurm-users] Long term archiving

2021-06-29 Thread Yair Yarom
. The archive data itself is available for > reimport and historical investigation. We've done this when importing > historical data into XDMod. > > -Paul Edmon- > On 6/28/2021 10:43 AM, Yair Yarom wrote: > > Hi list, > > I was wondering if you could share your

Re: [slurm-users] Slurm Multi-cluster implementation

2021-10-31 Thread Yair Yarom
Hi , >> > > >> > > I am looking for a stepwise guide to setup multi cluster >> > implementation. >> > > We wanted to set up 3 clusters and one Login Node to run the job >> > using >> > > -M cluster option.

Re: [slurm-users] Slurm Multi-cluster implementation

2021-11-01 Thread Yair Yarom
2021 at 6:36 PM Brian Andrus wrote: > That is interesting to me. > > How do you use ulimit and systemd to limit user usage on the login nodes? > This sounds like something very useful. > > Brian Andrus > On 10/31/2021 1:08 AM, Yair Yarom wrote: > > Hi, > > If it hel

Re: [slurm-users] WTERMSIG 15

2021-11-29 Thread Yair Yarom
> 4711 xterm43Gn 2021-11-29T14:47:09 > FAILED 00:20:08 > > 4711.batchbatch43Gn 2021-11-29T14:47:09 > CANCELLED 00:20:08 37208K > > 4711.extern extern43Gn 2021-11-29T14:47:09 > COMPLETED 00:2

Re: [slurm-users] WTERMSIG 15

2021-12-01 Thread Yair Yarom
md/system/slurmd.service > > KillMode=process > > > > Instead of (for ubuntu nodes) > > KillMode=control-group > > > > *De :* slurm-users *De la part de* > Yair Yarom > *Envoyé :* mardi 30 novembre 2021 08:50 > *À :* Slurm User Community List > *Obj

Re: [slurm-users] Kernel keyrings on Slurm node inside Slurm job

2022-08-24 Thread Yair Yarom
ingMode=" in systemd.exec man page, but that didn't really help me. > > Can you explain to me how it would be possible to get "private" keyrings > inside a Slurm job on the executing node? > > thx > Matthias > > -- /| | \/ | Yair Yaro

Re: [slurm-users] Kernel keyrings on Slurm node inside Slurm job

2022-08-25 Thread Yair Yarom
der deprecated > > functionality". > > > > https://bugs.schedmd.com/show_bug.cgi?id=4098 > > Warning: Do NOT configure UsePAM=1 in slurm.conf (this advice can be found > on the net). See > > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prologflags &g

Re: [slurm-users] gres.conf and select/cons_res plugin

2022-09-14 Thread Yair Yarom
> > > On my current (inherited) Slurm cluster we have: > > > >SelectType=select/cons_res > > > > but users are primarily using GPU resources, so I know Gres is working. > > Why then is select/cons_tres required? > > -- /| | \/ | Ya

Re: [slurm-users] NVIDIA MIG question

2022-11-16 Thread Yair Yarom
which > would mean 14 different jobs could run at the same time using their own > particular MIG, right? > > Do those questions make sense to anyone? 🙂 > > Rob > > > -- /| | \/ | Yair Yarom | System Group (DevOps) [] | The Rachel and

Re: [slurm-users] NVIDIA MIG question

2022-11-17 Thread Yair Yarom
0 --account=1gc5gb > --partition=sla-prio > salloc: Job allocation 5015 has been revoked. > salloc: error: Job submit/allocate failed: Requested node configuration is > not available > > > Rob > > -- > *From:* slurm-users on behalf of > Ya

Re: [slurm-users] Regarding Multi-Cluster Accounting Information

2023-03-15 Thread Yair Yarom
>> it should be different like: >> sacctmgr add user user1 account=alpha_grp cluster=Alpha >> sacctmgr add user1 account=beta_grp cluster=Beta >> >> Please let me know in case of any additional information. >> >> Regards, >> Shaghuf Rahman >>

Re: [slurm-users] Regarding Multi-Cluster Accounting Information

2023-03-16 Thread Yair Yarom
rt is quite handy. On Wed, 15 Mar 2023 at 14:55, Shaghuf Rahman wrote: > Hi Yair, > > Thank you for clarification. > > Could you please tell me which way is better for accounting related > reports. > > Thanks & Regards, > Shaghuf > On Wed, 15 Mar 2023 at 15:0

Re: [slurm-users] Mixing GPU Types on Same Node

2023-04-02 Thread Yair Yarom
-powerful >if people don’t care)? > > > > Any help and/or advice here is much appreciated. Slurm has been amazing > for our lab (albeit challenging to setup at first) and I want to get > everything dialed before I graduate :D . > > > > Thanks, > > -Co

[slurm-users] lmod and slurm

2017-12-19 Thread Yair Yarom
Hi list, We use here lmod[1] for some software/version management. There are two issues encountered (so far): 1. The submission node can have different software than the execution nodes - different cpu, different gpu (if any), infiniband, etc. When a user runs 'module load something' on th

Re: [slurm-users] lmod and slurm

2017-12-19 Thread Yair Yarom
hance to run the "module add ${SLURM_CONSTRAINT}" or remove the unwanted modules that were loaded (maybe automatically) on the submission node and aren't working on the execution node. Thanks, Yair. On Tue, Dec 19 2017, "Loris Bennett" wrote: > Hi Yair, >

Re: [slurm-users] lmod and slurm

2017-12-19 Thread Yair Yarom
NE myprogram > > > >> On Dec 19, 2017, at 8:37 AM, Yair Yarom wrote: >> >> >> Thanks for your reply, >> >> The problem is that users are running on the submission node e.g. >> >> module load tensorflow >> srun myprogram >>

Re: [slurm-users] lmod and slurm

2017-12-20 Thread Yair Yarom
nd > he > goes away a bit happier, and wiser. > > Plan to work with your users and be prepared to train them on nuance. > > Gerry > > On Tue, Dec 19, 2017 at 9:33 AM, Loris Bennett > wrote: > > Yair Yarom writes: > > > There are two issu

Re: [slurm-users] Mixed x86 and ARM cluster

2018-01-07 Thread Yair Yarom
Hi, We have here a linux x86 submission node for power8 compute nodes, where the slurmctld and slurmdbd are running on an altogether different freebsd x86 machine. So yes, it should work :) Just make sure all the daemons are the same version, and take note of where the monitoring and maintena

[slurm-users] GrpTRES value changes on upgrade from 17.02.1 to 17.11.2

2018-01-28 Thread Yair Yarom
Hi, We have a license, limited using the GrpTRES of an association (this is a "license/interactive" for https://github.com/irush-cs/slurm-plugins/). On upgrade to 17.11.2, I've noticed that all our "license/interactive" GrpTRES were changed to "billing". Judging by the current tres_table and ou

Re: [slurm-users] Is QOS always inherited explicitly?

2018-02-07 Thread Yair Yarom
Hi, From my experience - yes, new associations will be associated with the QOS of the account. I believe it doesn't explicitly modify all the associations, just notifies you which associations will be affected. Looking at my database suggests that indeed most associations don't have explicit

[slurm-users] Should I join the federation?

2018-02-12 Thread Yair Yarom
Hi all, I was wondering if any of you can share your insights regarding federations. What unexpected caveats have you encountered? We have here about 15 "small" clusters (due to political and technical reasons), and most users have access to more than one cluster. Federation seems like a g

Re: [slurm-users] Free Gres resources

2018-02-13 Thread Yair Yarom
Hi, I haven't found a direct way. Here I have my own script that parses the output of "scontrol show node" and "scontrol show job", summing up and displaying the allocated gres. Yair. On Tue, Feb 13 2018, Nadav Toledo wrote: > Hello everyone, > > Does anyone know of way to get amount of i
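
A minimal sketch of this kind of summing script, not the script mentioned above. It assumes node records that carry Gres= and GresUsed= fields, roughly as printed by "scontrol -d show node"; the exact field names and layout vary between Slurm versions, and the SAMPLE text is invented for illustration rather than captured from a real cluster.

```python
import re
from collections import Counter

# Hypothetical excerpt; real output has many more fields per node.
SAMPLE = """\
NodeName=node01 Arch=x86_64
   Gres=gpu:4
   GresUsed=gpu:2(IDX:0-1)
NodeName=node02 Arch=x86_64
   Gres=gpu:tesla:2
   GresUsed=gpu:tesla:1(IDX:0)
"""


def parse_gres(value: str) -> Counter:
    """Turn 'gpu:tesla:2(IDX:0-1),mic:0' into Counter({'gpu': 2, 'mic': 0})."""
    counts = Counter()
    # Drop '(IDX:...)' and '(null)' first, since index lists may contain commas.
    cleaned = re.sub(r"\(.*?\)", "", value)
    for item in cleaned.split(","):
        fields = item.split(":")
        if len(fields) < 2 or not fields[-1].isdigit():
            continue  # skip entries without a plain trailing count
        counts[fields[0]] += int(fields[-1])
    return counts


def gres_summary(text: str) -> dict:
    """Return {gres_name: (used, total)} summed over all node records."""
    total, used = Counter(), Counter()
    for key, value in re.findall(r"\b(Gres|GresUsed)=(\S+)", text):
        (used if key == "GresUsed" else total).update(parse_gres(value))
    return {name: (used[name], total[name]) for name in total}


if __name__ == "__main__":
    for name, (in_use, configured) in gres_summary(SAMPLE).items():
        print(f"{name}: {in_use}/{configured} allocated")
```

On a live system the text would come from running scontrol and capturing its output; counts with suffixes other than plain integers are skipped by this sketch.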

Re: [slurm-users] GPU allocation problems

2018-03-12 Thread Yair Yarom
Hi, This is just a guess, but there's also a cgroup.conf file where you might need to add: ConstrainDevices=yes see: https://slurm.schedmd.com/cgroup.conf.html for more details. HTH, Yair. On Mon, Mar 12 2018, Sefa Arslan wrote: > Dear all, > > We have upgraded our cluster from 13 to s

Re: [slurm-users] Limit job_submit.lua script for only srun

2018-04-25 Thread Yair Yarom
Hi, We are also limiting "interactive" jobs through a plugin. What I've found is that in the job_descriptor the following holds: for salloc: argc = 0, script = NULL for srun: argc > 0, script = NULL for sbatch: argc = 0, script != NULL You can look at our plugin in https://github.com/irush-cs/slu
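
The decision table above can be written out as a small function. This is only an illustration in Python of the argc/script observation; a real job submit plugin would perform the same check in C or Lua against the job descriptor, and the function name here is made up.

```python
from typing import Optional


def submission_kind(argc: int, script: Optional[str]) -> str:
    """Guess the submitting command from two job descriptor fields.

    sbatch: script is set (argc = 0)
    srun:   no script, argc > 0
    salloc: no script, argc = 0
    """
    if script is not None:
        return "sbatch"
    return "srun" if argc > 0 else "salloc"


if __name__ == "__main__":
    print(submission_kind(0, "#!/bin/bash\nhostname\n"))
    print(submission_kind(2, None))
    print(submission_kind(0, None))
```

A plugin limiting interactive jobs would treat the "srun" and "salloc" results as interactive and leave "sbatch" submissions alone.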

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Yair Yarom
Hi, This is what we did, not sure those are the best solutions :) ## Queue stuffing We have set PriorityWeightAge several orders of magnitude lower than PriorityWeightFairshare, and we also have PriorityMaxAge set to cap the age of older jobs. As I see it, the fairshare is far more important than age. Besides t

Re: [slurm-users] run bash script in spank plugin

2018-05-31 Thread Yair Yarom
Hi, I'm not sure how slurm/spank handles child processes but this might be intentional. So there might be some issues if this were to work. You can try instead of calling system(), to use fork() + exec(). If that still doesn't work, try calling setsid() before the exec(). I can think of situation

Re: [slurm-users] run bash script in spank plugin

2018-06-05 Thread Yair Yarom
gt;>>> job sleep 6secondes and reboot my machine (i test with reboot command, but >>>>> we can make other bash command, it's just example) >>>>> >>>>> pid_t cpid; //process id's and process groups >>>>> >>>>>