Hi,
We encountered this issue some time ago (see:
https://www.mail-archive.com/slurm-dev@schedmd.com/msg06628.html). You
need to add pam_systemd to the slurm pam file, but pam_systemd will
try to take over slurm's cgroups. Our current solution is to add
pam_systemd to the slurm pam file, but i
Hi,
I haven't encountered this specific error (it probably indicates a
permissions issue somewhere), but my first step would be to look at the
slurmd log file. You can also update the SlurmdDebug (and maybe
SlurmdLogFile) config options to get more information.
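For illustration, raising the slurmd verbosity might look like this in slurm.conf (a sketch; the log path is a site-specific assumption, option names are from the slurm.conf man page):

```
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
```

After changing slurm.conf, the nodes need a `scontrol reconfigure` (or slurmd restart) to pick it up.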
Yair.
On Tue, Jun 19, 2018 at 6:1
Hi,
As Paul mentioned, we once encountered a starvation issue with the
backfill algorithm and have since set bf_window to match the maximum
running time across all the partitions. This could be the case here.
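For illustration, if the longest partition TimeLimit were 14 days, matching bf_window (which is in minutes) might look like this in slurm.conf (the value is an assumption for the example, not our actual config):

```
# 14 days * 24 hours * 60 minutes = 20160
SchedulerParameters=bf_window=20160
```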
Also make sure that the jobs can indeed run on the non-gpu nodes (we
constantly encounter
Hi,
We have a short partition to give a reasonable waiting time for shorter
jobs. We use the job_submit/all_partitions plugin, so that if a user
doesn't specify a partition, the job is submitted to all partitions.
The downside of the plugin is that if a job is too long for the short
partition (or the job can't
Hi,
We also have multiple partitions, but in addition we use a job submit
plugin to distinguish between srun/salloc and sbatch submissions. This
plugin forces a specific partition for interactive jobs (and its time
limit) and, using the license system, limits the number of simultaneous
int
Hi,
You can set MaxSubmitJobs=0 on that default account. That should prevent
anyone from using it, but it won't show a specific message like the
lua plugin does. E.g.
sacctmgr update account default set MaxSubmitJobs=0
Regards,
Yair.
On Tue, Nov 6, 2018 at 12:58 AM Renfro, Michael wrote:
r nodes stuck in CG (completing) state? Some user jobs, mostly
> Julia notebook can get hung in the completing state if the user kills the
> running job or cancels it with ctrl-c. When this happens we can have many
> many nodes stuck in CG. Slurm 17.02.6. Thanks!
software then it can take care
>> of the modules too for you. I'd also echo the recommendation from
>> others to use Lmod.
>>
>> Website: https://easybuilders.github.io/easybuild/
>> Documentation: https://easybuild.readthedocs.io/
>>
>> All the best,
Hi,
I'm not sure what a queue time limit of 10 hours means. If jobs can't
wait for more than 10 hours, that seems very short for 8-hour jobs.
Generally, a few options:
a. The --dependency option (either afterok or singleton)
b. The --array option of sbatch with a limit of 1 job at a time
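A sketch of both options (job.sh, the job id, and the array bounds are placeholders):

```
# (a) run a job only after a previous one finished successfully
sbatch job.sh                          # prints e.g. "Submitted batch job 1234"
sbatch --dependency=afterok:1234 job.sh
# or: give all chained jobs the same name and use singleton
sbatch --job-name=chain --dependency=singleton job.sh

# (b) an array throttled to one concurrent task with the %1 suffix
sbatch --array=1-12%1 job.sh
```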
Hi,
I've also encountered this issue of the deprecated %b. I'm currently
parsing the output of "scontrol show jobs -dd" to see what was requested
(and which exact GPUs were allocated).
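As a sketch (not the actual script), the IDX field can be pulled out of the -dd output with a small parser; the sample line below is an assumption about the output format, based on our cluster:

```python
import re

def gpu_indices(scontrol_dd_output):
    """Parse 'scontrol show job -dd' text and return {node: [gpu indices]}.
    Assumes each allocation line holds Nodes=... and GRES=...(IDX:...)."""
    alloc = {}
    for m in re.finditer(r"Nodes=(\S+).*?GRES=gpu[^(]*\(IDX:([\d,\-]+)\)",
                         scontrol_dd_output):
        node, idx = m.group(1), m.group(2)
        indices = []
        for part in idx.split(","):          # e.g. "0,2-3"
            if "-" in part:
                lo, hi = map(int, part.split("-"))
                indices.extend(range(lo, hi + 1))
            else:
                indices.append(int(part))
        alloc[node] = indices
    return alloc

sample = "   Nodes=gpu-node01 CPU_IDs=0-3 Mem=8192 GRES=gpu:v100:2(IDX:0,2-3)"
print(gpu_indices(sample))  # {'gpu-node01': [0, 2, 3]}
```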
Hope this helps,
Yair.
On Mon, Feb 24, 2020 at 11:56 PM Venable, Richard (NIH/NHLBI) [E] <
venab...@nhlbi.n
Hi,
We also have hybrid cluster(s).
We use the same nfsroot for all nodes, so technically everything is
installed everywhere. And we compile slurm once with everything needed.
Users can run "module load cuda" and/or "module load nvidia" to have access
to nvcc and nvidia's libraries (cuda and nvid
I also haven't gotten along with the Perl API shipped with slurm. I got it
to work, but there were things missing.
Currently I have wrapper functions for most of the slurm commands, and a
general parsing function for slurm's common output formats (of scontrol,
sacctmgr, etc.).
Not in CPAN, but you can see it
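Not the Perl code itself, but the core of such a generic parser is small; a Python sketch of the idea (whitespace-separated Key=Value records, as scontrol prints them):

```python
def parse_scontrol(line):
    """Split one scontrol-style 'Key=Value Key=Value ...' record into a dict.
    Simplified: values that themselves contain spaces (e.g. Comment=) would
    need smarter tokenizing."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

rec = parse_scontrol("JobId=4711 JobName=xterm Partition=short NumNodes=1")
print(rec["JobId"], rec["Partition"])  # 4711 short
```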
Hi,
We have an issue where, when running srun (with --pty zsh) and rebooting
the node (from a different shell), srun reports:
srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]:
Zero Bytes were transmitted or received
and hangs.
After the node boots, slurm claims that jo
I've checked it now, it isn't listed as a runaway job.
On Tue, Mar 31, 2020 at 5:24 PM David Rhey wrote:
> Hi, Yair,
>
> Out of curiosity have you checked to see if this is a runaway job?
>
> David
>
> On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom wrote:
>
ecific partitions.
>
> Any recommendation or best practice on how to handle interactive jobs is
> welcome.
>
> Thank you,
> Stephan
>
>
--
/| |
\/ | Yair Yarom | Senior DevOps Architect
[] | The Rachel and Selim Benin School
[] /\| of Computer Science and Engineering
[]//\\/ | The Hebrew University of Jerusalem
[// \\ | T +972-2-5494522 | F +972-2-5494522
//\ | ir...@cs.huji.ac.il
//|
Hi list,
We have GrpTRES limits on all accounts which causes a lot of higher
priority jobs to stay in the queue due to limits. As such we rely heavily
on the backfill scheduler. We also have a special lower priority
preemptable QOS with no limits.
We've noticed that when the cluster is loaded, se
Hi all,
We have around 50 accounts, each has its own GrpTRES limits. We want to add
another set of accounts (probably another 50) with different priority which
will have GrpTRESMins, such that users could "buy" TRES*minutes with higher
priority.
For that we require that the GrpTRESMins won't get
On Fri, Nov 20, 2020 at 12:11 AM Sebastian T Smith wrote:
> Hi,
>
> We're setting GrpTRESMins on the account association and have NoDecay
> QOS's for different user classes. All user associations with a
> GrpTRESMins-limited account are assigned a NoDecay QOS. I'm not sure if
> it's a better ap
Hi,
We also noticed this. We eventually set HealthCheckInterval to its
maximum (65535), and created a systemd.timer which runs the
scripts outside of slurm, with proper intervals and randomized delays.
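An illustrative timer unit (the unit name and intervals are made up, not our actual files; a matching .service unit runs the check script):

```
# /etc/systemd/system/node-health.timer
[Unit]
Description=Periodic node health check (outside slurmd)

[Timer]
OnCalendar=*:0/10
RandomizedDelaySec=120

[Install]
WantedBy=timers.target
```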
Yair.
On Wed, Dec 2, 2020 at 9:03 AM wrote:
> Hello,
>
>
>
> Our slurm cluste
several examples I have read about SLURM scripts, nobody comments
> > that. So, have I forgotten a parameter in SLURM to "capture"
> > environment variables into the script or is it a problem due to my
> > distribution (CentOS-7)???
> >
> > Thanks.
>
-around?
>
> Thanks,
>
> A.
>
> --
> Ansgar Esztermann
> Sysadmin Dep. Theoretical and Computational Biophysics
> http://www.mpibpc.mpg.de/grubmueller/esztermann
>
l available frequencies not scanned
>
>
> Any idea how I can fix this problem?
>
> Regards,
> Karl
>
>
don't load the
archives into a secondary db.
We now have a use-case which might require us to save job information for
more than that, and we're considering how to do that.
Thanks in advance,
. The archive data itself is available for
> reimport and historical investigation. We've done this when importing
> historical data into XDMod.
>
> -Paul Edmon-
> On 6/28/2021 10:43 AM, Yair Yarom wrote:
>
> Hi list,
>
> I was wondering if you could share your
Hi,
>> > >
>> > > I am looking for a stepwise guide to setup multi cluster
>> > implementation.
>> > > We wanted to set up 3 clusters and one Login Node to run the job
>> > using
>> > > -M cluster option.
2021 at 6:36 PM Brian Andrus wrote:
> That is interesting to me.
>
> How do you use ulimit and systemd to limit user usage on the login nodes?
> This sounds like something very useful.
>
> Brian Andrus
> On 10/31/2021 1:08 AM, Yair Yarom wrote:
>
> Hi,
>
> If it hel
> 4711         xterm   43Gn  2021-11-29T14:47:09     FAILED  00:20:08
>
> 4711.batch   batch   43Gn  2021-11-29T14:47:09  CANCELLED  00:20:08  37208K
>
> 4711.extern  extern  43Gn  2021-11-29T14:47:09  COMPLETED  00:2
md/system/slurmd.service
>
> KillMode=process
>
>
>
> Instead of (for ubuntu nodes)
>
> KillMode=control-group
>
>
>
> *From:* slurm-users *on behalf of*
> Yair Yarom
> *Sent:* Tuesday, November 30, 2021 08:50
> *To:* Slurm User Community List
> *Subj
ingMode=" in systemd.exec man page, but that didn't really help me.
>
> Can you explain to me how it would be possible to get "private" keyrings
> inside a Slurm job on the executing node?
>
> thx
> Matthias
>
>
der deprecated
> > functionality".
> >
> > https://bugs.schedmd.com/show_bug.cgi?id=4098
>
> Warning: Do NOT configure UsePAM=1 in slurm.conf (this advice can be found
> on the net). See
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prologflags
> > On my current (inherited) Slurm cluster we have:
> >
> >SelectType=select/cons_res
> >
> > but users are primarily using GPU resources, so I know Gres is working.
> > Why then is select/cons_tres required?
>
>
which
> would mean 14 different jobs could run at the same time using their own
> particular MIG, right?
>
> Do those questions make sense to anyone? 🙂
>
> Rob
>
>
>
0 --account=1gc5gb
> --partition=sla-prio
> salloc: Job allocation 5015 has been revoked.
> salloc: error: Job submit/allocate failed: Requested node configuration is
> not available
>
>
> Rob
>
> --
> *From:* slurm-users on behalf of
> Ya
>> it should be different like:
>> sacctmgr add user user1 account=alpha_grp cluster=Alpha
>> sacctmgr add user user1 account=beta_grp cluster=Beta
>>
>> Please let me know in case of any additional information.
>>
>> Regards,
>> Shaghuf Rahman
>>
rt is quite
handy.
On Wed, 15 Mar 2023 at 14:55, Shaghuf Rahman wrote:
> Hi Yair,
>
> Thank you for clarification.
>
> Could you please tell me which way is better for accounting related
> reports.
>
> Thanks & Regards,
> Shaghuf
> On Wed, 15 Mar 2023 at 15:0
-powerful
>if people don’t care)?
>
>
>
> Any help and/or advice here is much appreciated. Slurm has been amazing
> for our lab (albeit challenging to setup at first) and I want to get
> everything dialed before I graduate :D .
>
>
>
> Thanks,
>
> -Co
Hi list,
We use lmod[1] here for software/version management. There are two
issues we've encountered (so far):
1. The submission node can have different software than the execution
nodes - different cpu, different gpu (if any), infiniband, etc. When
a user runs 'module load something' on th
hance to run the "module add ${SLURM_CONSTRAINT}" or
remove the unwanted modules that were loaded (maybe automatically) on
the submission node and aren't working on the execution node.
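For instance, a job script could load the stack matching the node it actually landed on (a sketch; the constraint name and program are placeholders, and it assumes a module exists per constraint):

```
#!/bin/bash
#SBATCH --constraint=avx512
module purge                      # drop whatever the submission node loaded
module add "${SLURM_CONSTRAINT}"  # the "module add" mentioned above
srun myprogram
```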
Thanks,
Yair.
On Tue, Dec 19 2017, "Loris Bennett" wrote:
> Hi Yair,
>
NE myprogram
>
>
>
>> On Dec 19, 2017, at 8:37 AM, Yair Yarom wrote:
>>
>>
>> Thanks for your reply,
>>
>> The problem is that users are running on the submission node e.g.
>>
>> module load tensorflow
>> srun myprogram
>>
nd
> he
> goes away a bit happier, and wiser.
>
> Plan to work with your users and be prepared to train them on nuance.
>
> Gerry
>
> On Tue, Dec 19, 2017 at 9:33 AM, Loris Bennett
> wrote:
>
> Yair Yarom writes:
>
> > There are two issu
Hi,
We have here a linux x86 submission node for power8 compute nodes,
where the slurmctld and slurmdbd are running on an altogether different
freebsd x86 machine. So yes, it should work :)
Just make sure all the daemons are the same version, and take notes of
where the monitoring and maintena
Hi,
We have a license, limited using the GrpTRES of an association (this is
a "license/interactive" for https://github.com/irush-cs/slurm-plugins/).
On upgrade to 17.11.2, I've noticed that all our "license/interactive"
GrpTRES where changed to "billing". Judging by the current tres_table
and ou
Hi,
>From my experience - yes, new associations will be associated with the
QOS of the account.
I believe it doesn't explicitly modify all the associations, it just
notifies you which associations will be affected. Looking at my database
suggests that indeed most associations don't have explicit
Hi all,
I was wondering if any of you can share your insights regarding
federations. What unexpected caveats have you encountered?
We have here about 15 "small" clusters (due to political and
technical reasons), and most users have access to more than one
cluster. Federation seems like a g
Hi,
I haven't found a direct way. Here I have my own script that parses the
output of "scontrol show node" and "scontrol show job", summing up and
displaying the allocated gres.
Yair.
On Tue, Feb 13 2018, Nadav Toledo wrote:
> Hello everyone,
>
> Does anyone know of a way to get the amount of i
Hi,
This is just a guess, but there's also a cgroup.conf file where you
might need to add:
ConstrainDevices=yes
see:
https://slurm.schedmd.com/cgroup.conf.html
for more details.
HTH,
Yair.
On Mon, Mar 12 2018, Sefa Arslan wrote:
> Dear all,
>
> We have upgraded our cluster from 13 to s
Hi,
We are also limiting "interactive" jobs through a plugin. What I've
found is that in the job_descriptor the following holds:
for salloc: argc = 0, script = NULL
for srun: argc > 0, script = NULL
for sbatch: argc = 0, script != NULL
You can look at our plugin in
https://github.com/irush-cs/slu
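The plugin itself is C, but the decision table above can be sketched as follows (a Python rendering of the logic, not the plugin's actual code):

```python
def submit_type(argc, script):
    """Classify a job_descriptor by the observations above:
    sbatch is the only one that sets script; srun is the only
    one that passes argv arguments."""
    if script is not None:
        return "sbatch"
    return "srun" if argc > 0 else "salloc"

print(submit_type(0, None))         # salloc
print(submit_type(2, None))         # srun
print(submit_type(0, "#!/bin/sh"))  # sbatch
```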
Hi,
This is what we did; not sure these are the best solutions :)
## Queue stuffing
We have set PriorityWeightAge several magnitudes lower than
PriorityWeightFairshare, and we also have PriorityMaxAge set to cap of
older jobs. As I see it, the fairshare is far more important than age.
Besides t
Hi,
I'm not sure how slurm/spank handles child processes, but this might be
intentional, so there could be issues even if this were to work.
You can try, instead of calling system(), using fork() + exec(). If
that still doesn't work, try calling setsid() before the exec(). I can
think of situation
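The suggestion, sketched in Python rather than the plugin's C (os.fork/os.setsid/os.execv map directly onto the fork()/setsid()/exec() calls):

```python
import os

def run_detached(argv):
    """fork + setsid + exec instead of system(): the child gets its own
    session, so it no longer sits in the caller's process group."""
    pid = os.fork()
    if pid == 0:                     # child
        os.setsid()                  # detach into a new session
        try:
            os.execv(argv[0], argv)  # replaces the child; returns only on error
        except OSError:
            os._exit(127)
    return pid                       # parent can waitpid() on it

pid = run_detached(["/bin/true"])
_, status = os.waitpid(pid, 0)
```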
gt;>>> job sleep 6secondes and reboot my machine (i test with reboot command, but
>>>>> we can make other bash command, it's just example)
>>>>>
>>>>> pid_t cpid; //process id's and process groups
>>>>>
>>>>>