[slurm-users] Re: Executing srun -n X where X is greater than total CPU in entire cluster

2024-05-30 Thread Diego Zuccato via slurm-users
" SlurmdLogFile: "/var/log/slurm/slurmd.log" SlurmdSpoolDir: "/var/spool/slurm/d" SlurmUser: "{{ slurm_user.name <http://slurm_user.name> }}" SrunPortRange: "6-61000" StateSaveLocation: "/var/spool/slurm/ctld"

[slurm-users] Re: [EXTERN] Re: scheduling according time requirements

2024-04-30 Thread Diego Zuccato via slurm-users
t all. Any thoughts? Dietmar -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786 -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe sen

[slurm-users] Re: Lua script

2024-03-06 Thread Diego Zuccato via slurm-users
Il 06/03/2024 13:49, Gestió Servidors via slurm-users ha scritto: And how can I reject the job inside the lua script? Just use return slurm.FAILURE and job will be refused. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna
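For context, a minimal job_submit.lua sketch along the lines discussed here, refusing a submission by returning slurm.FAILURE; the 5-hour threshold, message and file path are illustrative assumptions, not taken from the thread:

    -- job_submit.lua (illustrative sketch, lives next to slurm.conf)
    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- time_limit is in minutes; slurm.NO_VAL means "not set by the user"
       if job_desc.time_limit ~= slurm.NO_VAL and job_desc.time_limit > 300 then
          slurm.log_user("time limit above 5 hours: job refused")
          return slurm.FAILURE
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end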

[slurm-users] Re: Lua script

2024-03-06 Thread Diego Zuccato via slurm-users
  end     end     return slurm.SUCCESS end However, if I submit a job with TimeLimit of 5 hours, lua script doesn’t modify submit and job remains “pending”… What am I doing wrong? Thanks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-23 Thread Diego Zuccato
A had something in mind when they developed MPS, so I guess our pattern may not be typical (or at least not universal), and in that case the MPS plugin may well be what you need. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bolog

Re: [slurm-users] Database cluster

2024-01-23 Thread Diego Zuccato
eplication, then manually switch to slurmdbd to a replication slave if the master goes down? Do you do something else? Thanks. Daniel -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna -

Re: [slurm-users] Unable to submit job (ReqNodeNotAvail, UnavailableNodes)

2023-11-07 Thread Diego Zuccato
to drained. Another possibility is that slurmctld detects a mismatch between the node and its config: in this case you'll find the reason in slurmctld.log . -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 4

Re: [slurm-users] Unable to submit job (ReqNodeNotAvail, UnavailableNodes)

2023-11-07 Thread Diego Zuccato
all_nodes*     drained 32      2:8:2  6        0      1   (null) batch job complete f You have to RESUME the node so it starts accepting jobs. scontrol update nodename=compute-0 state=resume -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di

Re: [slurm-users] Fairshare: Penalising unused memory rather than used memory?

2023-10-11 Thread Diego Zuccato
penalise everyone who requests large amounts of memory, whether it is needed or not. Therefore I would be interested in knowing whether one can take into account the *requested but unused memory* when calculating usage.  Is this possible? Cheers, Loris -- Dr. Loris Bennett (Herr/Mr) ZEDAT, Freie Un

Re: [slurm-users] job not running if partition MaxCPUsPerNode < actual max

2023-10-03 Thread Diego Zuccato
array range. I tried to add "-v" to the sbatch to see if that gives more useful info, but I couldn't get any more insight.  Does anyone have any idea why it's rejecting my job? thanks, Noam -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum -

[slurm-users] Mismatch between scontrol and sacctmgr ?

2023-09-22 Thread Diego Zuccato
es have 15 different values (including 0). -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Diego Zuccato
scription is misleading. Noam -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Diego Zuccato
that snippet in job_submit.lua ...  Would you expect that to prevent the job from ever running on any partition? Currently (and, I think, wrongly) that's exactly what happens. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Diego Zuccato
hu, Sep 21, 2023 at 3:11 AM Diego Zuccato <diego.zucc...@unibo.it> wrote: Hello all. We have one partition (b4) that's reserved for an account while the others are "free for all". The problem is that sbatch --partition=b1,b2,b3,b4,b5 test.sh fails

[slurm-users] Weirdness with partitions

2023-09-21 Thread Diego Zuccato
cate scheduler logic in job_submit.lua... :) -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

[slurm-users] "Node count specification invalid" when specifying multiple partitions???

2023-08-09 Thread Diego Zuccato
ain enough nodes to satisfy the request. That seems to also apply to all_partitions jobsubmitplugin, making it nearly useless. We're using Slurm 22.05.6 . On 20.11.4 it worked as expected (excluding partitions that couldn't satisfy the request). Any hint? TIA -- Diego Zuccato DIFA - Dip. d

[slurm-users] Reservations deleted when group is empty?

2023-08-03 Thread Diego Zuccato
be a great problem if the reservation remained... A reservation should only get deleted when expired, IMO (but I can understand that there are cases where the current behaviour is desired). -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bo

Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Thread Diego Zuccato
topic. Your expertise and assistance would greatly help me in successfully completing my project. Thank you in advance for your time and support. Best regards, Maysam Johannes Gutenberg University of Mainz -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Stu

Re: [slurm-users] Reservations and groups

2023-05-04 Thread Diego Zuccato
Ok, PEBKAC :) When creating the reservation, I set account=root . Just adding "account=" to the update fixed both errors. Sorry for the noise. Diego Il 04/05/2023 07:51, Diego Zuccato ha scritto: Hello all. I'm trying to define a reservation that only allows users
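As a hedged sketch of the sequence described (all names are placeholders, and Groups= assumes a release that supports group-based reservations):

    # create a reservation restricted to members of an AD group
    scontrol create reservation reservationname=res-TEST groups=res-TEST \
             starttime=now duration=7-00:00:00 nodes=node[01-04]
    # if it was originally created with account=root, clear the account list
    # on update so the group restriction can take effect
    scontrol update reservationname=res-TEST account= groups=res-TEST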

[slurm-users] Reservations and groups

2023-05-03 Thread Diego Zuccato
@slurmctl ~]# getent group res-TEST res-TEST:*:1180406822:testuser The group comes from AD via sssd. What am I missing? TIA -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051

Re: [slurm-users] Multiple default partitions

2023-04-17 Thread Diego Zuccato
default partitions? In the best case in a way that slurm schedules to partition1 on default and only to partition2 when partition1 can't handle the job right now. Best regards, Xaver Stiensmeier -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-14 Thread Diego Zuccato
tal runs of jobs and gather timings. We have yet to see a 100% efficient process, but folks are improving things all the time. Brian Andrus On 2/13/2023 9:56 PM, Diego Zuccato wrote: I think that's incorrect: > The concept of hyper-threading is not doubling cores. It is a single > core that

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-13 Thread Diego Zuccato
78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126) ||Has anyone faced this or a similar issue and can give me some directions? Best wishes Sebastian || -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results

2023-02-12 Thread Diego Zuccato
h.utexas.edu/~daneel/ -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Diego Zuccato
That's probably not optimal, but could work. I'd go with brutal preemption: swapping 90+G can be quite time-consuming. Diego Il 07/02/2023 14:18, Analabha Roy ha scritto: On Tue, 7 Feb 2023, 18:12 Diego Zuccato, <diego.zucc...@unibo.it> wrote: RAM used by a susp

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Diego Zuccato
ics/department/physics> The University of Burdwan <http://www.buruniv.ac.in/> Golapbag Campus, Barddhaman 713104 West Bengal, India Emails: dan...@utexas.edu <mailto:dan...@utexas.edu>, a...@phys.buruniv.ac.in <mailto:a...@phys.buruniv.ac.in>, hariseldo...@gmail.com <mail

Re: [slurm-users] How to set default partition in slurm configuration

2023-01-25 Thread Diego Zuccato
reference to the "default partition" in `JobSubmitPlugins` and this might be the solution. However, I think this is something so basic that it probably shouldn't need a plugin so I am unsure. Can anyone point me towards how setting the default partition is done? Best regards, Xaver Stie
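For the basic case no plugin is needed: the default partition is flagged directly in slurm.conf (partition and node names below are placeholders):

    # slurm.conf -- exactly one partition can carry Default=YES
    PartitionName=main Nodes=node[01-16] Default=YES MaxTime=1-00:00:00 State=UP
    PartitionName=long Nodes=node[01-16] MaxTime=7-00:00:00 State=UP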

Re: [slurm-users] GPU jobs not allocated correctly when requesting more than 1 CPU

2022-10-26 Thread Diego Zuccato
Il 21/10/2022 19:14, Rohith Mohan ha scritto: IIUC this could be the source of your problem: SelectTypeParameters=CR_CPU_Memory Maybe try CR_Core_memory . CR_CPU* does not have notion of sockets/cores/threads. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma
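The suggested change, as a hedged slurm.conf fragment (the SelectType line is an assumption about the original setup):

    # slurm.conf -- CR_Core_Memory is socket/core/thread aware, CR_CPU_Memory is not
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory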

Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Diego Zuccato
d between controllers, right? Possibly use NVME-backed (or even better NVDIMM-backed) NFS share. Or replica-3 Gluster volume with NVDIMMs for the bricks, for the paranoid :) Diego -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Ber

Re: [slurm-users] Persistent Interactive Jobs

2022-06-10 Thread Diego Zuccato
gards, -- Willy Markuske HPC Systems Engineer Research Data Services P: (619) 519-4435 -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] temporary SLURM directories

2022-05-26 Thread Diego Zuccato
Il 26/05/2022 11:48, Diego Zuccato ha scritto: Still can't export TMPDIR=... from TaskProlog script. Surely missing something important. Maybe TaskProlog is called as a subshell? In that case it can't alter caller's env... But IIUC someone made it work, and that confuses me... Seems I

Re: [slurm-users] temporary SLURM directories

2022-05-26 Thread Diego Zuccato
by the job). Still can't export TMPDIR=... from TaskProlog script. Surely missing something important. Maybe TaskProlog is called as a subshell? In that case it can't alter caller's env... But IIUC someone made it work, and that confuses me... -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi
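For reference, TaskProlog cannot export variables the normal way because it runs as a separate process; instead, slurmd applies lines the script prints to stdout that begin with "export" (or "unset") to the task's environment. A minimal sketch, with the scratch path as an assumption:

    #!/bin/bash
    # TaskProlog: stdout lines beginning with "export NAME=value" are injected
    # into the task's environment by slurmd
    TMP=/scratch/${SLURM_JOB_ID}
    mkdir -p "$TMP"
    echo "export TMPDIR=$TMP"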

Re: [slurm-users] temporary SLURM directories

2022-05-23 Thread Diego Zuccato
ne, but I'm sure there must be a better way to do this. Thanks in advance for the help. best regards, Alain -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

[slurm-users] MPICH

2022-04-28 Thread Diego Zuccato
}/usr/mpich-4.0.2 gives an executable that only uses 1 CPU even if sbatch requested 52. :( Any hint appreciated. Tks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20

Re: [slurm-users] issue with --cpus-per-task=1

2022-03-10 Thread Diego Zuccato
duced (on newer versions)? Can this somehow be avoided by setting a default number of tasks or some other (partition) parameter? Sorry for asking but I couldn't find anything in the documentation. Let me know if you need more information. Best Regards, Benjamin -- Diego Zuccato DIFA - Dip

Re: [slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Diego Zuccato
. == Paul Brunk, system administrator Georgia Advanced Resource Computing Center Enterprise IT Svcs, the University of Georgia On 2/10/22, 6:26 AM, "slurm-users" wrote: [EXTERNAL SENDER - PROCEED CAUTIOUSLY] On Thu, 2022-02-10 at 11:59:58 +0100, Diego Zuccato wrote: > Hello all.

[slurm-users] slurmctld/slurmdbd filesystem/usermap requirements

2022-02-10 Thread Diego Zuccato
slurmctld need read access to /home/userA/myjob.sh or does it receive the job script as a "blob" or as a path? Does it even need to know userA's GID or will it simply use 'userA' to lookup associations in dbd? Tks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici

Re: [slurm-users] memory per node default

2022-01-21 Thread Diego Zuccato
one (been there, done that... :( ). -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-17 Thread Diego Zuccato
Tks. Will be useful soon :) Are there other monitoring plugin you'd suggest? Il 17/12/2021 11:15, Loris Bennett ha scritto: Hi Diego, Diego Zuccato writes: Hi Loris. Il 14/12/2021 14:16, Loris Bennett ha scritto: spectrum, today, via our Zabbix monitoring, I spotted some jobs

Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-17 Thread Diego Zuccato
Hi Loris. Il 14/12/2021 14:16, Loris Bennett ha scritto: spectrum, today, via our Zabbix monitoring, I spotted some jobs with an unusually high GPU-efficiencies which turned out to be doing cryptomining :-/ What are you using to collect data for Zabbix? -- Diego Zuccato DIFA - Dip. di Fisica

Re: [slurm-users] Prevent users from updating their jobs

2021-12-17 Thread Diego Zuccato
list,user"        JobID        NodeList      User --- ----- 791                    smp-1    user01 -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] Requirement of one GPU job should run in GPU nodes in a cluster

2021-12-17 Thread Diego Zuccato
... Best, Steffen -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] nvml autodetect is ignoring gpus

2021-11-30 Thread Diego Zuccato
impact autodetection (so it "just" requires manual config) or GPU jobs won't be able to start at all? -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] enable_configless, srun and DNS vs. hosts file

2021-11-14 Thread Diego Zuccato
you saw. Restarting slurmd on the submit node fixes it. This is the documented behavior (adding nodes needs slurmd restarted everywhere). Could this be what you're seeing (as opposed to /etc/hosts vs DNS)? -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater

Re: [slurm-users] How to get an estimate of job completion for planned maintenance?

2021-11-08 Thread Diego Zuccato
time could be backfilled till the reservation/maintenance starts. You can put the reservation anytime in the system but at least or before " minus ", e.g. scontrol create reservation= starttime= duration=  user=root flags=maint nodes=ALL Hope, that helps a little bit, Carsten -
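The placeholders in the quoted command were evidently stripped by the archive; a hedged reconstruction of a whole-cluster maintenance reservation (name, start time and duration are examples):

    scontrol create reservation reservationname=maint_window \
             starttime=2021-11-15T08:00:00 duration=08:00:00 \
             user=root flags=maint nodes=ALL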

Re: [slurm-users] Wrong hwloc detected?

2021-11-08 Thread Diego Zuccato
isted here: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites /Ole On 05-11-2021 15:38, Diego Zuccato wrote: They aren't using modules so it must be something system-wide :( But not all jobs are impacted. And it seems it's a bit random (doesn't happen always). I'm ou

Re: [slurm-users] Wrong hwloc detected?

2021-11-05 Thread Diego Zuccato
They aren't using modules so it must be something system-wide :( But not all jobs are impacted. And it seems it's a bit random (doesn't happen always). I'm out of ideas, currently :( Il 05/11/2021 13:10, Ole Holm Nielsen ha scritto: On 11/5/21 12:47, Diego Zuccato wrote: Some users

[slurm-users] Wrong hwloc detected?

2021-11-05 Thread Diego Zuccato
ConstrainRAMSpace=yes ConstrainSwapSpace=yes MemorySwappiness=0 MaxSwapPercent=0 AllowedSwapSpace=0 Any ideas? Tks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20

Re: [slurm-users] slurm.conf syntax checker?

2021-10-18 Thread Diego Zuccato
tax errors and the most common errors is already a big help, especially for noobs :) [OK]: All nodeweights are correct. What do you mean with this? How can weights be "incorrect"? If someone is interested ... Surely I am :) -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Serv

Re: [slurm-users] "Low RealMem" after upgrade

2021-10-05 Thread Diego Zuccato
That's why I upgraded the whole cluster at once. Tks for the help. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] "Low RealMem" after upgrade

2021-10-05 Thread Diego Zuccato
ecified"). SLURM 20.11.4. Tks. Diego Il 01/10/2021 21:32, Paul Brunk ha scritto: Hi: If you mean "why are the nodes still Drained, now that I fixed the slurm.conf and restarted (never mind whether the RealMem parameter is correct)?", try 'scontrol update nodename=str957-bl0-0[1-2]

[slurm-users] "Low RealMem" after upgrade

2021-10-01 Thread Diego Zuccato
lt;-- I also tried lowering RealMemory setting to 6, in case MemSpecLimit interfered, but the result remains the same. Any ideas? TIA! -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bolo

Re: [slurm-users] restarting slurmctld restarts jobs???

2021-09-20 Thread Diego Zuccato
20/09/2021 13:49, Diego Zuccato ha scritto: Tks. Checked it: it's on the home filesystem, NFS-shared between the nodes. Well, actually a bit more involved than that: JobCompLoc points to /var/spool/jobscompleted.txt but /var/spool/slurm is actually a symlink to /home/conf/slurm_spool . root

Re: [slurm-users] restarting slurmctld restarts jobs???

2021-09-20 Thread Diego Zuccato
ory. The explanation at below is taken from slurm web site: "The backup controller recovers state information from the StateSaveLocation directory, which must be readable and writable from both the primary and backup controllers." Regards; Ahmet M. 20.09.2021 12:08 tarihinde Diego Zuccato yazd

[slurm-users] restarting slurmctld restarts jobs???

2021-09-20 Thread Diego Zuccato
ntly in the process of adding some nodes, but I already did it other times w/ no issues (actually the second slurmctld node have been installed to catch the race of a job terminating while the main slurmctld was shut down). Anything I should double-check? Tks. -- Diego Zuccato DIFA - Dip. di Fi

Re: [slurm-users] FreeMem is not equal to (RealMem - AllocMem)

2021-09-14 Thread Diego Zuccato
right now): RealMemory=257433 AllocMem=0 FreeMem=159610 That's probably due to buffer/caches remaining allocated between jobs. They're handled by the SO and should be automatically freed when a program needs mem. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Diego Zuccato
IIRC we increased SlurmdTimeout to 7200 . Il 06/08/2021 13:33, Adrian Sevcenco ha scritto: On 8/6/21 1:56 PM, Diego Zuccato wrote: We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts oh, i see.. well, in principle i should

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Diego Zuccato
:46, Adrian Sevcenco ha scritto: On 8/6/21 1:27 PM, Diego Zuccato wrote: Hi. Hi! Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)? i will have to search for culprit .. the problem is why would the node be put in drain for the reason

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-06 Thread Diego Zuccato
submit one job with 8 gpus, it will pending because of gpu fragments: nodes A has 2 idle gpus, node b 6 idle gpus Thanks in advance! -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Diego Zuccato
of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of task fails? (this is on slurm 19.05) Thank you! Adrian -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Diego Zuccato
start". But pestat and slurmtop are different tools for different uses, no need to duplicate all functionality. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] slurmctld doesn't start at boot

2021-07-23 Thread Diego Zuccato
pport is all right with no problems, but slurmctld still does not start on boot. Also in the log reported blade01 is the hostname of one of the nodes. You should probably fix /usr/lib/systemd/system/slurmdbd.service as well. /Ole -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Inform

Re: [slurm-users] slurmctld doesn't start at boot

2021-07-23 Thread Diego Zuccato
elete it permanently from your computer system. Fai crescere i nostri giovani ricercatori dona il 5 per mille alla Sapienza *codice fiscale 80209930587* -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi In

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Diego Zuccato
don't quite see how one could integrate pestat itself directly into Zabbix, as it is more geared to producing a report, but maybe Ole has ideas :-) How to use the collected data is one of the big open problems in IT :) -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Diego Zuccato
with it yet (for example I still can't understand how I can exclude some metrics from a host that got 'em added by a template... When I'll have enough time I'll find a way :) ). Maybe pestat can be added to the Zabbix metrics... -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Inform

Re: [slurm-users] 4 sockets but "

2021-07-21 Thread Diego Zuccato
restarted slurmctld and it keeps seeing all CPUs... What should I think? But another problem surfaces: slurmtop seems not to handle so many CPUs gracefully and throws a lot of errors, but that should be something manageable... Tks for the help. BYtE, Diego Il 21/07/2021 11:01, Diego Zuccato

Re: [slurm-users] 4 sockets but "

2021-07-21 Thread Diego Zuccato
Uff... A bit mangled... Correcting and resending. Il 21/07/2021 08:18, Diego Zuccato ha scritto: Il 20/07/2021 18:02, mercan ha scritto: Hi Ahmet. Did you check slurmctld log for a complain about the host line. if the slumctld can not recognize a parameter, may be it give up processing whole

Re: [slurm-users] 4 sockets but "

2021-07-21 Thread Diego Zuccato
] _build_node_list: No nodes satisfy JobId=33808 requirements in partition b4 (str957 is the second frontend/login node that I've had to take offline for an unrelated problem). -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le

Re: [slurm-users] 4 sockets but "

2021-07-21 Thread Diego Zuccato
in later versions... Maybe delete Boards=1 SocketsPerBoard=4 and try Sockets=4 in stead? Already tried. Actually, it's been the first try. The pam_slurm_adopt is very useful :-) IIUC only if you allow users to connect to the worker nodes. I don't. :) -- Diego Zuccato DIFA - Dip. di Fisica e

Re: [slurm-users] 4 sockets but "

2021-07-20 Thread Diego Zuccato
sik.dtu.dk/niflheim/Slurm_configuration#compute-node-configuration Tks. Interesting, but I don't see pam_slurm_adopt. Other than that, it seems very much like what I'm doing. BYtE, Diego On 7/20/21 12:49 PM, Diego Zuccato wrote: Hello all. It's been since yesterday that I'm facing this i

[slurm-users] 4 sockets but "

2021-07-20 Thread Diego Zuccato
ctld after every change in slurm.conf just to be sure. Any idea? Tks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] incorrect number of cpu's being reported in srun job

2021-06-18 Thread Diego Zuccato
impacting other users. Even if you just make users "pay" for the resources used by applying fairshare, the temptation to game the system could be too big. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6

Re: [slurm-users] Job requesting two different GPUs on two different nodes

2021-06-10 Thread Diego Zuccato
experienced can refine it. No... it doesn't work... -Original Message- From: Diego Zuccato Sent: Thursday, 10 June 2021 10:37 To: Slurm User Community List ; Gestió Servidors Subject: Re: [slurm-users] Job requesting two different GPUs on two different nodes Il 08/06/2021 15

Re: [slurm-users] Job requesting two different GPUs on two different nodes

2021-06-10 Thread Diego Zuccato
, --gres=gpu:GeForceRTX2070:1” because line “#SBATCH --gres=” is for each node and, then, a line containing two “gres” would request a node with 2 different GPUs. So… is it possible to request 2 different GPUs in 2 different nodes? Thanks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia
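One way to express "one GPU of each type, each on its own node" is a heterogeneous job, assuming a Slurm release that supports it (the second GPU type name is a placeholder; before 20.02 the separator keyword was spelled packjob):

    #!/bin/bash
    # first component: one node with the RTX 2070 (type name from the thread)
    #SBATCH --nodes=1 --gres=gpu:GeForceRTX2070:1
    #SBATCH hetjob
    # second component: one node with the other GPU type (placeholder name)
    #SBATCH --nodes=1 --gres=gpu:TeslaV100:1
    srun --het-group=0,1 ./my_mpi_app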

Re: [slurm-users] Conflicting --nodes and --nodelist

2021-06-03 Thread Diego Zuccato
your job. Brian Andrus On 6/1/2021 4:15 AM, Diego Zuccato wrote: Hello all. I just found that if an user tries to specify a nodelist (say including 2 nodes) and --nodes=1, the job gets rejected with sbatch: error: invalid number of nodes (-N 2-1) The expected behaviour is that slurm schedules

[slurm-users] Conflicting --nodes and --nodelist

2021-06-01 Thread Diego Zuccato
conflicting info about the issue. Is it version-dependant? If so, we're currently using 18.08.5-2 (from Debian stable). Should we expect changes when Debian will ship a newer version? Is it possible to have the expected behaviour? Tks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Diego Zuccato
s page. Tks. I upgrade Slurm frequently and have no problems doing so.  We're at 20.11.7 now.  You should avoid 20.11.{0-2} due to a bug in MPI. That's a really useful info. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Diego Zuccato
). As Ole said, it's an old version. I'd love to be able to keep up with the newest releases, but ... :( -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Diego Zuccato
- -- --- ophcpu 81.93% 0.00% 0.00% 15.85% 2.22% 100.00% ophmem 80.60% 0.00% 0.00% 19.40% 0.00% 100.00% BYtE, Diego -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Diego Zuccato
Il 14/05/2021 08:19, Christopher Samuel ha scritto: sreport -t percent -T ALL cluster utilization "sreport: fatal: No valid TRES given" :( -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 401
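On releases where -T ALL trips the "No valid TRES" error, naming the TRES explicitly usually works; a hedged example (the date range is illustrative):

    sreport -t percent -T cpu,mem cluster utilization start=2021-04-01 end=2021-05-01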

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-13 Thread Diego Zuccato
, but there's a very low-volume mailing list at ccr-xdmod-l...@listserv.buffalo.edu you could inquire at. [1] https://github.com/ubccr/xdmod/releases/tag/v9.5.0-rc.4 From: Diego Zu

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Diego Zuccato
Il 12/05/21 13:30, Diego Zuccato ha scritto: Anyway, at a first glance, it uses a bit too many technologies for my taste (php, java, js...) and could be a problem integrating it in a vhost managed by one of our ISPConfig instances. But I'll try it. Somehow I'll make it work :) The more I look

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Diego Zuccato
ances. But I'll try it. Somehow I'll make it work :) -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Diego Zuccato
to the bare numbers is definitely a no-no :) -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Diego Zuccato
to do some changes (re field witdh: our usernames are quite long, being from AD), but first I have to check if it extracts the info our users want to see :) -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 401

[slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Diego Zuccato
s (or at least the data to put in a spreadsheet for further processing)? Tks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] [External] Re: PropagateResourceLimits

2021-04-28 Thread Diego Zuccato
propagated (as implied by PropagateResourceLimits default value of ALL). And I can confirm that setting it to NONE seems to have solved the issue: users on the frontend get limited resources, and jobs on the nodes get the resources they asked. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia

[slurm-users] PropagateResourceLimits

2021-04-22 Thread Diego Zuccato
tried to limit to 1GB soft / 4GB hard the memory users can use on the frontend, the jobs began to fail at startup even if they requested 200G (that are available on the worker nodes but not on the frontend)... Tks. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma

Re: [slurm-users] how to print all the key-values of "job_desc" in job_submit.lua?

2021-03-29 Thread Diego Zuccato
Il 29/03/21 09:35, taleinterve...@sjtu.edu.cn ha scritto: > Why the loop code cannot get the content in job_desc? And what is the > correct way to print all its content without manually specify each key? I already reported it quite some time ago. Seems pairs() is not working. -- Diego Z
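Since job_desc is a userdata proxy rather than a plain Lua table, pairs() has nothing to iterate; the usual workaround is to walk a hand-maintained list of field names. A sketch (the field list is an assumption and varies by Slurm version):

    local fields = { "name", "partition", "time_limit", "min_nodes",
                     "max_nodes", "num_tasks", "pn_min_memory" }

    function slurm_job_submit(job_desc, part_list, submit_uid)
       for _, k in ipairs(fields) do
          slurm.log_info("job_submit: %s = %s", k, tostring(job_desc[k]))
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end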

Re: [slurm-users] MaxTime only for a user

2021-02-25 Thread Diego Zuccato
ition. So the definition will have to be reversed: set the partition limit to the max allowed (1h) and limit all users except one in the assoc. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologn

Re: [slurm-users] associations, limits,qos

2021-01-29 Thread Diego Zuccato
Il 29/01/21 08:47, Diego Zuccato ha scritto: >> Jobs submitted with sbatch cannot run on multiple partitions. The job >> will be submitted to the partition where it can start first. (from >> sbatch reference) > Did I misunderstand or heterogeneous jobs can workaround this

Re: [slurm-users] associations, limits,qos

2021-01-28 Thread Diego Zuccato
Il 25/01/21 14:46, Durai Arasan ha scritto: > Jobs submitted with sbatch cannot run on multiple partitions. The job > will be submitted to the partition where it can start first. (from > sbatch reference) Did I misunderstand or heterogeneous jobs can workaround this limitation? -- Dieg

Re: [slurm-users] Questions about sacctmgr load command

2021-01-14 Thread Diego Zuccato
hose changes. IIUC, if you don't specify "clean" when loading new config, users removed from the dump are left active. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786
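For reference, the dump/load round-trip under discussion (cluster and file names are placeholders; check the sacctmgr man page for your release):

    # dump the current associations to a flat text file
    sacctmgr dump mycluster file=mycluster.cfg
    # reload it; without the "clean" keyword, users/accounts that are
    # missing from the file are left in place
    sacctmgr load file=mycluster.cfg clean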

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-29 Thread Diego Zuccato
Just guessing, tho. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Re: [slurm-users] seff Not Caluculating [FIXED?]

2020-11-18 Thread Diego Zuccato
asks' => 1, 'nodes' => 'str957-bl0-17', 'start' => 1605621479, 'user_cpu_usec' => 116130, 'req_cpufreq_max' => 0 } ] }; Job ID: 9604

Re: [slurm-users] seff Not Caluculating [FIXED?]

2020-11-17 Thread Diego Zuccato
Il 09/11/20 12:53, Diego Zuccato ha scritto: > Seems my corrections actually work only for single-node jobs. > In case of multi-node jobs, it only considers the memory used on one > node, hence underestimates the real efficiency. > Someone more knowledgeable than me can spot the e

Re: [slurm-users] seff Not Caluculating

2020-11-09 Thread Diego Zuccato
Il 15/09/20 10:14, Diego Zuccato ha scritto: Seems my corrections actually work only for single-node jobs. In case of multi-node jobs, it only considers the memory used on one node, hence underestimates the real efficiency. Someone more knowledgeable than me can spot the error? TIA! >

Re: [slurm-users] Using hyperthreaded processors

2020-11-06 Thread Diego Zuccato
obs that are mostly CPU-intensive but needing very fast IPC. In our tests, *fixed the time* at 24h, using HT vs one-process-per-core lead to 1.8x the iterations. In other words there were twice the processes running at 90% "clock". -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia S

Re: [slurm-users] Using hyperthreaded processors

2020-11-06 Thread Diego Zuccato
ll run at about twice the speed it achieves when running on a single thread. Tested with FPU-intensive code on our cluster. What thrashes performance is trying to run different processes in the two threads of a core. Just my $.02 -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi I

[slurm-users] Reset all accounting data

2020-11-02 Thread Diego Zuccato
'. Are you sure you want to continue? (You have 30 seconds to decide) (N/y): y sacctmgr: error: An association name is required to remove usage Should I iterate all the accounts or is there a better/faster method? TIA! -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma
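If the goal is only to zero the accrued fairshare usage rather than to delete the associations, a possible alternative (hedged; RawUsage can only be set to 0 and the exact behaviour may vary by release):

    # reset the raw usage accrued by an account (name is a placeholder)
    sacctmgr modify account name=myaccount set RawUsage=0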
