Re: [slurm-users] runtime priority
On Tue, Jun 30, 2020 at 10:52:00AM -0400, Lawrence Stewart wrote:
> How does one configure the runtime priority of a job? That is, how do you
> set the CPU scheduling “nice” value?
>
> We’re using Slurm to share a large (16 core 768 GB) server among FPGA
> compilation jobs. Slurm handles core and memory reservations just fine, but
> runs everything nice -19, which makes for huge load averages and terrible
> interactive performance.
>
> Manually setting the compilation processes with “renice 19” works fine,
> but is tedious.

I would first check whether /etc/security/limits.conf contains a "priority" entry: suppose it is set to -19 for root; slurmd typically runs as root, and child processes inherit the value...

> -Larry

--
Kind regards
Frank Lenaerts
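If the limits.conf suspicion above is right, it is quick to check. A minimal sketch (the sample file and the -19 value are assumptions for illustration; on a real node you would grep /etc/security/limits.conf itself):

```shell
# Build a sample limits file so this check is self-contained; on a real
# node the file is /etc/security/limits.conf, read by PAM's pam_limits.
cat > /tmp/limits.conf.sample <<'EOF'
root    hard    priority    -19
EOF

# A "priority" item here is applied as the nice value of sessions that
# pam_limits governs, and child processes then inherit it.
grep priority /tmp/limits.conf.sample
```

If the grep turns up a negative priority for root, that would explain why every job slurmd launches runs at nice -19.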
Re: [slurm-users] runtime priority
As far as I can tell, sbatch --nice only affects scheduling priority, not CPU priority. I’ve made a workaround by putting “nice -n 19 xxx” as the job to run in my sbatch scripts.

> On 2020, Jun 30, at 11:07 AM, Renfro, Michael wrote:
>
> There’s a --nice flag to sbatch and srun, at least. Documentation indicates
> it decreases priority by 100 by default.
>
> And untested, but it may be possible to use a job_submit.lua [1] to adjust
> nice values automatically. At least I can see a nice property in [2], which I
> assume means it'd be accessible as job_desc.nice in the Lua script.
>
> [1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
> [2] https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c
>
>> On Jun 30, 2020, at 9:52 AM, Lawrence Stewart wrote:
>>
>> How does one configure the runtime priority of a job? That is, how do you
>> set the CPU scheduling “nice” value?
>>
>> We’re using Slurm to share a large (16 core 768 GB) server among FPGA
>> compilation jobs. Slurm handles core and memory reservations just fine, but
>> runs everything nice -19, which makes for huge load averages and terrible
>> interactive performance.
>>
>> Manually setting the compilation processes with “renice 19” works
>> fine, but is tedious.
>>
>> -Larry
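For reference, the workaround looks roughly like this inside a batch script (the resource numbers and the echoed placeholder stand in for the real FPGA compile command):

```shell
#!/bin/sh
#SBATCH --job-name=fpga-compile
#SBATCH --cpus-per-task=4

# Wrap the real work in nice(1) so it runs at the lowest CPU priority.
# Note this does not touch Slurm's own scheduling priority (sbatch --nice);
# the two "nice" values are unrelated.
nice -n 19 echo "placeholder for the real FPGA compile command"
```

The `#SBATCH` lines are ordinary comments to the shell, so the script stays a valid shell script whether Slurm runs it or you do.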
Re: [slurm-users] runtime priority
There’s a --nice flag to sbatch and srun, at least. Documentation indicates it decreases priority by 100 by default.

And untested, but it may be possible to use a job_submit.lua [1] to adjust nice values automatically. At least I can see a nice property in [2], which I assume means it'd be accessible as job_desc.nice in the Lua script.

[1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
[2] https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c

> On Jun 30, 2020, at 9:52 AM, Lawrence Stewart wrote:
>
> How does one configure the runtime priority of a job? That is, how do you
> set the CPU scheduling “nice” value?
>
> We’re using Slurm to share a large (16 core 768 GB) server among FPGA
> compilation jobs. Slurm handles core and memory reservations just fine, but
> runs everything nice -19, which makes for huge load averages and terrible
> interactive performance.
>
> Manually setting the compilation processes with “renice 19” works fine,
> but is tedious.
>
> -Larry
[slurm-users] runtime priority
How does one configure the runtime priority of a job? That is, how do you set the CPU scheduling “nice” value?

We’re using Slurm to share a large (16 core 768 GB) server among FPGA compilation jobs. Slurm handles core and memory reservations just fine, but runs everything nice -19, which makes for huge load averages and terrible interactive performance.

Manually setting the compilation processes with “renice 19” works fine, but is tedious.

-Larry
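To make the manual renice step less tedious in the meantime, the per-process renicing can be looped over matching processes. A sketch only: the process name `fpga_compile` is a made-up placeholder for whatever the compiler binary is actually called.

```shell
# Renice every running process whose command line matches the placeholder
# name; pgrep may match nothing, in which case the loop is a no-op.
for pid in $(pgrep -f fpga_compile); do
    renice -n 19 -p "$pid"
done

# Sanity check of the mechanism itself: `nice` with no arguments prints
# the current niceness, so a command run under `nice -n 19` reports 19.
nice -n 19 nice
```

Unprivileged users can always raise the niceness of their own processes (toward 19), so this works without root.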
Re: [slurm-users] fail job
...
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964556 with not NULL job resources
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964574 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964557 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964558 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964559 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964560 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964573 with not NULL job resources
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 WEXITSTATUS 0
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 done
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for JobId=964294 with not NULL job resources
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for JobId=964295 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964296 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964297 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964298 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964299 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964300 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964301 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for JobId=964302 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for JobId=964303 with not NULL job resources

I have a limit on cores/nodes per user, and these errors are related to it.

Angelines

Angelines Alberto Morillas

Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax: +34 91 346 6537
skype: angelines.alberto

CIEMAT
Avenida Complutense, 40
28040 MADRID

On 30/6/20 10:54, Gestió Servidors wrote:
> Can you post, also, the slurmctld log file from the server (controller)?
Re: [slurm-users] fail job
Can you post, also, the slurmctld log file from the server (controller)?
Re: [slurm-users] fail job
Hi,

Can you post the output of the following commands on your master node?

sacctmgr show cluster
scontrol show nodes

Best,
Durai Arasan
Zentrum für Datenverarbeitung Tübingen

On Tue, Jun 30, 2020 at 10:33 AM Alberto Morillas, Angelines <angelines.albe...@ciemat.es> wrote:
> Hi,
>
> We have Slurm version 18.08.6.
> One of my nodes is in drain state: Reason=Kill task failed [root@2020-06-27T02:25:29]
>
> In the node I can see in slurmd.log:
>
> [2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for node: 0x0F
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for node: 0x55
> [2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
> [2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran for 0 seconds
> [2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
> [2020-06-27T01:24:26.276] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
> [2020-06-27T01:24:26.284] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
> [2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON node0802 CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
> [2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD TERMINATED ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
> [2020-06-27T02:25:27.009] [963771.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
> [2020-06-27T02:25:27.011] [963771.batch] done with job
>
> If I try to get information about this job, I get nothing:
>
> sacct -j 963771
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
>
> Why don't I get information about this job?
>
> Thanks in advance
> Angelines
[slurm-users] fail job
Hi,

We have Slurm version 18.08.6.

One of my nodes is in drain state: Reason=Kill task failed [root@2020-06-27T02:25:29]

In the node I can see in slurmd.log:

[2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for node: 0x0F
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for node: 0x55
[2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
[2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran for 0 seconds
[2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
[2020-06-27T01:24:26.276] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
[2020-06-27T01:24:26.284] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
[2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON node0802 CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
[2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD TERMINATED ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-06-27T02:25:27.009] [963771.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2020-06-27T02:25:27.011] [963771.batch] done with job

If I try to get information about this job, I get nothing:

sacct -j 963771
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

Why don't I get information about this job?

Thanks in advance
Angelines
Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs
Hi Team,

I have separated the CPU nodes and GPU nodes into two different queues. There are 20 nodes with CPUs only (20 cores each) and no GPU. Another set of nodes has GPU+CPU and is assigned to the GPU queue: some nodes have 2 GPUs and 20 CPUs, and some have 8 GPUs and 48 CPUs.

Users are facing issues in the GPU queue. The scenario is as follows: a user submits jobs requesting 4 CPUs + 1 GPU, and also submits jobs requesting 4 CPUs only. When all the GPUs are in use, the jobs requesting GPU resources wait in the queue; but even though a large number of CPUs is still available, the CPU-only jobs do not go through, because the 4CPU+1GPU jobs have higher priority over the CPU-only ones.

Is there any mechanism so that, once all GPUs are in use, CPU-only jobs are allowed to run?

Regards,
Navin.

On Mon, Jun 22, 2020 at 6:09 PM Diego Zuccato wrote:
> On 16/06/20 16:23, Loris Bennett wrote:
>> Thanks for pointing this out - I hadn't been aware of this. Is there
>> anywhere in the documentation where this is explicitly stated?
> I don't remember. Seems Michael's experience is different. Possibly some
> other setting influences that behaviour. Maybe different partition
> priorities?
> But on the small cluster I'm managing it's this way. I'm not an expert
> and I'd like to understand.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
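One approach sometimes used for the setup described above (untested here, a sketch only: the node names, counts, and partition split are assumptions) is to let the CPU partition overlap the GPU nodes, so CPU-only jobs can backfill onto idle cores there:

```
# slurm.conf fragment (illustrative names/sizes, not a tested config)
NodeName=cpu[01-20] CPUs=20
NodeName=gpu[01-02] CPUs=20 Gres=gpu:2
NodeName=gpu[03-04] CPUs=48 Gres=gpu:8
PartitionName=cpu Nodes=cpu[01-20],gpu[01-04] Default=YES
PartitionName=gpu Nodes=gpu[01-04]
```

Whether this helps depends on local policy; you may also want to cap how many cores CPU-only jobs can take on the GPU nodes (e.g. via MaxCPUsPerNode on the cpu partition) so they do not starve GPU jobs of the cores they need.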