Re: [slurm-users] runtime priority

2020-06-30 Thread Frank Lenaerts
On Tue, Jun 30, 2020 at 10:52:00AM -0400, Lawrence Stewart wrote:
> How does one configure the runtime priority of a job?  That is, how do you 
> set the CPU scheduling “nice” value?
> 
> We’re using Slurm to share a large (16 core 768 GB) server among FPGA 
> compilation jobs.  Slurm handles core and memory reservations just fine, but 
> runs everything nice -19, which makes for huge load averages and terrible 
> interactive performance.
> 
> Manually setting the compilation processes with “renice 19 ” works fine, 
> but is tedious.

I would first check whether /etc/security/limits.conf contains a "priority"
entry: if it is set to -19 for root (slurmd typically runs as root), child
processes will inherit that value...
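
For example (values here are illustrative, not taken from this thread), a
limits.conf entry of the form

    #<domain>   <type>   <item>      <value>
    root        -        priority    -19

would set the nice value to -19 for root sessions that go through pam_limits;
whether slurmd actually picks this up depends on how the daemon is started.
The nice value slurmd is running at can be checked with something like

    ps -C slurmd -o pid,ni,comm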


> -Larry

-- 
Kind regards

Frank Lenaerts




Re: [slurm-users] runtime priority

2020-06-30 Thread Lawrence Stewart
As far as I can tell, sbatch --nice only affects scheduling priority, not CPU 
priority.

I’ve worked around this by putting “nice -n 19 xxx” as the command to run in my 
sbatch scripts.
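
For reference, a minimal sketch of such a script (the resource values and the
compile command are placeholders, not our actual job):

    #!/bin/bash
    #SBATCH --job-name=fpga-build      # placeholder job name
    #SBATCH --cpus-per-task=4          # placeholder resource request
    #SBATCH --mem=32G                  # placeholder resource request

    # Run the compilation at the lowest CPU priority so that interactive
    # users on the shared server stay responsive.
    nice -n 19 ./run_fpga_compile.sh   # placeholder for the real compile command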

> On 2020, Jun 30, at 11:07 AM, Renfro, Michael  wrote:
> 
> There’s a --nice flag to sbatch and srun, at least. Documentation indicates 
> it decreases priority by 100 by default.
> 
> And untested, but it may be possible to use a job_submit.lua [1] to adjust 
> nice values automatically. At least I can see a nice property in [2], which I 
> assume means it'd be accessible as job_desc.nice in the Lua script.
> 
> [1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
> [2] https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c
> 
>> On Jun 30, 2020, at 9:52 AM, Lawrence Stewart  wrote:
>> 
>> How does one configure the runtime priority of a job?  That is, how do you 
>> set the CPU scheduling “nice” value?
>> 
>> We’re using Slurm to share a large (16 core 768 GB) server among FPGA 
>> compilation jobs.  Slurm handles core and memory reservations just fine, but 
>> runs everything nice -19, which makes for huge load averages and terrible 
>> interactive performance.
>> 
>> Manually setting the compilation processes with “renice 19 ” works 
>> fine, but is tedious.
>> 
>> -Larry
>> 
>> 
> 




Re: [slurm-users] runtime priority

2020-06-30 Thread Renfro, Michael
There’s a --nice flag to sbatch and srun, at least. Documentation indicates it 
decreases priority by 100 by default.
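
For example (job.sh is a placeholder script name):

    sbatch --nice job.sh        # decrease the job's scheduling priority by the default of 100
    sbatch --nice=500 job.sh    # decrease it by 500 instead
    srun --nice=200 ./a.out     # the same flag exists for srun; ./a.out is a placeholder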

And untested, but it may be possible to use a job_submit.lua [1] to adjust nice 
values automatically. At least I can see a nice property in [2], which I assume 
means it'd be accessible as job_desc.nice in the Lua script.

[1] https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
[2] https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c

> On Jun 30, 2020, at 9:52 AM, Lawrence Stewart  wrote:
> 
> How does one configure the runtime priority of a job?  That is, how do you 
> set the CPU scheduling “nice” value?
> 
> We’re using Slurm to share a large (16 core 768 GB) server among FPGA 
> compilation jobs.  Slurm handles core and memory reservations just fine, but 
> runs everything nice -19, which makes for huge load averages and terrible 
> interactive performance.
> 
> Manually setting the compilation processes with “renice 19 ” works fine, 
> but is tedious.
> 
> -Larry
> 
> 



[slurm-users] runtime priority

2020-06-30 Thread Lawrence Stewart
How does one configure the runtime priority of a job?  That is, how do you set 
the CPU scheduling “nice” value?

We’re using Slurm to share a large (16 core 768 GB) server among FPGA 
compilation jobs.  Slurm handles core and memory reservations just fine, but 
runs everything nice -19, which makes for huge load averages and terrible 
interactive performance.

Manually setting the compilation processes with “renice 19 ” works fine, 
but is tedious.

-Larry




Re: [slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
...
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for 
JobId=964556 with not NULL job resources
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for 
JobId=964574 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964557 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964558 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964559 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964560 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964573 with not NULL job resources
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 WEXITSTATUS 0
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 done
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for 
JobId=964294 with not NULL job resources
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for 
JobId=964295 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for 
JobId=964296 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for 
JobId=964297 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for 
JobId=964298 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for 
JobId=964299 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for 
JobId=964300 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for 
JobId=964301 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for 
JobId=964302 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for 
JobId=964303 with not NULL job resources


I have a limit on cores/nodes per user, and these errors are related to it.

 
Angelines Alberto Morillas
 
Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax:   +34 91 346 6537
 
skype: angelines.alberto
 
CIEMAT
Avenida Complutense, 40
28040 MADRID
 
 
 

On 30/6/20 10:54, "slurm-users on behalf of slurm-users-requ...@lists.schedmd.com" wrote:

Can you also post the slurmctld log file from the server (controller)?





Re: [slurm-users] fail job

2020-06-30 Thread Gestió Servidors
Can you also post the slurmctld log file from the server (controller)?




Re: [slurm-users] fail job

2020-06-30 Thread Durai Arasan
Hi,

Can you post the output of the following commands on your master node?:

sacctmgr show cluster

scontrol show nodes

Best,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen


On Tue, Jun 30, 2020 at 10:33 AM Alberto Morillas, Angelines <
angelines.albe...@ciemat.es> wrote:

> Hi,
>
>
>
> We have slurm version 18.08.6
>
> One of my nodes is in drain state Reason=Kill task failed
> [root@2020-06-27T02:25:29]
>
>
>
> In the node I can see in the slurmd.log
>
>
>
> [2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
>
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for
> node: 0x0F
>
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for
> node: 0x55
>
> [2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
>
> [2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran
> for 0 seconds
>
> [2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
>
> [2020-06-27T01:24:26.276] [963771.batch] task/cgroup:
> /slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB
> memsw.limit=147456MB
>
> [2020-06-27T01:24:26.284] [963771.batch] task/cgroup:
> /slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB
> memsw.limit=147456MB
>
> [2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using
> sched_affinity for tasks
>
> [2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON
> node0802 CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
>
> [2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD
> TERMINATED ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH
> SIGNALS ***
>
> [2020-06-27T02:25:27.009] [963771.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
>
> [2020-06-27T02:25:27.011] [963771.batch] done with job
>
>
>
> If I try to get information about this job, I get nothing.
>
>
>
> sacct -j 963771
>
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
>
>
>
> Why don't I get information about this job?
>
>
>
> Thanks in advance
>
> Angelines
>
> 
>
>
>
> Angelines Alberto Morillas
>
>
>
> Unidad de Arquitectura Informática
>
> Despacho: 22.1.32
>
> Telf.: +34 91 346 6119
>
> Fax:   +34 91 346 6537
>
>
>
> skype: angelines.alberto
>
>
>
> CIEMAT
>
> Avenida Complutense, 40
>
> 28040 MADRID
>
> 
>
>
>
>
>


[slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
Hi,

We have slurm version 18.08.6
One of my nodes is in drain state Reason=Kill task failed 
[root@2020-06-27T02:25:29]

In the node I can see in the slurmd.log

[2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for node: 
0x0F
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for node: 
0x55
[2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
[2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran for 
0 seconds
[2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
[2020-06-27T01:24:26.276] [963771.batch] task/cgroup: 
/slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB 
memsw.limit=147456MB
[2020-06-27T01:24:26.284] [963771.batch] task/cgroup: 
/slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB 
memsw.limit=147456MB
[2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using 
sched_affinity for tasks
[2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON node0802 
CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
[2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD TERMINATED 
ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-06-27T02:25:27.009] [963771.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, 
error:4001 status 15
[2020-06-27T02:25:27.011] [963771.batch] done with job

If I try to get information about this job, I get nothing.

sacct -j 963771
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

Why don't I get information about this job?
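
One thing that can hide older jobs, depending on the Slurm version and the
options used, is sacct's default time window; this is only a guess, but
querying with an explicit window would look like:

    sacct -S 2020-06-27 -E 2020-06-28 -j 963771 \
          --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode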

Thanks in advance
Angelines


Angelines Alberto Morillas

Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax:   +34 91 346 6537

skype: angelines.alberto

CIEMAT
Avenida Complutense, 40
28040 MADRID





Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-30 Thread navin srivastava
Hi Team,

I have separated the CPU-only nodes and the GPU nodes into two different queues.

I have 20 nodes with CPUs only (20 cores each) and no GPUs. Another set of
nodes has both GPUs and CPUs: some have 2 GPUs and 20 CPUs, others have
8 GPUs and 48 CPUs, and these are assigned to the GPU queue.

Users are running into a problem in the GPU queue. The scenario is as follows:
users submit jobs requesting 4 CPUs + 1 GPU, and also jobs requesting 4 CPUs
only. When all GPUs are in use, the jobs requesting a GPU wait in the queue,
and even though a large number of CPUs is still available, the CPU-only jobs
do not start because the 4 CPU + 1 GPU jobs have higher priority.

Is there a mechanism so that, once all GPUs are in use, the CPU-only jobs are
allowed to run?
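
One way to see what is holding the CPU-only jobs back is the priority and
pending reason that squeue reports for them; a sketch, assuming the GPU
partition is simply named "gpu":

    squeue -p gpu -t PD -o "%.12i %.9P %.10Q %R"
    # %Q = scheduling priority, %R = reason the job is pending
    # (e.g. Priority vs. Resources)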

Regards
Navin.






On Mon, Jun 22, 2020 at 6:09 PM Diego Zuccato 
wrote:

> On 16/06/20 16:23, Loris Bennett wrote:
>
> > Thanks for pointing this out - I hadn't been aware of this.  Is there
> > anywhere in the documentation where this is explicitly stated?
> I don't remember. Seems Michael's experience is different. Possibly some
> other setting influences that behaviour. Maybe different partition
> priorities?
> But on the small cluster I'm managing it's this way. I'm not an expert
> and I'd like to understand.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>