Re: [slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
...
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for 
JobId=964556 with not NULL job resources
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for 
JobId=964574 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964557 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964558 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964559 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964560 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for 
JobId=964573 with not NULL job resources
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 WEXITSTATUS 0
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 done
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for 
JobId=964294 with not NULL job resources
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for 
JobId=964295 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for 
JobId=964296 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for 
JobId=964297 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for 
JobId=964298 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for 
JobId=964299 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for 
JobId=964300 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for 
JobId=964301 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for 
JobId=964302 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for 
JobId=964303 with not NULL job resources


I have a per-user limit on cores/nodes, and these errors are related to it.
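
To see which jobs are hitting this message, the job IDs can be pulled out of the controller log. What follows is only a minimal sketch in Python, assuming the log text is available as a string. The function name is made up for illustration, and the sample lines are copied from the excerpt above, including the mail client's line wrapping.

```python
import re

# Sample lines copied from the controller log excerpt; the mail client wrapped
# each message, so the JobId sits on the continuation line.
LOG = """\
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for 
JobId=964556 with not NULL job resources
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for 
JobId=964574 with not NULL job resources
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 WEXITSTATUS 0
"""


def jobs_with_select_nodes_error(text):
    """Return the sorted, de-duplicated JobIds named in select_nodes errors."""
    # Collapse every run of whitespace (including the wrap newline) to one space.
    flat = re.sub(r"\s+", " ", text)
    pattern = r"select_nodes: calling _get_req_features\(\) for JobId=(\d+)"
    return sorted({int(job_id) for job_id in re.findall(pattern, flat)})


affected_jobs = jobs_with_select_nodes_error(LOG)
print(affected_jobs)
```

The job that merely completed (964580) is not reported, because only the select_nodes error lines are matched.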

 
Angelines Alberto Morillas
 
Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax:   +34 91 346 6537
 
skype: angelines.alberto
 
CIEMAT
Avenida Complutense, 40
28040 MADRID
 
 
 

On 30/6/20 10:54, "slurm-users on behalf of 
slurm-users-requ...@lists.schedmd.com" wrote:

Message: 1
Date: Tue, 30 Jun 2020 08:55:01 +
From: Gestió Servidors
To: "slurm-users@lists.schedmd.com"
Subject: Re: [slurm-users] fail job

Can you also post the slurmctld log file from the server (controller)?





Re: [slurm-users] fail job

2020-06-30 Thread Gestió Servidors
Can you also post the slurmctld log file from the server (controller)?




Re: [slurm-users] fail job

2020-06-30 Thread Durai Arasan
Hi,

Can you post the output of the following commands, run on your master node?

sacctmgr show cluster

scontrol show nodes

Best,
Durai Arasan
Zentrum für Datenverarbeitung
Tübingen


On Tue, Jun 30, 2020 at 10:33 AM Alberto Morillas, Angelines <
angelines.albe...@ciemat.es> wrote:

> Hi,
>
>
>
> We are running Slurm version 18.08.6.
>
> One of my nodes is in the drain state with Reason=Kill task failed
> [root@2020-06-27T02:25:29]
>
>
>
> On the node, slurmd.log shows:
>
>
>
> [2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
>
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for
> node: 0x0F
>
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for
> node: 0x55
>
> [2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
>
> [2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran
> for 0 seconds
>
> [2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
>
> [2020-06-27T01:24:26.276] [963771.batch] task/cgroup:
> /slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB
> memsw.limit=147456MB
>
> [2020-06-27T01:24:26.284] [963771.batch] task/cgroup:
> /slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB
> memsw.limit=147456MB
>
> [2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using
> sched_affinity for tasks
>
> [2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON
> node0802 CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
>
> [2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD
> TERMINATED ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH
> SIGNALS ***
>
> [2020-06-27T02:25:27.009] [963771.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
>
> [2020-06-27T02:25:27.011] [963771.batch] done with job
>
>
>
> If I try to get information about this job, I get nothing:
>
>
>
> sacct -j 963771
>
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
>
>
>
> Why don't I get any information about this job?
>
>
>
> Thanks in advance
>
> Angelines


[slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
Hi,

We are running Slurm version 18.08.6.
One of my nodes is in the drain state with Reason=Kill task failed 
[root@2020-06-27T02:25:29]

On the node, slurmd.log shows:

[2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for node: 
0x0F
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for node: 
0x55
[2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
[2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran for 
0 seconds
[2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
[2020-06-27T01:24:26.276] [963771.batch] task/cgroup: 
/slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB 
memsw.limit=147456MB
[2020-06-27T01:24:26.284] [963771.batch] task/cgroup: 
/slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB 
memsw.limit=147456MB
[2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using 
sched_affinity for tasks
[2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON node0802 
CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
[2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD TERMINATED 
ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-06-27T02:25:27.009] [963771.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, 
error:4001 status 15
[2020-06-27T02:25:27.011] [963771.batch] done with job
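
The timestamps in this log tell most of the story: the job ran for about an hour, hit its time limit, and roughly 60 seconds later slurmstepd gave up because the processes had not ended, shortly before the drain reason was stamped (02:25:29). A minimal sketch in Python that works out those two intervals from the timestamps above; the variable names are illustrative only. The 60-second gap would match a 60-second UnkillableStepTimeout in slurm.conf, but the thread does not show the site's configuration, so treat that as an assumption.

```python
from datetime import datetime

# Timestamp layout used by slurmd.log in the excerpt above.
FMT = "%Y-%m-%dT%H:%M:%S.%f"

launched = datetime.strptime("2020-06-27T01:24:26.247", FMT)    # "Launching batch job"
cancelled = datetime.strptime("2020-06-27T02:24:26.933", FMT)   # "DUE TO TIME LIMIT"
terminated = datetime.strptime("2020-06-27T02:25:27.009", FMT)  # "STEPD TERMINATED"

run_seconds = (cancelled - launched).total_seconds()
kill_seconds = (terminated - cancelled).total_seconds()

print(f"ran for {run_seconds:.3f} s before the time limit")
print(f"stepd gave up {kill_seconds:.3f} s after the cancel")
```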

If I try to get information about this job, I get nothing:

sacct -j 963771
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

Why don't I get any information about this job?
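
For anyone scripting this check, the "header but no rows" case can be detected by skipping the first two lines of the default sacct output. This is only a sketch in Python under that assumption about the layout (one header line, one dashed separator line); the function name is hypothetical, and the separator widths in the sample are reconstructed, not copied from the post.

```python
def sacct_job_rows(output):
    """Return the data rows of default sacct output, skipping the header lines."""
    lines = [line for line in output.splitlines() if line.strip()]
    # Assumed layout: line 1 is the column header, line 2 the dashed separator.
    return [line.split() for line in lines[2:]]


# Header and separator only, as in the post (separator widths reconstructed).
EMPTY_OUTPUT = """\
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
"""

rows = sacct_job_rows(EMPTY_OUTPUT)
print("no accounting records" if not rows else rows)
```

An empty list here only says that the accounting query returned no record for the job; it does not say why.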

Thanks in advance
Angelines

