Re: [slurm-users] fail job
...
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964556 with not NULL job resources
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964574 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964557 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964558 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964559 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964560 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964573 with not NULL job resources
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 WEXITSTATUS 0
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 done
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for JobId=964294 with not NULL job resources
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for JobId=964295 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964296 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964297 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964298 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964299 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964300 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964301 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for JobId=964302 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for JobId=964303 with not NULL job resources

I have per-user limits on cores/nodes, and these errors are related to those limits.

Angelines Alberto Morillas
Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax: +34 91 346 6537
skype: angelines.alberto

CIEMAT
Avenida Complutense, 40
28040 MADRID

On 30/6/20 10:54, "Gestió Servidors" wrote via the slurm-users list:
> Can you post, also, the slurmctld log file from the server (controller)?
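Since the pending jobs seem tied to per-user core/node limits, checking the association limits on the controller can confirm which limit a job is hitting. A minimal sketch, assuming `sacctmgr` and `squeue` are available on the controller; the user name `someuser` is a placeholder:

```shell
# Show per-user association limits; "someuser" is a placeholder,
# substitute a real user from your cluster.
sacctmgr show assoc where user=someuser \
    format=Cluster,Account,User,MaxJobs,MaxTRES,GrpTRES

# Show the scheduler's reason for each pending job
# (e.g. AssocMaxCpuPerUserLimit, QOSMaxCpuPerUserLimit)
squeue --state=PENDING --format="%.10i %.9P %.8u %.20r"
```

The `%r` field in `squeue` prints the pending reason, which usually names the exact limit being enforced.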
Re: [slurm-users] fail job
Can you post, also, the slurmctld log file from the server (controller)?
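The controller's log location is site-specific (set by `SlurmctldLogFile` in slurm.conf). A quick way to find it, as a sketch; the tail path below is only an example:

```shell
# Ask the running controller where its log file lives
scontrol show config | grep -i SlurmctldLogFile

# Then follow it; /var/log/slurm/slurmctld.log is a common
# but not universal location:
# tail -f /var/log/slurm/slurmctld.log
```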
Re: [slurm-users] fail job
Hi,

Can you post the output of the following commands on your master node?

sacctmgr show cluster
scontrol show nodes

Best,
Durai Arasan
Zentrum für Datenverarbeitung Tübingen

On Tue, Jun 30, 2020 at 10:33 AM Alberto Morillas, Angelines <angelines.albe...@ciemat.es> wrote:
> Hi,
>
> We have Slurm version 18.08.6.
>
> One of my nodes is in drain state: Reason=Kill task failed [root@2020-06-27T02:25:29]
>
> On the node I can see in slurmd.log:
>
> [2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for node: 0x0F
> [2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for node: 0x55
> [2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
> [2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran for 0 seconds
> [2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
> [2020-06-27T01:24:26.276] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
> [2020-06-27T01:24:26.284] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
> [2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON node0802 CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
> [2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD TERMINATED ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
> [2020-06-27T02:25:27.009] [963771.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
> [2020-06-27T02:25:27.011] [963771.batch] done with job
>
> If I try to get information about this job, I get nothing:
>
> sacct -j 963771
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
>
> Why don't I get information about this job?
>
> Thanks in advance,
> Angelines
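A node drained with "Kill task failed" means job processes survived SIGKILL past the UnkillableStepTimeout window; once the node is clean again it must be returned to service by hand. A minimal sketch, assuming admin access on the controller:

```shell
# Confirm the drain reason on the affected node
scontrol show node node0802 | grep -E "State|Reason"

# After verifying no stray job processes remain on the node,
# return it to service:
scontrol update NodeName=node0802 State=RESUME

# If this recurs with slow filesystems or processes stuck in I/O,
# raising the timeout in slurm.conf (default 60 seconds) can help:
#   UnkillableStepTimeout=120
```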
[slurm-users] fail job
Hi,

We have Slurm version 18.08.6.

One of my nodes is in drain state: Reason=Kill task failed [root@2020-06-27T02:25:29]

On the node I can see in slurmd.log:

[2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for node: 0x0F
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for node: 0x55
[2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
[2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran for 0 seconds
[2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
[2020-06-27T01:24:26.276] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
[2020-06-27T01:24:26.284] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
[2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON node0802 CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
[2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD TERMINATED ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-06-27T02:25:27.009] [963771.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2020-06-27T02:25:27.011] [963771.batch] done with job

If I try to get information about this job, I get nothing:

sacct -j 963771
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

Why don't I get information about this job?

Thanks in advance,
Angelines
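One likely reason `sacct -j 963771` prints only headers is that sacct's default time window starts at 00:00:00 of the current day, so a job that ran on 2020-06-27 and is queried days later falls outside it. A sketch of widening the window explicitly:

```shell
# sacct only searches its default window (midnight today onward);
# give an explicit range covering when the job ran:
sacct -j 963771 --starttime=2020-06-27 --endtime=2020-06-28 \
      --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode

# If that still returns nothing, check that job accounting is
# actually being stored at all:
scontrol show config | grep -i AccountingStorageType
```

If `AccountingStorageType` is `accounting_storage/none`, completed jobs are never recorded and sacct cannot report them regardless of the time window.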