[slurm-users] Checkpoint

2021-11-29 Thread Alberto Morillas, Angelines
Hi!
I need your help

How could I use checkpointing (DMTCP) with Slurm?

Thanks in advance

Angelines
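
A common pattern is to launch the application under `dmtcp_launch` inside the batch script, and resume later from the restart script that DMTCP generates. A minimal job-script sketch, assuming DMTCP is installed on the compute nodes; the checkpoint directory, interval, and application name `my_app` are illustrative:

```shell
#!/bin/bash
#SBATCH --job-name=ckpt-demo
#SBATCH --time=01:00:00

# Checkpoint images go here (hypothetical path)
export DMTCP_CHECKPOINT_DIR=$SLURM_SUBMIT_DIR/ckpt
mkdir -p "$DMTCP_CHECKPOINT_DIR"

# Run under DMTCP, writing a checkpoint every 300 seconds
dmtcp_launch --ckptdir "$DMTCP_CHECKPOINT_DIR" -i 300 ./my_app

# A later job can resume from the newest checkpoint with the
# generated restart script:
#   "$DMTCP_CHECKPOINT_DIR"/dmtcp_restart_script.sh
```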


[slurm-users] xalloc

2021-03-09 Thread Alberto Morillas, Angelines
Hi,
I need your help.
I have users who need an interactive shell on a compute node, with the 
possibility of running programs with a graphical user interface directly on the 
compute node.
Looking for information I found the xalloc command, but it must be a 
wrapper, because it isn't part of our Slurm installation.
Can someone help me?
Thanks in advance
Angelines
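
`xalloc` appears to be a site-specific wrapper rather than a standard Slurm command. Stock Slurm (17.11 and later) can do this with its built-in X11 forwarding; a sketch, assuming Slurm was built with X11 support and the user logged in with `ssh -X`/`ssh -Y`:

```shell
# slurm.conf (cluster-wide): enable Slurm's X11 forwarding
PrologFlags=X11

# User side: interactive shell on a compute node, X11 tunnelled back
srun --x11 --pty bash
xclock   # GUI programs started here display on the user's desktop
```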




Re: [slurm-users] Get original script of a job

2021-03-07 Thread Alberto Morillas, Angelines

Thanks for your help!


 
Dra. Angelines Alberto Morillas
 
Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax:   +34 91 346 6537
 
skype: angelines.alberto
 
CIEMAT
Avenida Complutense, 40
28040 MADRID

 

On 5/3/21 14:14, "slurm-users on behalf of 
slurm-users-requ...@lists.schedmd.com" wrote:

Send slurm-users mailing list submissions to
slurm-users@lists.schedmd.com

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
or, via email, send a message with subject or body 'help' to
slurm-users-requ...@lists.schedmd.com

You can reach the person managing the list at
slurm-users-ow...@lists.schedmd.com

When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."


Today's Topics:

   1. Re: slurm-users Digest, Vol 41, Issue 13
      (Alberto Morillas, Angelines)
   2. Re: Get original script of a job (Ward Poelmans)
   3. Re: Get original script of a job (Carl Ponder)
   4. Use nodes exclusive and shared simultaneously (Heckes, Frank)


--

Message: 1
Date: Fri, 5 Mar 2021 12:15:21 +0000
From: "Alberto Morillas, Angelines" 
To: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] slurm-users Digest, Vol 41, Issue 13
Message-ID:



Content-Type: text/plain; charset="us-ascii"

Thanks!


From: slurm-users on behalf of 
slurm-users-requ...@lists.schedmd.com 
Sent: Friday, March 5, 2021 1:00:01 PM
To: slurm-users@lists.schedmd.com 
Subject: slurm-users Digest, Vol 41, Issue 13



Today's Topics:

   1. Re: Get original script of a job (Ole Holm Nielsen)


--

Message: 1
Date: Fri, 5 Mar 2021 11:56:05 +0100
From: Ole Holm Nielsen 
To: 
Subject: Re: [slurm-users] Get original script of a job
Message-ID: <61c47956-5fc9-5be5-9aef-08f8e27bf...@fysik.dtu.dk>
Content-Type: text/plain; charset="utf-8"; format=flowed

On 05-03-2021 11:29, Alberto Morillas, Angelines wrote:
> I would like to know if it would be possible to get the script that was
> used to submit a job.
>
> I know that when I submit a job with scontrol I can get the path and the
> name of the script used to submit this job, but users normally change
> their scripts, and sometimes everything is wrong afterwards. Is there any
> possibility of reproducing the script of an old job?

The Slurm database doesn't store old jobs' Command (i.e., the job script
path), let alone the script itself! See "man sacct" and the list printed
by --helpformat (the fields that can be specified with the --format option).

This was also discussed in a previous mailing list thread:
https://lists.schedmd.com/pipermail/slurm-users/2020-April/005307.html

/Ole




End of slurm-users Digest, Vol 41, Issue 13
***

--

Message: 2
Date: Fri, 5 Mar 2021 13:28:03 +0100
From: Ward Poelmans 
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Get original script of a job
Message-ID: 
Content-Type: text/plain; charset=utf-8

Hi,

On 5/03/2021 11:29, Alberto Morillas, Angelines wrote:

> I know that when I submit a job with scontrol I can get the path and the
> name of the script used to submit this job, but users normally change
> their scripts, and sometimes everything is wrong afterwards. Is there any
> possibility of reproducing the script of an old job?

It's not stored by default. You can have a look at
https://github.com/itkovian/sarchive for archiving job scripts.
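
Two additions that may help, hedged on Slurm version: for a job that is still pending or running, `scontrol` can dump the submitted script; and from Slurm 21.08 onward, setting `AccountingStoreFlags=job_script` in slurm.conf makes slurmdbd keep the script so `sacct` can retrieve it after the job finishes. A sketch (the job ID is illustrative):

```shell
# While the job is still known to slurmctld (pending/running):
scontrol write batch_script 12345       # writes slurm-12345.sh in the cwd

# Slurm >= 21.08, with AccountingStoreFlags=job_script in slurm.conf:
sacct -j 12345 --batch-script           # prints the stored script
```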


Re: [slurm-users] slurm-users Digest, Vol 41, Issue 13

2021-03-05 Thread Alberto Morillas, Angelines
Thanks!




[slurm-users] Get original script of a job

2021-03-05 Thread Alberto Morillas, Angelines
Hi,

I would like to know if it would be possible to get the script that was used to 
submit a job.
I know that when I submit a job with scontrol I can get the path and the name of 
the script used to submit this job, but users normally change their scripts, 
and sometimes everything is wrong afterwards. Is there any possibility of 
reproducing the script of an old job?

Thanks in advance


Angelines



Re: [slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
...
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964556 with not NULL job resources
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964574 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964557 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964558 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964559 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964560 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes: calling _get_req_features() for JobId=964573 with not NULL job resources
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 WEXITSTATUS 0
[2020-06-30T11:46:53.986] _job_complete: JobId=964580 done
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for JobId=964294 with not NULL job resources
[2020-06-30T11:46:54.377] error: select_nodes: calling _get_req_features() for JobId=964295 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964296 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964297 with not NULL job resources
[2020-06-30T11:46:54.378] error: select_nodes: calling _get_req_features() for JobId=964298 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964299 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964300 with not NULL job resources
[2020-06-30T11:46:54.379] error: select_nodes: calling _get_req_features() for JobId=964301 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for JobId=964302 with not NULL job resources
[2020-06-30T11:46:54.380] error: select_nodes: calling _get_req_features() for JobId=964303 with not NULL job resources


I have a per-user limit on cores/nodes, and these errors appear to be related to it.
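
To see which association limit is involved, both the configured limits and the scheduler's pending reasons can be read back; a sketch (the username is hypothetical):

```shell
# Per-user association limits (GrpTRES etc.)
sacctmgr show assoc where user=someuser format=User,Account,GrpTRES,GrpJobs,MaxJobs

# Pending jobs and the reason the scheduler gives for holding them
squeue -u someuser --state=PD -o "%.10i %.9P %.8u %.20r"
```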

 
Angelines Alberto Morillas
 

On 30/6/20 10:54, "slurm-users on behalf of 
slurm-users-requ...@lists.schedmd.com" wrote:



Today's Topics:

   1. Re: fail job (Gestió Servidors)


--

Message: 1
Date: Tue, 30 Jun 2020 08:55:01 +
From: Gestió Servidors 
To: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] fail job
Message-ID:



Content-Type: text/plain; charset="iso-8859-1"

Can you also post the slurmctld log file from the server (controller)?




End of slurm-users Digest, Vol 32, Issue 71
***



[slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
Hi,

We have slurm version 18.08.6
One of my nodes is in drain state: Reason=Kill task failed [root@2020-06-27T02:25:29]

In the node I can see in the slurmd.log

[2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for node: 0x0F
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU final HW mask for node: 0x55
[2020-06-27T01:24:26.247] _run_prolog: run job script took usec=4537
[2020-06-27T01:24:26.247] _run_prolog: prolog with lock for job 963771 ran for 0 seconds
[2020-06-27T01:24:26.247] Launching batch job 963771 for UID 5200
[2020-06-27T01:24:26.276] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
[2020-06-27T01:24:26.284] [963771.batch] task/cgroup: /slurm/uid_5200/job_963771/step_batch: alloc=147456MB mem.limit=147456MB memsw.limit=147456MB
[2020-06-27T01:24:26.310] [963771.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-06-27T02:24:26.933] [963771.batch] error: *** JOB 963771 ON node0802 CANCELLED AT 2020-06-27T02:24:26 DUE TO TIME LIMIT ***
[2020-06-27T02:25:27.009] [963771.batch] error: *** JOB 963771 STEPD TERMINATED ON node0802 AT 2020-06-27T02:25:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-06-27T02:25:27.009] [963771.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2020-06-27T02:25:27.011] [963771.batch] done with job

If I try to get information about this job, I get nothing:

sacct -j 963771
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

Why don't I get information about this job?

Thanks in advance
Angelines
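
One thing worth checking, hedged because it depends on how accounting is set up: by default `sacct` only reports jobs started since 00:00:00 of the current day, so an explicit time window may be needed for older jobs; and if no accounting storage (slurmdbd) is configured, finished jobs will not appear at all. A sketch:

```shell
# Ask sacct explicitly for the day the job ran
sacct -S 2020-06-27T00:00 -E 2020-06-28T00:00 -j 963771 \
      --format=JobID,JobName,State,ExitCode,Elapsed

# Check whether accounting storage is configured at all
scontrol show config | grep -i AccountingStorageType
```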


Angelines Alberto Morillas






Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3

2020-06-01 Thread Alberto Morillas, Angelines
Yes, I tried it, but with the same result:
openmpi@4.0.3 -cuda +cxx_exceptions fabrics=ucx -java -legacylaunchers 
-memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple +vt

WRF compiles, and when you sbatch the job it appears to be running, but it 
doesn't do anything; we see the same as before, with WCHAN=hrtime:
0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?   00:05:25 real.exe

--

Message: 2
Date: Mon, 1 Jun 2020 16:56:05 +
From: "Pritchard Jr., Howard" 
To: Slurm User Community List 
Subject: Re: [slurm-users] [EXTERNAL]  problems with OpenMPI 4.0.3
Message-ID: <20dc51ae-9f58-4b1c-b619-1a2077d5c...@lanl.gov>
Content-Type: text/plain; charset="utf-8"

Hi Angelines,

Could you try reinstalling with fabrics=ucx and rerunning?
UCX is the preferred way to use InfiniBand in the Open MPI 4.0.x release
stream.

Howard


Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3

2020-06-01 Thread Alberto Morillas, Angelines
Hello Howard

I installed it with spack: 
openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers 
-memchecker  +pmi schedulers=slurm -sqlite3 -thread_multiple +vt
where - --> not enabled
+ --> enabled

Thanks in advance.

 
Angelines Alberto Morillas
 
 
 
 

On 1/6/20 18:16, "slurm-users on behalf of 
slurm-users-requ...@lists.schedmd.com" wrote:



Today's Topics:

   1. Re: Slurm Job Count Credit system (Renfro, Michael)
   2. Re: [EXTERNAL]  problems with OpenMPI 4.0.3
  (Pritchard Jr., Howard)
   3. Re: Slurm Job Count Credit system (Songpon Srisawai)


--

Message: 1
Date: Mon, 1 Jun 2020 15:15:29 +
From: "Renfro, Michael" 
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm Job Count Credit system
Message-ID: 
Content-Type: text/plain; charset="utf-8"

Even without the slurm-bank system, you can enforce a limit on resources 
with a QOS applied to those users. Something like:

=

sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
sacctmgr modify qos bank1 set grptresmins=cpu=1000

sacctmgr add account bank1
sacctmgr modify account name=bank1 set qos+=bank1

sacctmgr add user someuser account=bank1
sacctmgr modify user someuser set qos+=bank1

=

You can do lots with a QOS, including limiting the number of simultaneous 
running jobs, simultaneous running/queued jobs, etc. Unfortunately, the NoDecay 
flag is only documented to work on GrpTRESMins, GrpWall, and UsageRaw, not on 
the job count.

So if you can live with limiting the number of simultaneous jobs instead of 
a total number of jobs per time period, that's possible with QOS. Otherwise, 
maybe someone else will have an idea.

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology 
Services
931 372-3601 / Tennessee Tech University

> On May 31, 2020, at 11:35 AM, Songpon Srisawai 
 wrote:
> 
> Hello all,
> 
> I'm a Slurm beginner trying to set up our cluster. I would like to know 
whether there is any Slurm credit/token system plugin, such as one based on 
job count.
> 
> I found Slurm-bank, which deposits hours into an account, but I would like 
to deposit job tokens instead of hours.
> 
> Thanks for any recommendation
> Songpon 
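
The QOS recipe above can be read back once created; a sketch of the verification commands (the QOS name `bank1` comes from the example above):

```shell
# Confirm the QOS limits and flags took effect
sacctmgr show qos where name=bank1 format=Name,Flags,GrpTRESMins

# Watch accumulated usage against the NoDecay limit
sshare -l -A bank1
```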


--

Message: 2
Date: Mon, 1 Jun 2020 16:13:11 +
From: "Pritchard Jr., Howard" 
To: Slurm User Community List 
Subject: Re: [slurm-users] [EXTERNAL]  problems with OpenMPI 4.0.3
Message-ID: 
Content-Type: text/plain; charset="utf-8"

Hello Angelines,

Do you know how the Open MPI 4.0.3 package was configured and built? That 
information would be useful to help diagnose the problem.

Thanks,

Howard



[slurm-users] problems with OpenMPI 4.0.3

2020-05-29 Thread Alberto Morillas, Angelines
Good morning,

We have a cluster with two kinds of InfiniBand cards: ConnectX-4 and 
ConnectX-6.
OpenMPI 3.1.3 works fine, but when we brought up the ConnectX-6 nodes we 
started to use OpenMPI 4.0.3 (which supports ConnectX-6). Programs that have 
several parts, first a call to a sequential program and inside it a call to a 
parallel program (in our case the program is WRF, but we have others like this 
with the same problem), suddenly stop:

…..
0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?   00:05:25 real.exe
0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?   00:05:33 real.exe
0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?   00:05:28 real.exe
……
WCHAN=hrtime, so it looks like it is running, but it really doesn't do any 
work.

We don't know if it could be a problem between Slurm and this version of 
OpenMPI. Any idea?
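
One way to narrow this down, offered as a hedged suggestion: check which transport Open MPI actually selected, and force the UCX point-to-point layer explicitly (UCX is the supported path for ConnectX-6):

```shell
# Was Open MPI built with UCX support at all?
ompi_info | grep -i ucx

# Force the UCX PML and have Open MPI report what it selects
mpirun --mca pml ucx --mca pml_base_verbose 10 ./real.exe
```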




Angelines Alberto Morillas






[slurm-users] problems with number of jobs with GrpTres

2020-05-19 Thread Alberto Morillas, Angelines
Hello,

I have a problem with GrpTRES. I specify the limits with
sacctmgr --immediate modify user where user=  set GrpTRES=cpu=144,node=4

but when a user submits serial jobs, for example 5 jobs, the user can only 
run 4, and the rest of the jobs are PD with reason=AssocGrpNodeLimit.
I could understand this if the jobs were on different nodes, but all of them 
are running on the same node:

             JOBID PARTITION  NAME  USER  ST    TIME  NODES  NODELIST(REASON)
            887783   cluster   mut        PD    0:00      1  (AssocGrpNodeLimit)
            887784   cluster   mut        PD    0:00      1  (AssocGrpNodeLimit)
            887785   cluster   mut        PD    0:00      1  (AssocGrpNodeLimit)
            887780   cluster   mut         R    0:02      1  n1301
            887781   cluster   mut         R    0:02      1  n1301
            887782   cluster   mut         R    0:02      1  n1301
            887779   cluster   mut         R    0:05      1  n1301

I want users to be able to use up to 4 nodes and/or 144 cores. With parallel 
jobs it works fine, and if a user submits one job with 144 serial tasks inside 
it, that works too. The problem is that when a user submits serial jobs, the 
node=4 limit behaves like jobs=4, and that isn't my intention.

Any help, please?
Thanks in advance
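
The AssocGrpNodeLimit behaviour above matches how GrpTRES counts nodes: the limit sums the nodes allocated to each job, so four one-node jobs consume node=4 even when they land on the same physical node. A possible workaround, offered as a hedged sketch (the username is illustrative): enforce only the CPU total at the group level, and cap nodes per job instead:

```shell
# Group limit on total CPUs only; clear the group node limit (-1 clears)
sacctmgr --immediate modify user where user=someuser set GrpTRES=cpu=144,node=-1

# Cap any single job at 4 nodes instead (MaxTRES is the per-job limit)
sacctmgr --immediate modify user where user=someuser set MaxTRES=node=4
```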




Angelines Alberto Morillas






[slurm-users] slurm problem with GrpTres

2020-05-04 Thread Alberto Morillas, Angelines
Hello,

I have a problem with GrpTRES. I specify the limits with
sacctmgr --immediate modify user where user=  set GrpTRES=cpu=144,node=4

but when a user submits serial jobs, for example 5 jobs, the user can only 
run 4, and the rest of the jobs are PD with reason=AssocGrpNodeLimit.
I could understand this if the jobs were on different nodes, but all of them 
are running on the same node:

             JOBID PARTITION  NAME  USER  ST    TIME  NODES  NODELIST(REASON)
            887783   cluster   mut        PD    0:00      1  (AssocGrpNodeLimit)
            887784   cluster   mut        PD    0:00      1  (AssocGrpNodeLimit)
            887785   cluster   mut        PD    0:00      1  (AssocGrpNodeLimit)
            887780   cluster   mut         R    0:02      1  xula1301
            887781   cluster   mut         R    0:02      1  xula1301
            887782   cluster   mut         R    0:02      1  xula1301
            887779   cluster   mut         R    0:05      1  xula1301

I want users to be able to use up to 4 nodes and/or 144 cores. With parallel 
jobs it works fine, and if a user submits one job with 144 serial tasks inside 
it, that works too. The problem is that when a user submits serial jobs, the 
node=4 limit behaves like jobs=4, and that isn't my intention.

Any help, please?
Thanks in advance


Angelines Alberto Morillas
