Re: [slurm-users] Problem with cgroup plugin in Ubuntu 22.04 and slurm 21.08.5

2023-04-21 Thread Angel de Vicente
Hello,

Michael Gutteridge  writes:

> Does this link help? 
>
>> Debian and derivatives (e.g. Ubuntu) usually exclude the memory and 
>> memsw (swap) cgroups by default. To include them, add the following 
>> parameters to the kernel command line: cgroup_enable=memory swapaccount=1

On the old machine (Ubuntu 18.04) we don't set those kernel parameters,
and Slurm seems to have no issues with cgroups. (What happens if they are
not set? Would you get something like what I was reporting, that the
plugin cannot be loaded, or simply that cgroup would not be able to
enforce memory policies?)

> I'm using Bionic (18) and after applying those changes it seems to be
> working OK for me. I don't believe that Ubuntu has changed memory
> cgroup configuration between 18 and 22, but we're only starting to use
> 22.

I see that there are some differences in cgroup handling between 18.04
and 22.04, but I don't understand them well enough to figure out what the
issue could be...
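
(One generic way to compare the two machines, in case it is useful; the
check below is not Slurm-specific, and the second command only exists
where cgroup v2 is mounted:)

,
| # "cgroup2fs" means the unified cgroup v2 hierarchy; "tmpfs" means a
| # v1 or hybrid layout
| stat -fc %T /sys/fs/cgroup/
|
| # on a unified (v2) system this lists the controllers that are available
| cat /sys/fs/cgroup/cgroup.controllers
`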

Cheers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52



Re: [slurm-users] Problem with cgroup plugin in Ubuntu 22.04 and slurm 21.08.5

2023-04-21 Thread Michael Gutteridge
Does this link help?

> Debian and derivatives (e.g. Ubuntu) usually exclude the memory and
> memsw (swap) cgroups by default. To include them, add the following
> parameters to the kernel command line: cgroup_enable=memory swapaccount=1

I'm using Bionic (18) and after applying those changes it seems to be
working OK for me. I don't believe that Ubuntu has changed memory cgroup
configuration between 18 and 22, but we're only starting to use 22.
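
For reference, applying those kernel parameters on Ubuntu usually looks
roughly like this (a sketch assuming the stock GRUB boot loader; append
the parameters to whatever GRUB_CMDLINE_LINUX_DEFAULT already contains
rather than replacing it):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash cgroup_enable=memory swapaccount=1"

    # regenerate the GRUB configuration and reboot for it to take effect
    sudo update-grub
    sudo reboot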

 - Michael

On Fri, Apr 21, 2023 at 9:01 AM Angel de Vicente wrote:

> Hello,
>
> Hermann Schwärzler  writes:
>
> > which version of cgroups does Ubuntu 22.04 use?
>
> I'm a cgroups noob, but my understanding is that both v2 and v1 coexist
> in Ubuntu 22.04
> (https://manpages.ubuntu.com/manpages/jammy/man7/cgroups.7.html). I have
> another machine with Ubuntu 18.04, which also has (AFAIK) both versions
> (https://manpages.ubuntu.com/manpages/jammy/man7/cgroups.7.html) and
> where Slurm (slurm-wlm) 21.08.8-2 is installed, and I have no cgroups
> issues there.
>
> > What is the output of "mount | grep cgroup" on your system?
>
> ,
> | mount | grep cgroup
> | cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
> | cpuacct on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
> | freezer on /cgroup/freezer type cgroup (rw,relatime,freezer)
> | cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
> `
>
> Thanks for any help/pointers,
> --
> Ángel de Vicente
>  Research Software Engineer (Supercomputing and BigData)
>  Tel.: +34 922-605-747
>  Web.: http://research.iac.es/proyecto/polmag/
>
>  GPG: 0x8BDC390B69033F52
>


Re: [slurm-users] Problem with cgroup plugin in Ubuntu 22.04 and slurm 21.08.5

2023-04-21 Thread Angel de Vicente
Hello,

Hermann Schwärzler  writes:

> which version of cgroups does Ubuntu 22.04 use?

I'm a cgroups noob, but my understanding is that both v2 and v1 coexist
in Ubuntu 22.04
(https://manpages.ubuntu.com/manpages/jammy/man7/cgroups.7.html). I have
another machine with Ubuntu 18.04, which also has (AFAIK) both versions
(https://manpages.ubuntu.com/manpages/jammy/man7/cgroups.7.html) and
where Slurm (slurm-wlm) 21.08.8-2 is installed, and I have no cgroups
issues there.

> What is the output of "mount | grep cgroup" on your system?

,
| mount | grep cgroup
| cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
| cpuacct on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
| freezer on /cgroup/freezer type cgroup (rw,relatime,freezer)
| cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
`

Thanks for any help/pointers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52



Re: [slurm-users] Resource Limits

2023-04-21 Thread Hoot Thompson
After assistance from an AWS colleague, GrpTRESMins seems to be working.
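
For anyone finding this in the archives later, setting such a cap generally
looks something like the following (a generic sketch, not necessarily the
exact steps taken here; the account name and limit value are made-up
examples, and AccountingStorageEnforce must include "limits" for the cap to
actually be enforced):

    # slurm.conf
    AccountingStorageEnforce=limits,safe

    # cap the account at 100000 CPU-minutes of usage (example values)
    sacctmgr modify account name=myproject set GrpTRESMins=cpu=100000

    # review the limits and recorded usage for the account's associations
    sshare -A myproject -l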

Hoot

> On Apr 21, 2023, at 4:43 AM, Ole Holm Nielsen wrote:
> 
> Hi Jason,
> 
> On 4/20/23 20:11, Jason Simms wrote:
>> Hello Ole and Hoot,
>> First, Hoot, thank you for your question. I've managed Slurm for a few years 
>> now and still feel like I don't have a great understanding about managing or 
>> limiting resources.
>> Ole, thanks for your continued support of the user community with your 
>> documentation. I do wish not only that more of your information were 
>> contained within the official docs, but also that there were even clearer 
>> discussions around certain topics.
>> As an example, you write that "It is important to configure slurm.conf so 
>> that the locked memory limit isn’t propagated to the batch jobs" by setting 
>> PropagateResourceLimitsExcept=MEMLOCK. It's unclear to me whether you are 
>> suggesting that literally everyone should have that set, or whether it only 
>> applies to certain configurations. We don't have it set, for instance, but 
>> we've not run into trouble with jobs failing due to locked memory errors.
> 
> The link mentioned in the page hopefully explains it: 
> https://slurm.schedmd.com/faq.html#memlock
> 
>> Then, in the official docs, to which you link, it says that "it may also be 
>> desirable to lock the slurmd daemon's memory to help ensure that it keeps 
>> responding if memory swapping begins" by creating /etc/sysconfig/slurm 
>> containing the line SLURMD_OPTIONS="-M". Would there ever be a reason *not* 
>> to include that? That is, I can't think it would ever be desirable for 
>> slurmd to stop responding. So is that another "universal" recommendation, I 
>> wonder?
> 
> I'm not an expert on locking slurmd pages!  The -M option is documented in
> the slurmd manual page, and I probably read a thread about this long ago on
> the slurm-users mailing list.  You could try it out in your environment and
> see if all is well.
> 
>> It may be me talking as a new-ish user, but I would find it helpful to have
>> a concise document laying out common or useful configuration options when
>> setting up or reconfiguring Slurm. I'm certain I have inefficient settings
>> or am missing options that I should have.
> 
> IMHO, most sites have their own requirements and preferences, so I don't 
> think there is a one-size-fits-all Slurm installation solution.
> 
> Since requirements can be so different, and because Slurm is a fantastic
> piece of software that can be configured for many different scenarios, IMHO
> a support contract with SchedMD is the best way to get consulting services,
> get general help, and report bugs.  We have had excellent experiences with
> SchedMD support (https://www.schedmd.com/support.php).
> 
> Best regards,
> Ole
> 
>> On Thu, Apr 20, 2023 at 2:11 AM Ole Holm Nielsen wrote:
>>
>> Hi Hoot,
>>
>> On 4/20/23 00:15, Hoot Thompson wrote:
>> > Is there a ‘how to’ or recipe document for setting up and enforcing
>> > resource limits? I can establish accounts, users, and set limits but
>> > 'current value' is not incrementing after running jobs.
>>
>> I have written about resource limits in this Wiki page:
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits




Re: [slurm-users] Problem with cgroup plugin in Ubuntu 22.04 and slurm 21.08.5

2023-04-21 Thread Hermann Schwärzler

Hi Ángel,

which version of cgroups does Ubuntu 22.04 use?

What is the output of
mount | grep cgroup
on your system?

Regards,
Hermann

On 4/21/23 14:33, Angel de Vicente wrote:

Hello,

I've installed Slurm in a workstation (this is a single-node install)
with Ubuntu 22.04, and have installed Slurm version 21.08.5 (I didn't
compile it myself, just installed it with "apt install").

In the slurm.conf file I have:

,
| ProctrackType=proctrack/cgroup
| TaskPlugin=task/affinity,task/cgroup
`

When I submit a job, the "slurmd.log" shows:

,
| [2023-04-21T12:22:14.128] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 127018
| [2023-04-21T12:22:14.128] task/affinity: batch_bind: job 127018 CPU input mask for node: 0x0001
| [2023-04-21T12:22:14.129] task/affinity: batch_bind: job 127018 CPU final HW mask for node: 0x0001
| [2023-04-21T12:22:14.156] [127018.extern] error: cgroup namespace 'cpuset' not mounted. aborting
| [2023-04-21T12:22:14.156] [127018.extern] error: unable to create cpuset cgroup namespace
| [2023-04-21T12:22:14.156] [127018.extern] error: cgroup namespace 'memory' not mounted. aborting
| [2023-04-21T12:22:14.156] [127018.extern] error: unable to create memory cgroup namespace
| [2023-04-21T12:22:14.156] [127018.extern] error: failure enabling memory enforcement: Unspecified error
| [2023-04-21T12:22:14.156] [127018.extern] error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
| [2023-04-21T12:22:14.156] [127018.extern] error: cannot create task context for task/cgroup
| [2023-04-21T12:22:14.156] [127018.extern] error: job_manager: exiting abnormally: Plugin initialization failed
`

If I change TaskPlugin to be just

,
| TaskPlugin=task/affinity
`

then the job executes without any problems.

Do you know how I could fix this while keeping the cgroup plugin? My
intuition tells me that I should probably get the latest version of
Slurm and compile it myself, but I thought I would ask here before going
that route.

Any ideas/pointers? Many thanks,




[slurm-users] Problem with cgroup plugin in Ubuntu 22.04 and slurm 21.08.5

2023-04-21 Thread Angel de Vicente
Hello,

I've installed Slurm in a workstation (this is a single-node install)
with Ubuntu 22.04, and have installed Slurm version 21.08.5 (I didn't
compile it myself, just installed it with "apt install").

In the slurm.conf file I have:

,
| ProctrackType=proctrack/cgroup
| TaskPlugin=task/affinity,task/cgroup
`
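
(For completeness, the task/cgroup plugin is also driven by cgroup.conf; a
generic minimal example, not necessarily what this node uses, would look
something like:)

,
| # cgroup.conf (generic example)
| CgroupAutomount=yes
| ConstrainCores=yes
| ConstrainRAMSpace=yes
`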

When I submit a job, the "slurmd.log" shows:

,
| [2023-04-21T12:22:14.128] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 127018
| [2023-04-21T12:22:14.128] task/affinity: batch_bind: job 127018 CPU input mask for node: 0x0001
| [2023-04-21T12:22:14.129] task/affinity: batch_bind: job 127018 CPU final HW mask for node: 0x0001
| [2023-04-21T12:22:14.156] [127018.extern] error: cgroup namespace 'cpuset' not mounted. aborting
| [2023-04-21T12:22:14.156] [127018.extern] error: unable to create cpuset cgroup namespace
| [2023-04-21T12:22:14.156] [127018.extern] error: cgroup namespace 'memory' not mounted. aborting
| [2023-04-21T12:22:14.156] [127018.extern] error: unable to create memory cgroup namespace
| [2023-04-21T12:22:14.156] [127018.extern] error: failure enabling memory enforcement: Unspecified error
| [2023-04-21T12:22:14.156] [127018.extern] error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
| [2023-04-21T12:22:14.156] [127018.extern] error: cannot create task context for task/cgroup
| [2023-04-21T12:22:14.156] [127018.extern] error: job_manager: exiting abnormally: Plugin initialization failed
`

If I change TaskPlugin to be just

,
| TaskPlugin=task/affinity
`

then the job executes without any problems.

Do you know how I could fix this while keeping the cgroup plugin? My
intuition tells me that I should probably get the latest version of
Slurm and compile it myself, but I thought I would ask here before going
that route.

Any ideas/pointers? Many thanks,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52



Re: [slurm-users] Resource Limits

2023-04-21 Thread Ole Holm Nielsen

Hi Jason,

On 4/20/23 20:11, Jason Simms wrote:

> Hello Ole and Hoot,
>
> First, Hoot, thank you for your question. I've managed Slurm for a few
> years now and still feel like I don't have a great understanding about
> managing or limiting resources.


> Ole, thanks for your continued support of the user community with your
> documentation. I do wish not only that more of your information were
> contained within the official docs, but also that there were even clearer
> discussions around certain topics.


> As an example, you write that "It is important to configure slurm.conf so
> that the locked memory limit isn’t propagated to the batch jobs" by
> setting PropagateResourceLimitsExcept=MEMLOCK. It's unclear to me whether
> you are suggesting that literally everyone should have that set, or
> whether it only applies to certain configurations. We don't have it set,
> for instance, but we've not run into trouble with jobs failing due to
> locked memory errors.


The link mentioned in the page hopefully explains it: 
https://slurm.schedmd.com/faq.html#memlock
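
(A quick way to see what limit jobs actually end up with, assuming you can
submit a test job, is something like:)

    # print the locked-memory limit as seen inside a job step
    srun bash -c 'ulimit -l'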


> Then, in the official docs, to which you link, it says that "it may also
> be desirable to lock the slurmd daemon's memory to help ensure that it
> keeps responding if memory swapping begins" by creating
> /etc/sysconfig/slurm containing the line SLURMD_OPTIONS="-M". Would there
> ever be a reason *not* to include that? That is, I can't think it would
> ever be desirable for slurmd to stop responding. So is that another
> "universal" recommendation, I wonder?


I'm not an expert on locking slurmd pages!  The -M option is documented in
the slurmd manual page, and I probably read a thread about this long ago on
the slurm-users mailing list.  You could try it out in your environment and
see if all is well.
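
To make the two settings being discussed concrete, they end up looking
roughly like this (a sketch; the location of the slurmd options file
depends on your packaging, /etc/sysconfig/slurm is the path the official
docs use):

    # slurm.conf
    PropagateResourceLimitsExcept=MEMLOCK

    # /etc/sysconfig/slurm (or your distribution's slurmd environment file)
    SLURMD_OPTIONS="-M"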


> It may be me talking as a new-ish user, but I would find it helpful to have
> a concise document laying out common or useful configuration options when
> setting up or reconfiguring Slurm. I'm certain I have inefficient settings
> or am missing options that I should have.


IMHO, most sites have their own requirements and preferences, so I don't 
think there is a one-size-fits-all Slurm installation solution.


Since requirements can be so different, and because Slurm is a fantastic
piece of software that can be configured for many different scenarios, IMHO
a support contract with SchedMD is the best way to get consulting services,
get general help, and report bugs.  We have had excellent experiences with
SchedMD support (https://www.schedmd.com/support.php).


Best regards,
Ole

> On Thu, Apr 20, 2023 at 2:11 AM Ole Holm Nielsen wrote:
>
>> Hi Hoot,
>>
>> On 4/20/23 00:15, Hoot Thompson wrote:
>> > Is there a ‘how to’ or recipe document for setting up and enforcing
>> > resource limits? I can establish accounts, users, and set limits but
>> > 'current value' is not incrementing after running jobs.
>>
>> I have written about resource limits in this Wiki page:
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits