Re: [slurm-users] Does setting 'job_desc.mail_user' in job_submit.lua work?

2022-01-10 Thread Marcus Boden

Hi Loris,

I can confirm the problem: I am not able to modify the 
job_desc.mail_user. Other values can be modified, though.


We are also on 21.08.5

Best,
Marcus

On 10.01.22 11:14, Loris Bennett wrote:

Hi,

Does setting 'mail_user' in job_submit.lua actually work in Slurm
21.08.5?

If certain conditions are met, I set

   job_desc.mail_user = other_email_address

Logging tells me that this part of the code is indeed reached, but the
email address in the job is not changed.

Other parts of the plugin work, but they only read other elements of
job_desc and do not modify anything.

Am I doing something wrong?

Cheers,

Loris



--
Marcus Vincent Boden, M.Sc. (he/him)
Computing Group
Tel.:   +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen 
(GWDG) Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de


Support: Tel.: +49 551 39-3, URL: https://gwdg.de/support
Sekretariat: Tel.: +49 551 39-30001, E-Mail: g...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Norbert Lossau
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001
-




Re: [slurm-users] Bug: incorrect output directory fails silently

2021-07-08 Thread Marcus Boden
I have already answered plenty of tickets about this, from users who are 
confused that their job fails silently.
The problem is that you cannot solve this with a job_submit or cli_filter 
plugin, as you do not know the state of the file system at job runtime, or 
even what it looks like on the compute node in the end.


At least slurmd logs an error, so you could scan the logs for it and maybe 
use that to automate something.


Best,
Marcus

On 08.07.21 16:58, Jeffrey T Frey wrote:

I understand that there is no output file to write an error message to, but it 
might be good to check the `--output` path during the scheduling, just like 
`--account` is checked.

Does anybody know a workaround to be warned about the error?


I would make a feature request of SchedMD to fix the issue, then I would write 
a cli_filter plugin to validate the --output/--error/--input paths as desired 
until Slurm itself handles it.






Re: [slurm-users] Best method to determine if a node is down

2021-06-27 Thread Marcus Boden

Hi Doug,

Slurm has the strigger[1] mechanism, which can do exactly that; the man page 
even has your use case as an example. It works quite well for us.
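
A minimal sketch along the lines of the man page example (script path, 
recipient and the PERM flag are just what such a setup could look like; adjust 
to your site):

   strigger --set --node --down --flags=PERM --program=/usr/local/sbin/notify_node_down

where notify_node_down only has to mail the node names strigger passes as 
arguments, e.g.:

   #!/bin/bash
   echo "Slurm reports these nodes as DOWN: $*" | mail -s "Slurm node down" root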


Best,
Marcus

[1] https://slurm.schedmd.com/strigger.html

On 26.06.21 19:10, Doug Niven wrote:

Hi Folks,

I’d like to set up an email notification, perhaps via cron (unless there’s a 
better method), to notify the sysadmin when a Slurm node is down and/or not 
firing off jobs...

For example, using ‘squeue’ in NODELIST(REASON) I recently saw:

(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher 
priority partitions)

And using ‘sinfo’ I saw:

% sinfo -Nl
Fri May 07 08:49:26 2021
NODELIST NODES PARTITION STATE    CPUS S:C:T  MEMORY  TMP_DISK WEIGHT AVAIL_FE REASON
trom     1     short*    draining 112  2:56:2 2048000          1      (null)   Kill task failed
trom     1     long      draining 112  2:56:2 2048000          1      (null)   Kill task failed

I’m not sure what would be the best value to grep for, as I suspect there are 
other states than DOWN or DRAINED that might mean a node is down and not firing 
off jobs?

Thanks in advance for your ideas,

Doug






Re: [slurm-users] monitor draining/drain nodes

2021-06-14 Thread Marcus Boden
I think your slurm user has /sbin/nologin as the shell in 
/etc/passwd. Try `su -s /bin/bash slurm`.
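
For example, to register a drain trigger as that user in one go (trigger type 
and script path are illustrative):

   su -s /bin/bash slurm -c 'strigger --set --node --drained --program=/usr/local/sbin/mail_on_drain'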


Best,
Marcus

On 14.06.21 20:52, Rodrigo Santibáñez wrote:

Thank you Marcus, Ole and Samuel.

Regarding Samuel's answer, I added ifne from moreutils before mail to not
have empty emails.

Regarding strigger, I don't know how to become the slurm user. "su slurm"
complains "This account is currently not available.". The user "slurm"
exists and is the SlurmUser.

Best,

On Mon, Jun 14, 2021 at 5:09 AM Ole Holm Nielsen 
wrote:


On 6/14/21 7:50 AM, Marcus Boden wrote:

Slurm provides the strigger[1] utility for that. You can set it up to
automatically send mails when nodes go into drain.


I provide some Slurm triggers examples in
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers


On 12.06.21 22:29, Rodrigo Santibáñez wrote:

Hi SLURM users,

Does anyone have a cronjob or similar to monitor and warn via e-mail

when a

node is in draining/drain status?


/Ole








Re: [slurm-users] Information about finished jobs

2021-06-14 Thread Marcus Boden

Hi,

you will need to use sacct to get the information from the slurmdbd. 
It's not the same information and you will have to find the right fields 
to display, but it is pretty powerful. Have a look at the man page for 
the available fields.
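
For example, a rough equivalent for a finished job could look like this (the 
job id and field list are just a starting point; `sacct -e` lists all 
available field names):

   sacct -j 12345 --format=JobID,JobName,Partition,Account,AllocCPUS,Submit,Start,End,Elapsed,MaxRSS,State,ExitCode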


Best,
Marcus

On 14.06.21 08:26, Gestió Servidors wrote:

Hello,

How can I get all information about a finished job in the same way as "scontrol show 
jobid=" when job is pending or running?

Thanks.





Re: [slurm-users] monitor draining/drain nodes

2021-06-13 Thread Marcus Boden

Hi,

Slurm provides the strigger[1] utility for that. You can set it up to 
automatically send mails when nodes go into drain.
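
A minimal sketch of how that can look (run as the SlurmUser; the script path is 
illustrative, and the script just has to mail the node names it receives as 
arguments):

   strigger --set --node --drained --flags=PERM --program=/usr/local/sbin/mail_on_drain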


Best,
Marcus

[1] https://slurm.schedmd.com/strigger.html

On 12.06.21 22:29, Rodrigo Santibáñez wrote:

Hi SLURM users,

Does anyone have a cronjob or similar to monitor and warn via e-mail when a
node is in draining/drain status?

Thank you.

Best regards.
Rodrigo Santibáñez





Re: [slurm-users] Conflicting --nodes and --nodelist

2021-06-01 Thread Marcus Boden

Hi,

as per 
https://slurm.schedmd.com/archive/slurm-18.08.5/sbatch.html#OPT_nodelist



Request a specific list of hosts. The job will contain *all* of these hosts and 
possibly additional hosts as needed to satisfy resource requirements.


So at least in the sbatch manpage it explicitly states that all nodes 
are in the allocation. This is the same in the latest version, so I 
guess there are not many changes to be expected.


The only way I currently see to do that from the user side is to exclude all 
the other nodes with -x/--exclude. If this is for testing and more of an 
admin-side task, you could also create a reservation or a temporary partition.
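
For example (node names are made up): instead of --nodelist=node[01-02] with 
--nodes=1, exclude everything you do not want:

   sbatch --nodes=1 --exclude=node[03-16] job.sh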


Best,
Marcus

On 01.06.21 13:15, Diego Zuccato wrote:

Hello all.

I just found that if a user tries to specify a nodelist (say including 
2 nodes) and --nodes=1, the job gets rejected with

sbatch: error: invalid number of nodes (-N 2-1)
The expected behaviour is that slurm schedules the job on the first node 
available from the list.

I've found conflicting info about the issue. Is it version-dependent?
If so, we're currently using 18.08.5-2 (from Debian stable). Should we 
expect changes when Debian will ship a newer version? Is it possible to 
have the expected behaviour?


Tks.





Re: [slurm-users] Building SLURM with X11 support

2021-05-28 Thread Marcus Boden
I have the same in our config.log and the x11 forwarding works fine. No 
other lines around it (about some failing checks or something), just this:


[...]
configure:22134: WARNING: unable to locate rrdtool installation
configure:22176: support for ucx disabled
configure:22296: checking whether Slurm internal X11 support is enabled
configure:22311: result:
configure:22350: checking for check >= 0.9.8
[...]

Best,
Marcus


On 28.05.21 09:26, Bjørn-Helge Mevik wrote:

Thekla Loizou  writes:


Also, when compiling SLURM in the config.log I get:

configure:22291: checking whether Slurm internal X11 support is enabled
configure:22306: result:

The result is empty. I read that X11 is built by default, so I don't
expect a special flag to be given at compilation time, right?


My guess is that some X development library is missing.  Perhaps look in
the configure script for how this test was done (typically it will try
to compile something with those devel libraries, and fail).  Then see
which package contains that library, install it and try again.





Re: [slurm-users] Building SLURM with X11 support

2021-05-27 Thread Marcus Boden

Hi Thekla,

it has been built in by default for some time now. You need to activate it by 
adding

PrologFlags=X11
to your slurm.conf (see here: 
https://slurm.schedmd.com/slurm.conf.html#OPT_PrologFlags)
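
After a restart/reconfigure, users then request forwarding per step (assuming 
they logged in to the login node with ssh -X or -Y), e.g.:

   srun --x11 --pty xterm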


Best,
Marcus

On 27.05.21 14:07, Thekla Loizou wrote:

Dear all,

I am trying to use X11 forwarding in SLURM with no success.

We are installing SLURM using RPMs that we generate with the command 
"rpmbuild -ta slurm*.tar.bz2" as per the documentation.


I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? Which 
flags and packages are required?


Regards,

Thekla






Re: [slurm-users] PartitionName default

2021-04-07 Thread Marcus Boden

Hi everyone,

On 08.04.21 02:13, Christopher Samuel wrote:
I've not had issues with naming partitions in the past, though I can 
imagine `default` could cause confusion as there is a `default=yes` 
setting you can put on the one partition you want as the default choice.


more than that: the PartitionName "DEFAULT" is a special value used to set 
default values for all partitions:

Default values can be specified with a record in which PartitionName is 
"DEFAULT". The default entry values will apply only to lines following it in 
the configuration file, and the default values can be reset multiple times in 
the configuration file with multiple entries where "PartitionName=DEFAULT". 
The "PartitionName=" specification must be placed on every line describing 
the configuration of partitions. Each line where PartitionName is "DEFAULT" 
will replace or add to previous default values and not reinitialize the 
default values.


https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION
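
As a small illustration (partition and node names are made up):

   PartitionName=DEFAULT State=UP MaxTime=48:00:00 Nodes=node[01-16]
   PartitionName=short MaxTime=02:00:00 Default=YES
   PartitionName=long MaxTime=7-00:00:00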

Best,
Marcus



Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-10 Thread Marcus Boden
Yeah, I have wondered about something like that too, as it makes some of my 
scripts quite fragile. I just tried your example job name on a test system, 
and now calling squeue paints my CLI yellow :D


You could write a job_submit plugin to catch 'malicious' input, but so 
far no user ever did something like that on our system, so I don't think 
that's necessary.
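
If you do parse squeue/sacct output in scripts, a crude way to defang such 
names is to strip non-printable characters (the format string is just an 
example); this drops e.g. the ESC character, so escape sequences cannot 
manipulate the terminal:

   squeue --noheader -o '%i|%j|%T' | tr -cd '[:print:]\n'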


Best,
Marcus

On 10.03.21 12:06, Reuti wrote:



Am 09.03.2021 um 13:37 schrieb Marcus Boden :

Then I have good news for you! There is the --delimiter option:
https://slurm.schedmd.com/sacct.html#OPT_delimiter=


Aha, perfect – thx. Maybe it should be noted in the man page for the "-p"/"-P".

But this leads to another question: there is no well defined character set for 
allowed job names, and I can have a job:

$ sbatch --job-name="fu"$'\007'$'\033[32m'"bar" slurm-job.sh

which might make some noise when using `squeue` and messes up the output?

-- Reuti



Best,
Marcus

On 09.03.21 12:10, Reuti wrote:

Hi:

Am 09.03.2021 um 08:19 schrieb Bjørn-Helge Mevik :

"xiaojingh...@163.com"  writes:


I am doing a parsing job on slurm fields. Sometimes when one field is
too long, slurm will truncate it and mark the end with a “+”.


You don't say which slurm command you are trying to parse the output
from, but if it is sacctmgr, it has an option --parsable2(*)
specifically designed for parsing output, and which does not truncate
long field values.

(*) There is also --parsable, but that puts an extra "|" at the end of
the line, so I prefer --parsable2.

It would even be better to have the option to use an argument for these two options or 
even --parsable='`\000' like in `tr`. For now users can use "|" in jobname, 
account or comment which might break any parsing.
-- Reuti

--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo





Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-09 Thread Marcus Boden

Then I have good news for you! There is the --delimiter option:
https://slurm.schedmd.com/sacct.html#OPT_delimiter=
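
Combined with -P (parsable output without the trailing delimiter) you can then 
pick a separator that is unlikely to show up in job names, e.g. a tab:

   sacct -P --delimiter=$'\t' --format=JobID,JobName,Elapsed,State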

Best,
Marcus

On 09.03.21 12:10, Reuti wrote:

Hi:


Am 09.03.2021 um 08:19 schrieb Bjørn-Helge Mevik :

"xiaojingh...@163.com"  writes:


I am doing a parsing job on slurm fields. Sometimes when one field is
too long, slurm will truncate it and mark the end with a “+”.


You don't say which slurm command you are trying to parse the output
from, but if it is sacctmgr, it has an option --parsable2(*)
specifically designed for parsing output, and which does not truncate
long field values.

(*) There is also --parsable, but that puts an extra "|" at the end of
the line, so I prefer --parsable2.


It would even be better to have the option to use an argument for these two options or 
even --parsable='`\000' like in `tr`. For now users can use "|" in jobname, 
account or comment which might break any parsing.

-- Reuti



--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo








Re: [slurm-users] About sacct --format: how can I get info about the fields

2021-03-05 Thread Marcus Boden

Hi Xiaojing,

my experience here is: you will have to try it out and see what works. 
At least that's what I do whenever I parse sacct, as I did not find a 
detailed description anywhere. The manpage is quite incomplete in that 
regard.
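
Two things that help with the trial and error: sacct can list every field name 
it knows, and parsable output makes empty values easier to spot (the job id 
below is a placeholder):

   sacct --helpformat
   sacct -j 12345 --parsable2 --format=JobID,State,Start,End,MaxVMSize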


Best,
Marcus

On 05.03.21 03:02, xiaojingh...@163.com wrote:

Hello, slurm users and Brian,

Thanks a lot for your reply. The thing is, I actually know the fields. I just 
need detailed info about them. For example, you may get an “Unknown” for some 
time fields, and the MaxVMSize field is an empty string except for some job 
steps. I need to know the possible values for each field to do some 
calculation on them, but the lack of info makes it very difficult.
It seems there is no documentation on it except for the web pages of the sacct 
command.


Hello, guys,

I am doing a parsing job on the output of the sacct command and I know that the 
--format option can specify the fields you'd like to be outputted.
The difficulty I am facing is that I am in lack of info about the fields. For 
example, what are the possible values for those fields? What are the default 
values of the fields if slurm doesn't have it? Is there any detailed 
documentation about those fields?

Any help is greatly appreciated!

Best Regards,
Xiaojing






Re: [slurm-users] Raise the priority of a certain kind of jobs

2020-11-12 Thread Marcus Boden
Hi,

you could write a job_submit plugin:
https://slurm.schedmd.com/job_submit_plugins.html

The Site factor was added to priority for that exact reason.
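
For jobs that are already queued, an administrator can also adjust the priority 
directly, without any plugin (job id and value are illustrative):

   scontrol update JobId=12345 Priority=1000000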

Best,
Marcus

On 11/12/20 10:58 AM, SJTU wrote:
> Hello,
> 
> We want to raise the priority of a certain kind of slurm jobs. We considered 
> doing it in Prolog, but Prolog seems to run only at job starting time so may 
> not be useful for queued jobs. Is there any possible way to do this?
> 
> Thank you and look forward to your reply.
> 
> 
> Best,
> 
> Jianwen
> 



Re: [slurm-users] How to set association factor in Multifactor Priority

2020-09-24 Thread Marcus Boden
Hi Jianwen,

yes, you can give different accounts or users specific extra-priorities.
You can set it via sacctmgr:
https://slurm.schedmd.com/sacctmgr.html#SECTION_GENERAL-SPECIFICATIONS-FOR-ASSOCIATION-BASED-ENTITIES
(scroll down to 'Priority')
Priority
What priority will be added to a job's priority when using this
association. This is overridden if set directly on a user. Default is
the cluster's limit. To clear a previously set value use the modify
command with a new value of -1.
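
For example (account name and value are illustrative; the extra priority only 
takes effect if PriorityWeightAssoc is non-zero in slurm.conf):

   sacctmgr modify account where name=projA set Priority=100
   sacctmgr -s show assoc format=cluster,account,user,priority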

Best,
Marcus

On 9/24/20 4:35 AM, SJTU wrote:
> Hi,
> 
> I found that a new "Association Factor" is introduced in 19.05 to be part 
> of Job_priority calculation. Can I set it for each SLURM account so job 
> priority can be differentiated based on job accounts?
> 
>   https://groups.google.com/g/slurm-users/c/nzF8jOPZI_w/m/vj2wkUryBgAJ 
> 
>   https://slurm.schedmd.com/priority_multifactor.html#assoc 
> 
> 
> Association Factor
> Each association can be assigned an integer priority. The larger the 
> number, the greater the job priority will be for jobs that request this 
> association. This priority value is normalized to the highest priority of all 
> the association to become the association factor.
> 
> Job_priority =
>   site_factor +
>   (PriorityWeightAge) * (age_factor) +
>   (PriorityWeightAssoc) * (assoc_factor) +
>   (PriorityWeightFairshare) * (fair-share_factor) +
>   (PriorityWeightJobSize) * (job_size_factor) +
>   (PriorityWeightPartition) * (partition_factor) +
>   (PriorityWeightQOS) * (QOS_factor) +
>   SUM(TRES_weight_cpu * TRES_factor_cpu,
>   TRES_weight_<type> * TRES_factor_<type>,
>   ...)
>   - nice_factor
> 
> 
> Thank you!
> 
> Jianwen
> 



Re: [slurm-users] Submitting jobs with constraint option

2020-09-03 Thread Marcus Boden
Hi,

you can add those as "Features" for the nodes, see:
https://slurm.schedmd.com/slurm.conf.html#OPT_Feature
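
A rough sketch of what that looks like in slurm.conf (node names, resources and 
feature names are made up):

   NodeName=node[01-16] CPUs=48 RealMemory=190000 Feature=infiniband,xeon_gold

Users then request them at submit time with, e.g.:

   sbatch --constraint=infiniband job.sh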

Best,
Marcus

On 9/3/20 2:52 PM, Gestió Servidors wrote:
> Hello,
> 
> I would like to apply some constraint options to my nodes. For example, 
> infiniband available, processor model, etc., but I don't know where I need to 
> detail that information. I know that in "sbatch" I can request for that 
> details with "--constraint=" but I suppose that I need to define them before 
> in slurm.conf or another configuration file.
> 
> Thanks.
> 



Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Marcus Boden
Hi Navin,

try running slurmd in the foreground with increased verbosity:
slurmd -D -v (add as many v as you deem necessary)

Hopefully it'll tell you more about why it times out.

Best,
Marcus


On 6/11/20 2:24 PM, navin srivastava wrote:
> Hi Team,
> 
> when i am trying to start the slurmd process i am getting the below error.
> 
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
> daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
> operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm
> node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit
> entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed
> with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session):
> session opened for user root by (uid=0)
> 
> Slurm version is 17.11.8
> 
> The server and slurm is running from long time and we have not made any
> changes but today when i am starting it is giving this error message.
> Any idea what could be wrong here.
> 
> Regards
> Navin.
> 



Re: [slurm-users] MaxJobs not working

2020-05-18 Thread Marcus Boden
Hi,

> Some minutes ago, I have applied "MaxJobs=3" for an user. After that, if I 
> ran "sacctmgr -s show user MYUSER format=account,user,maxjobs", system showed 
> a "3" at the maxjobs column. However, now, I have run a "squeue" and I'm 
> seeing 4 jobs (from that user) in "running" state... Shouldn't it be just 3 
> and not 4 in "running" state???

were the 4 jobs running beforehand? Slurm wouldn't cancel the jobs if
they were already running, but just prevent new jobs from starting.

> I have applied "maxjobs" in accounting only for a user (not account), so the 
> others users in the same account have "infinite" maxjobs, but a user have 3 
> (the number I have configured). If I run "sacctmgr -s show user MYUSER 
> format=user,maxjobs" I can see that 3 but how could I run "sacctmgr" to show 
> maxjobs limit for all users? I have test with "sacctmgr -s show account 
> MYACCOUNT format=user,account,maxjobs" and it works, but I have configured 
> several accounts, so I would like to show all accounts with only one command 
> execution "sacctmgr -s show".

Try:
sacctmgr -s show assoc format=user,account,maxjobs

Best,
Marcus



Re: [slurm-users] sacct returns nothing after reboot

2020-05-13 Thread Marcus Boden
Hi,

the default time window starts at 00:00:00 of the current day:
-S, --starttime
  Select jobs in any state after the specified  time.  Default
  is  00:00:00  of  the  current  day, unless the '-s' or '-j'
  options are used. If the  '-s'  option  is  used,  then  the
  default  is  'now'. If states are given with the '-s' option
  then only jobs in this state at this time will be  returned.
  If  the  '-j'  option is used, then the default time is Unix
  Epoch 0. See the DEFAULT TIME WINDOW for more details.
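
So after a reboot (or whenever you want to look further back) you have to widen 
the window explicitly, e.g. (dates are illustrative):

   sacct --allusers --starttime=2020-05-01 --endtime=now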

Best,
Marcus


On 5/12/20 2:08 PM, Roger Mason wrote:
> Hello,
> 
> Yesterday I instituted job accounting via mysql on my (FreeBSD 11.3)
> test cluster.  The cluster consists of a machine running
> slurmctld+slurmdbd and two running slurmd (slurm version 20.02.1).
> After experiencing a slurmdbd core dump when using mysql-5.7.30
> (reported on this list on May 5) I installed 5.7.28 instead.
> 
> Before yesterday I had no accounting of any kind.  I had observed the
> behaviour that the job id's always restarted at 2 after a reboot.  After
> installing mysql and setting it up I ran a few test jobs and verified
> that sacct listed them: all seemed well.
> 
> This morning upon re-booting the machine running slurmctld+slurmdbd
> sacct returns nothing:
> 
> rmason sacct --allusers
>JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode
>  -- -- -- -- -- 
> 
> so it seems that yesterday's jobs have been forgotten.
> 
> When I connect to mysql as the user owning the databases it seems there
> is information present.  For example,
> 
> select * from imacbeastie_job_table;
> 
> returns information about the test jobs I ran yesterday.
> 
> As a further test I just ran another test job:
> 
> squeue
> JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
>  2  imactest   rmason  R   0:03  1 patchperthite
> 
> I notice that the jobid starts at 2 (I ran 5 or 6 test jobs yesterday).
> 
> sacct now returns information:
> sacct --allusers
>JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode
>  -- -- -- -- -- 
> 2  test   imac 2  COMPLETED  0:0
> 2.batch   batch2  COMPLETED  0:0
> 2.0hostname1  COMPLETED  0:0
> 2.1   sleep1  COMPLETED  0:0
> 
> but only for the test job I ran today.
> 
> I appreciate any help in getting accounting to work properly.
> 
> Thanks,
> Roger
> 



Re: [slurm-users] Question about SacctMgr....

2020-02-28 Thread Marcus Boden
Hi,

you're looking for 'associations' between users, accounts and their limits.
Try `sacctmgr show assoc [tree]`

Best,
Marcus

On 20-02-28 09:38, Matthias Krawutschke wrote:
> Dear Slurm-User,
> 
>  
> 
> I have a simple question about User and Account – Management on SLURM.
> 
>  
> 
> How can I find/print out which user is associated with which account?
> 
>  
> 
> I can list accounts and users, but not in combination. I have not found this in
> the documentation.
> 
>  
> 
> Best regards….
> 
>  
> 
>  
> 
>  
> 
> Matthias Krawutschke, Dipl. Inf.
> 
>  
> 
> Universität Potsdam
> ZIM - Zentrum für Informationstechnologie und Medienmanagement
> 
> Team Infrastruktur Server und Storage 
> Arbeitsbereich: High-Performance-Computing on Cluster - Environment
> 
>  
> 
> Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
> Tel: +49 331 977-, Fax: +49 331 977-1750
> 
>  
> 
> Internet:  
> https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html
> 
>  
> 
>  
> 



[slurm-users] Inconsistent cpu bindings with cpu-bind=none

2020-02-17 Thread Marcus Boden
Hi everyone,

I am facing a bit of a weird issue with CPU bindings and mpirun:
My jobscript:
#SBATCH -N 20
#SBATCH --tasks-per-node=40
#SBATCH -p medium40
#SBATCH -t 30 
#SBATCH -o out/%J.out
#SBATCH -e out/%J.err
#SBATCH --reservation=root_98

module load impi/2019.4 2>&1

export I_MPI_DEBUG=6
export SLURM_CPU_BIND=none

. 
/sw/comm/impi/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpivars.sh
 realease
BENCH=/sw/comm/impi/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1

mpirun -np 800 $BENCH -npmin 800 -iter 50 -time 120 -msglog 16:18 -include 
Allreduce Bcast Barrier Exchange Gather PingPing PingPong Reduce Scatter 
Allgather Alltoall Reduce_scatter

My output is as follows:
[...]
[0] MPI startup(): 37  154426   gcn1311{37,77}
[0] MPI startup(): 38  154427   gcn1311{38,78}
[0] MPI startup(): 39  154428   gcn1311{39,79}
[0] MPI startup(): 40  161061   gcn1312{0}
[0] MPI startup(): 41  161062   gcn1312{40}
[0] MPI startup(): 42  161063   gcn1312{0}
[0] MPI startup(): 43  161064   gcn1312{40}
[0] MPI startup(): 44  161065   gcn1312{0}
[...]

On 8 out of 20 nodes I got the wrong pinning. In the slurmd logs I found
that on nodes, where the pinning was correct, manual binding was
communicated correctly:
  lllp_distribution jobid [2065227] manual binding: none
On those, where it did not work, not so much:
  lllp_distribution jobid [2065227] default auto binding: cores, dist 1

So, for some reason, slurm told some task to use CPU bindings and for
some, the cpu binding was (correctly) disabled.

Any ideas what could cause this?

Best,
Marcus


Re: [slurm-users] slurmstepd: error: _is_a_lwp

2020-02-04 Thread Marcus Boden
We had this issue recently. Some googling led me to the NERSC FAQs,
which state:
 > _is_a_lwp is a function called internally for Slurm job accounting. The 
 > message indicates a rare error situation with a function call. But the error 
 > shouldn't affect anything in the user job. Please ignore the message.

After looking into our logfiles, it seems that this error appears more
or less at random, but does not cause any jobs to fail (all errors I got
were for jobs that worked perfectly fine).
In your case, the job got cancelled less than a minute after that message.

Although it is curious that it does seem to happen to only one user in
your case.

Best,
Marcus

On 20-02-04 20:50, Luis Huang wrote:
> We have a user that keeps encountering this error with one type of her jobs. 
> Sometimes her jobs will cancel and other times it will run fine.
> 
> slurmstepd: error: _is_a_lwp: open() /proc/195420/status failed: No such file 
> or directory
> slurmstepd: error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 
> 2020-01-23T14:11:36 ***
> 
> [root@pe2dc5-0007 ~]# grep 17534  /var/log/slurmd.log
> [2020-01-23T14:10:12.789] task_p_slurmd_batch_request: 17534
> [2020-01-23T14:10:12.789] task/affinity: job 17534 CPU input mask for node: 
> 0x03
> [2020-01-23T14:10:12.789] task/affinity: job 17534 CPU final HW mask for 
> node: 0x020020
> [2020-01-23T14:10:12.790] _run_prolog: prolog with lock for job 17534 ran for 
> 0 seconds
> [2020-01-23T14:10:12.875] Launching batch job 17534 for UID 50321
> [2020-01-23T14:10:16.937] [17534.batch] task_p_pre_launch: Using 
> sched_affinity for tasks
> [2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() 
> /proc/195420/status failed: No such file or directory
> [2020-01-23T14:11:36.386] [17534.batch] error: *** JOB 17534 ON pe2dc5-0007 
> CANCELLED AT 2020-01-23T14:11:36 ***
> [2020-01-23T14:11:37.394] [17534.batch] sending 
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
> [2020-01-23T14:11:37.396] [17534.batch] done with job
> 
> I'm also seeing lots of spam in the slurmd.logs on the compute nodes 
> themselves whenever this users jobs lands on them.
> 
> [2020-02-04T15:29:11.073] [43816.batch] error: _is_a_lwp: 1 read() attempts 
> on /proc/234796/status failed: No such process
> [2020-02-04T15:37:24.238] [43682.batch] error: _is_a_lwp: open() 
> /proc/74338/status failed: No such file or directory
> [2020-02-04T15:40:42.064] [43916.batch] error: _is_a_lwp: open() 
> /proc/87034/status failed: No such file or directory
> [2020-02-04T15:41:11.304] [43840.batch] error: _is_a_lwp: open() 
> /proc/151191/status failed: No such file or directory
> 
> Has anyone seen this issue before?
> 
> Regards,
> 
> 
> Luis Huang | Systems Administrator II, Research Computing
> New York Genome Center
> 101 Avenue of the Americas
> New York, NY 10013
> O: (646) 977-7291
> lhu...@nygenome.org
> 
> 
> 
> 
> 
> 
> This message is for the recipient’s use only, and may contain confidential, 
> privileged or protected information. Any unauthorized use or dissemination of 
> this communication is prohibited. If you received this message in error, 
> please immediately notify the sender and destroy all copies of this message. 
> The recipient should check this email and any attachments for the presence of 
> viruses, as we accept no liability for any damage caused by any virus 
> transmitted by this email.



Re: [slurm-users] Upgrade or /-date to Release 20.02p1 .....

2020-02-04 Thread Marcus Boden
HI,

to your first question: I don't know the exact reason, but SchedMD made
it pretty clear that there is a specific sequence for updates:
slurmdbd -> slurmctld -> slurmd -> commands
See https://slurm.schedmd.com/SLUG19/Field_Notes_3.pdf (or any of the
other field notes) for details.
So, I'd advise that you don't try it any other way.

for your second question, see https://slurm.schedmd.com/slurm.conf.html:
> MpiParams
>MPI parameters. Used to identify ports used by older versions of
>OpenMPI and native Cray systems. The input format is
>"ports=12000-12999" to identify a range of communication ports to
>be used. NOTE: This is not needed for modern versions of OpenMPI,
>taking it out can cause a small boost in scheduling performance.
>NOTE: This is required for Cray's PMI. 

Best,
Marcus

On 20-02-04 11:44, Matthias Krawutschke wrote:
> Hello together,
> 
>  
> 
> on the RELEASE_NOTES I read the following:
> 
> 
> 
> 
> Slurm can be upgraded from version 18.08 or 19.05 to version 20.02 without
> loss
> 
> of jobs or other state information. Upgrading directly from an earlier
> version
> 
> of Slurm will result in loss of state information.
> 
> 
> 
> 
>  
> 
> I want to upgrade or update my compute node first, to see whether it works or
> not, but this does not seem to be possible.
> 
> I get the message on the compute node that it cannot connect to slurmctld.
> 
> So why can't I test the new Slurm version on a node before I update/upgrade
> the controller and database?
> 
> Is it really important that I upgrade/update the database & controller
> first?
> 
>  
> 
> The other point (see my eMail from today) – I read on this file too:
> 
> 
> -
> 
> CONFIGURATION FILE CHANGES (see man appropriate man page for details)
> 
> =
> 
> -- The mpi/openmpi plugin has been removed as it does nothing.
> 
> MpiDefault=openmpi will be translated to the functionally-equivalent
> 
> MpiDefault=none.
> 
> So, how do OpenMPI and SLURM work together?
> 
> What was the reason for disabling this function?
> 
>  
> 
>  
> 
> I hope, you can help me here with my questions…….
> 
>  
> 
> Best regards….
> 
>  
> 
>  
> 
>  
> 
> Matthias Krawutschke, Dipl. Inf.
> 
>  
> 
> Universität Potsdam
> ZIM - Zentrum für Informationstechnologie und Medienmanagement
> Team High-Performance-Computing
> 
> Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
> Tel: +49 331 977-, Fax: +49 331 977-1750
> 
>  
> 
> Internet:  
> https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html
> 
>  
> 
>  
> 



Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Marcus Boden
Hi Jürgen,

you're looking for KillOnBadExit in the slurm.conf:
KillOnBadExit
If set to 1, a step will be terminated immediately if any task is crashed 
or aborted, as indicated by a non-zero exit code. With the default value of 0, 
if one of the processes is crashed or aborted the other processes will continue 
to run while the crashed or aborted process waits. The user can override this 
configuration parameter by using srun's -K, --kill-on-bad-exit.

this should terminate the job if a step or a process gets oom-killed.
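
i.e. in slurm.conf (or per step via srun's -K/--kill-on-bad-exit):

   KillOnBadExit=1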

Best,
Marcus

On 19-10-08 10:36, Juergen Salk wrote:
> * Bjørn-Helge Mevik  [191008 08:34]:
> > Jean-mathieu CHANTREIN  writes:
> > 
> > > I tried using, in slurm.conf 
> > > TaskPlugin=task/affinity, task/cgroup 
> > > SelectTypeParameters=CR_CPU_Memory 
> > > MemLimitEnforce=yes 
> > >
> > > and in cgroup.conf: 
> > > CgroupAutomount=yes 
> > > ConstrainCores=yes 
> > > ConstrainRAMSpace=yes 
> > > ConstrainSwapSpace=yes 
> > > MaxSwapPercent=10 
> > > TaskAffinity=no 
> > 
> > We have a very similar setup, the biggest difference being that we have
> > MemLimitEnforce=no, and leave the killing to the kernel's cgroup.  For
> > us, jobs are killed as they should. [...] 
> 
> Hello Bjørn-Helge,
> 
> that is interesting. We have a very similar setup as well. However, in
> our Slurm test cluster I have noticed that it is not the *job* that
> gets killed. Instead, the OOM killer terminates one (or more)
> *processes* but keeps the job itself running in a potentially 
> unhealthy state.
> 
> Is there a way to tell Slurm to terminate the whole job as soon as 
> the first OOM kill event takes place during execution? 
> 
> Best regards
> Jürgen
> 
> -- 
> Jürgen Salk
> Scientific Software & Compute Services (SSCS)
> Kommunikations- und Informationszentrum (kiz)
> Universität Ulm
> Telefon: +49 (0)731 50-22478
> Telefax: +49 (0)731 50-22471
> 



[slurm-users] Monitoring with Telegraf

2019-09-26 Thread Marcus Boden
Hey everyone,

I am using Telegraf and InfluxDB to monitor our hardware and I'd like to
include some slurm metrics into this. Is there already a telegraf plugin
for monitoring slurm I don't know about, or do I have to start from
scratch?

Best,
Marcus


Re: [slurm-users] slurm node weights

2019-09-05 Thread Marcus Boden
Hello Doug,

to quote the slurm.conf page:
It would be preferable to allocate smaller memory nodes rather than
larger memory nodes if either will satisfy a job's requirements.

So I guess the idea is that if a smaller node satisfies all
requirements, why 'waste' a bigger one on it? It makes sense for
memory, though I agree that it is counterintuitive for processors.
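
In your case that would mean something along these lines in slurm.conf (node 
names are made up, other node parameters omitted); jobs then land on the 
Weight=1 Epyc nodes whenever they fit:

   NodeName=old[01-10]  ... Weight=100
   NodeName=epyc[01-04] ... Weight=1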

Best,
Marcus

On 19-09-05 15:48, Douglas Duckworth wrote:
> Hello
> 
> We added some newer Epyc nodes, with NVMe scratch, to our cluster and so want 
> jobs to run on these over others.  So we added "Weight=100" to the older 
> nodes and left the new ones blank.  So indeed, ceteris paribus, srun reveals 
> that the faster nodes will accept jobs over older ones.
> 
> We have the desired outcome though I am a bit confused by two statements in 
> the manpage that seem to be 
> contradictory:
> 
> "All things being equal, jobs will be allocated the nodes with the lowest 
> weight which satisfies their requirements."
> 
> "...larger weights should be assigned to nodes with more processors, memory, 
> disk space, higher processor speed, etc."
> 
> 100 is larger than 1 and we do see jobs preferring the new nodes which have 
> the default weight of 1.  Yet we're also told to assign larger weights to 
> faster nodes?
> 
> Thanks!
> Doug
> 
> 
> --
> Thanks,
> 
> Douglas Duckworth, MSc, LFCS
> HPC System Administrator
> Scientific Computing Unit
> Weill Cornell Medicine"
> E: d...@med.cornell.edu
> O: 212-746-6305
> F: 212-746-8690



Re: [slurm-users] Job error when using --job-name=`basename $PWD`

2019-07-29 Thread Marcus Boden
Hi Fabio,

are you sure that command substitution works in the #SBATCH part of the
job script? I don't think that Slurm actually evaluates that, though I
might be wrong.

It seems like the #SBATCH lines after the --job-name line are not evaluated
anymore; therefore you can't start srun with two tasks (since Slurm only
allocates one).
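
A workaround that should behave as intended is to let the login shell do the
substitution by passing the name on the command line instead of in the script:

   sbatch --job-name="$(basename "$PWD")" jobscript.sh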

Best regards,
Marcus

On 19-07-29 05:51, Verzelloni  Fabio wrote:
> Hi Everyone, 
> I'm experiencing a weird issue, when submitting a job like that:
> -
> #!/bin/bash
> #SBATCH --job-name=`basename $PWD`
> #SBATCH --ntasks=2
> srun -n 2 hostname
> -
> Output:
> srun: error: Unable to create step for job 15387: More processors requested 
> than permitted
> 
> If I submit a job like that:
> -
> #!/bin/bash
> #SBATCH --job-name=myjob
> #SBATCH --ntasks=2
> srun -n 2 hostname
> -
> Output:
> Mynode-001
> Mynode-001
> 
> If I decrease the number of task it works fine:
> -
> #!/bin/bash
> #SBATCH --job-name=`basename $PWD`
> #SBATCH --ntasks=1
> srun -n 1 hostname
> -
> Output:
> Mynode-001
> 
> The slurm version is 18.08.8, is that a bug in slurm?
> 
> Thanks
> Fabio
> 
> --
> - Fabio Verzelloni - CSCS - Swiss National Supercomputing Centre
> via Trevano 131 - 6900 Lugano, Switzerland
> Tel: +41 (0)91 610 82 04
>  
> 



Re: [slurm-users] Hints, Cheatsheets, etc

2019-07-09 Thread Marcus Boden
> 
> Yeah, on our systems, I get:
>   Sorry, gawk version 4.0 or later is required.  Your version is: GNU Awk 
> 3.1.7
> (RHEL 6). So this one wasn't as useful for me. But thanks anyway!

Just an FYI: Building gawk locally is pretty easy (a simple configure,
make, make install), so that might be a solution if you want to try the
tool.
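
Roughly like this, without root (version number and prefix are illustrative):

   tar xf gawk-4.2.1.tar.gz && cd gawk-4.2.1
   ./configure --prefix=$HOME/opt/gawk && make && make install
   export PATH=$HOME/opt/gawk/bin:$PATH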

Best,
Marcus




Re: [slurm-users] gpu count

2019-06-27 Thread Marcus Boden
Hi,

this is usually due to a misconfiguration in your gres.conf (at least it
was for me). Can you show your gres.conf?
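
For comparison, a gres.conf matching gres=gpus:2 on a two-GPU node would look 
roughly like this (device paths are illustrative, and the GRES name has to 
match the one used in slurm.conf):

   Name=gpus File=/dev/nvidia0
   Name=gpus File=/dev/nvidia1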

Best,
Marcus

On 19-06-27 15:33, Valerio Bellizzomi wrote:
> hello, my node has 2 gpus so I have specified gres=gpus:2 but the
> scontrol show node displays this:
> 
> State=IDLE+DRAIN
> Reason=gres/gpus count too low (1 < 2)
> 
> 
> 
> 
> 
