Re: [slurm-users] Providing users with info on wait time vs. run time

Paul Edmon Fri, 16 Sep 2022 06:46:28 -0700

We also call scontrol in our scripts (as little as we can manage) and we run at the scale of 1500 nodes. It hasn't really caused many issues, but we try to limit it as much as we possibly can.


-Paul Edmon-


On 9/16/22 9:41 AM, Sebastian Potthoff wrote:

Hi Hermann,
So you both are happily(?) ignoring this warning in the "Prolog and Epilog Guide",
right? :-)
"Prolog and Epilog scripts [...] should not call Slurm commands (e.g. squeue,
scontrol, sacctmgr, etc)."
We have probably been doing this since before the warning was added to
the documentation.  So we are "ignorantly ignoring" the advice :-/
Same here :) But if $SLURM_JOB_STDOUT is not defined as documented … what can you do.
May I ask how big your clusters are (number of nodes) and how heavily they are
used (submitted jobs per hour)?
We have around 500 nodes (mostly 2x18 cores). The number of jobs ending (i.e. calling the epilog script) varies quite a lot, between 1000 and 15k a day, so somewhere between 40 and 625 jobs/hour. During those peaks Slurm can become noticeably slower, but usually it runs fine.
Sebastian
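
The jobs/hour range quoted above is just the daily totals divided by 24 (1000/24 is roughly 40, 15000/24 is 625). For sites that want to measure the rate rather than estimate it, here is a minimal sketch that buckets job end times by hour. The function name jobs_per_hour is made up for illustration, and the sacct call in the comment assumes accounting is enabled; note that sacct's -S/-E window may also select jobs that merely overlapped the window, so counts near the edges should be treated with care.

```shell
# Count finished jobs per hour from end timestamps read on stdin.
# Expected input: one ISO timestamp per line, e.g. 2022-09-16T13:46:28
# (the format sacct prints for its End field). Lines that do not start
# with a date, such as "Unknown" for jobs still running, are skipped.
jobs_per_hour() {
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}' \
        | cut -c1-13 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Hypothetical usage on a host with Slurm accounting (dates are examples):
#   sacct -a -X -n -P -S 2022-09-15 -E 2022-09-16 -o End | jobs_per_hour
```

Each output line is an hour prefix (date plus "T" plus hour) followed by the number of jobs that ended in that hour.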
On 16.09.2022 at 15:15, Loris Bennett <loris.benn...@fu-berlin.de> wrote:
Hi Hermann,

Hermann Schwärzler <hermann.schwaerz...@uibk.ac.at> writes:
Hi Loris,
hi Sebastian,

thanks for the information on how you are doing this.
So you both are happily(?) ignoring this warning in the "Prolog and Epilog Guide",
right? :-)
"Prolog and Epilog scripts [...] should not call Slurm commands (e.g. squeue,
scontrol, sacctmgr, etc)."
We have probably been doing this since before the warning was added to
the documentation.  So we are "ignorantly ignoring" the advice :-/
May I ask how big your clusters are (number of nodes) and how heavily they are
used (submitted jobs per hour)?
We have around 190 32-core nodes.  I don't know how I would easily find
out the average number of jobs per hour.  The only problems we have had
with submission have been when people have written their own mechanisms
for submitting thousands of jobs.  Once we get them to use job arrays,
such problems generally disappear.

Cheers,

Loris
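
For readers unfamiliar with job arrays, here is a minimal sketch of a batch script that replaces a hand-rolled submission loop with a single array job. The array range, the file name inputs.txt and the helper input_for_task are assumptions made for illustration; only --array, --output with %A/%a, and SLURM_ARRAY_TASK_ID are standard Slurm features.

```shell
#!/bin/bash
#SBATCH --array=1-1000            # one array task per input; range is illustrative
#SBATCH --output=slurm-%A_%a.out  # %A = array job ID, %a = array task index

# Print line number $2 of the list file $1 (one input path per line).
input_for_task() {
    sed -n "${2}p" "$1"
}

# inputs.txt is a hypothetical list of input files, one per line.
# Slurm sets SLURM_ARRAY_TASK_ID separately for every array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f inputs.txt ]; then
    echo "would process: $(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")"
fi
```

Submitted once with sbatch, this results in a single submission request instead of a thousand, which is presumably why the submission problems described above tend to go away.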
Regards,
Hermann

On 9/16/22 9:09 AM, Loris Bennett wrote:
Hi Hermann,
Sebastian Potthoff <s.potth...@uni-muenster.de> writes:
Hi Hermann,
I happened to read along this conversation and was just solving this issue today. I added this part to the epilog script to make it work:
# Add job report to stdout
StdOut=$(/usr/bin/scontrol show job=$SLURM_JOB_ID | /usr/bin/grep StdOut | /usr/bin/xargs | /usr/bin/awk 'BEGIN { FS = "=" } ; { print $2 }')
NODELIST=($(/usr/bin/scontrol show hostnames))

# Only add to StdOut file if it exists and if we are the first node
if [ "$(/usr/bin/hostname -s)" = "${NODELIST[0]}" -a ! -z "${StdOut}" ]
then
  echo "################################# JOB REPORT ##################################" >> $StdOut
  /usr/bin/seff $SLURM_JOB_ID >> $StdOut
  echo "###############################################################################" >> $StdOut
fi
We do something similar.  At the end of our script pointed to by
EpilogSlurmctld we have
  OUT=`scontrol show jobid ${job_id} | awk -F= '/ StdOut/{print $2}'`
  if [ ! -f "$OUT" ]; then
    exit
  fi
  printf "\n== Epilog Slurmctld ==================================================\n\n" >> ${OUT}
  seff ${SLURM_JOB_ID} >> ${OUT}
  printf "\n======================================================================\n" >> ${OUT}
  chown ${user} ${OUT}
Cheers,
Loris
Contrary to what it says in the Slurm docs https://slurm.schedmd.com/prolog_epilog.html I was not able to use the env var SLURM_JOB_STDOUT, so I had to fetch it via scontrol. In addition I had to make sure it is only called by the "leading" node, as the epilog script will be called by ALL nodes of a multinode job and they would all call seff and clutter up the output. Last thing was to check if StdOut is not of length zero (i.e. it exists). Interactive jobs would otherwise cause the node to drain.
Maybe this helps.
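
The two guards described above (look up the StdOut path, and act only on the first node) can be sketched as small helper functions. The names stdout_path_from_scontrol and is_first_node are made up for illustration and are not part of Slurm; the parsing assumes the StdOut=<path> line format that scontrol show job prints.

```shell
# Print the StdOut path from `scontrol show job` style text on stdin.
# Stripping only the leading "StdOut=" (instead of splitting on "=")
# keeps paths intact even if they themselves contain "=".
stdout_path_from_scontrol() {
    awk '/^[ \t]*StdOut=/ { sub(/^[ \t]*StdOut=/, ""); print; exit }'
}

# Succeed only if host $1 equals the first entry of the host list $2...
is_first_node() {
    [ "$#" -ge 2 ] && [ "$1" = "$2" ]
}

# Hypothetical use inside an epilog script:
#   out=$(scontrol show job "$SLURM_JOB_ID" | stdout_path_from_scontrol)
#   if [ -n "$out" ] && is_first_node "$(hostname -s)" $(scontrol show hostnames); then
#       seff "$SLURM_JOB_ID" >> "$out"
#   fi
```

An empty result from stdout_path_from_scontrol covers the case where no StdOut line is present, so the -n test skips the report instead of redirecting to an empty file name.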

Kind regards
Sebastian
PS: goslmailer looks quite nice with its recommendations! Will definitely look into it.
--
Westfälische Wilhelms-Universität (WWU) Münster
WWU IT
Sebastian Potthoff (eScience / HPC)
On 15.09.2022 at 18:07, Hermann Schwärzler <hermann.schwaerz...@uibk.ac.at> wrote:
 Hi Ole,

 On 9/15/22 5:21 PM, Ole Holm Nielsen wrote:

 On 15-09-2022 16:08, Hermann Schwärzler wrote:
Just out of curiosity: how do you insert the output of seff into the out-file of a job?
Use the "smail" tool from the slurm-contribs RPM and set this in slurm.conf:
 MailProg=/usr/bin/smail
Maybe I am missing something but from what I can tell smail sends an email and does *not* change or append to the .out file of a job...
 Regards,
 Hermann
--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         email loris.benn...@fu-berlin.de
