[slurm-dev] srun can't use variables in a batch script after upgrade

2017-07-07 Thread Dennis Tants
Hi list,

I am a little bit lost right now and would appreciate your help.
We have a small cluster of 16 nodes running SLURM, and it is doing
everything we want, except for a few little things I would like to
improve.

That is why I wanted to upgrade our old SLURM 15.X (I don't know the
exact version) to 17.02.4 on my test machine.
I deleted the old version completely with 'yum erase slurm-*'
(CentOS 7, by the way) and built the new version with rpmbuild.
Everything went fine, so I started configuring a new slurm[dbd].conf.
This time I also wanted to use backfill scheduling instead of FIFO
and enable accounting (just to know which users consume the most
resources), as sketched below. Since we had no existing database to
migrate, starting slurmdbd and slurmctld went without problems.
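
The kind of settings I mean are roughly the following (only a sketch,
not a verbatim copy of my config; the host name is just an example):

    # Backfill scheduling instead of FIFO
    SchedulerType=sched/backfill

    # Accounting through slurmdbd
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=headnode
    JobAcctGatherType=jobacct_gather/linux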

Everything seemed fine with a simple MPI hello world test on one and
two nodes.
Now I wanted to enhance the script a bit and have it work in the
node-local directory /work.
To get everything up and running I used the script attached below
(it also includes the output from running it).
It should simply copy all data to /work/tants/$SLURM_JOB_NAME before
running the MPI hello world.
But it seems that srun does not know $SLURM_JOB_NAME, even though the
variable is set.
/work/tants belongs to the correct user and has rwx permissions.
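
To illustrate, a minimal check of where the variable gets lost would
look something like this (only a sketch, not part of the attached
script):

    #!/bin/bash
    #SBATCH -J varcheck
    #SBATCH -N 2 -n 2
    #SBATCH -o varcheck-%j.out

    # Print the variable once in the batch shell and once inside each
    # task started by srun, to see on which side it is missing.
    echo "batch shell sees: SLURM_JOB_NAME=$SLURM_JOB_NAME"
    srun -N2 -n2 bash -c 'echo "$(hostname): SLURM_JOB_NAME=$SLURM_JOB_NAME"'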

So did I misconfigure something, or what happened here? Nearly the
same example works on our cluster with 15.X. The script is only for
testing purposes, which is why there are so many echo commands in it.
If you see any mistake or can recommend a better configuration, I
would gladly hear about it. Should you need any more information, I
will provide it.
Thank you for your time!

Best regards,
Dennis

-- 
Dennis Tants
Trainee: IT Specialist for System Integration (Fachinformatiker für Systemintegration)

ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity

Universität Bremen
Am Fallturm
28359 Bremen, Germany

Telefon: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de

www.zarm.uni-bremen.de

#!/bin/bash
#SBATCH -J mpicopytest
#SBATCH -t 01:00:00
#SBATCH -N 2 -n 8
#SBATCH -o copy-test001-%j.out

# Preparation
echo "Script start"
echo "Job name: $SLURM_JOB_NAME with ID: $SLURM_JOB_ID"
echo "Allocated nodes: $SLURM_JOB_NODELIST"
echo "Amount of tasks to run: $SLURM_NTASKS"
echo "Changing to submit dir: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR
echo "Preparing data..."
echo "Copying files to /work on the nodes"
srun -N2 -n2 mkdir /work/tants/$SLURM_JOB_NAME
srun -N2 -n2 cp -r * /work/tants/$SLURM_JOB_NAME
echo "Finished prepation"

# Start of the actual work
echo "Starting computation"
module load OpenMPI/1.10.2
cd /work/tants/$SLURM_JOB_NAME
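# Run the same binary twice for comparison: once via OpenMPI's
# mpirun, once via srun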
time mpirun -np $SLURM_NTASKS ./mpi-hello-world
time srun -N2 -n8 ./mpi-hello-world
echo "Ending computation"

# Cleanup
echo "Starting cleanup process, copying back to the submit dir and deleting files in /work"
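# Again one task per node: each node copies its local results back
# to the shared submit dir and then removes its /work directory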
srun -N2 -n2 cp -r * $SLURM_SUBMIT_DIR
srun -N2 -n2 rm -rf /work/tants/$SLURM_JOB_NAME
echo "Script end"


# Actual output of the output file I specified (copy-test001-%j.out)
Script start
Job name: mpicopytest with ID: 54
Allocated nodes: node[1-2]
Amount of tasks to run: 8
Changing to submit dir: /home/tants/pbs
Preparing data...
Copying files to /work on the nodes
Script end
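
For reference, a more defensive version of the copy steps, which at
least reports which step fails, would look roughly like this (a
sketch, untested here):

    # Resolve the target path once in the batch shell, then quote it
    DEST=/work/tants/${SLURM_JOB_NAME}
    echo "Destination resolves to: ${DEST}"
    srun -N2 -n2 mkdir -p "${DEST}" || echo "mkdir step failed: $?"
    srun -N2 -n2 cp -r "${SLURM_SUBMIT_DIR}/." "${DEST}" || echo "cp step failed: $?"
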
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=headnode
ControlAddr=192.168.80.254
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
EnforcePartLimits=YES
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
JobFileAppend=1
JobRequeue=0
#JobSubmitPlugins=1
#KillOnBadExit=0
LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/mail
MaxJobCount=5000
MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
TopologyPlugin=topology/none
TmpFS=/tmp
#TrackWCKey=no
TreeWidth=8
#UnkillableStepProgram=
UsePAM=0
#
#
# TIMERS
BatchStartTimeout=10
CompleteWait=8
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=6

[slurm-dev] Re: Length of possible SlurmDBD without HA

2017-07-07 Thread Adam Huffman

While I can't tell you how much memory is required, I can say that
I've seen warnings in the logs when this has happened, including a
more urgent warning that the cache is about to fill up.
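
One thing that has helped here is keeping an eye on the controller's
backlog while the database is down, roughly along these lines (just a
sketch; the exact sdiag field name may differ between versions):

    # Backlog of records waiting for slurmdbd, plus slurmctld memory (RSS, kB)
    watch -n 60 'sdiag | grep -i "dbd agent"; ps -o rss= -C slurmctld'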

Cheers,
Adam

On Thu, Jul 6, 2017 at 10:54 AM, Loris Bennett
 wrote:
>
> Hi,
>
> On the Slurm FAQ page
>
>   https://slurm.schedmd.com/faq.html
>
> it says the following:
>
>   52. How critical is configuring high availability for my database?
>
>   Consider if you really need mysql failover. Short outage of
>   slurmdbd is not a problem, because slurmctld will store all data
>   in memory and send it to slurmdbd when it's back operating. The
>   slurmctld daemon will also cache all user limits and fair share
>   information.
>
> I was wondering how long a "short outage" can be.  Presumably this is
> determined by the amount of free memory on the server running slurmctld,
> the number of jobs, and the amount of memory required per job.
>
> So roughly how much memory will be required per job?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de