Hi list,

I am a bit lost right now and would appreciate your help.
We have a small cluster with 16 nodes running SLURM, and it is doing
everything we want, except for a few little things I would like to
improve.

That is why I wanted to upgrade our old SLURM 15.x (I don't know the
exact version) to 17.02.4 on my test machine.
I deleted the old version completely with 'yum erase slurm-*'
(CentOS 7, by the way) and built the new version with rpmbuild.
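For reference, this is roughly what I ran; the tarball name and install
path are from memory, so treat them as approximate:

# remove the old packages completely (CentOS 7)
yum erase 'slurm-*'
# build the 17.02.4 RPMs from the release tarball and install them
rpmbuild -ta slurm-17.02.4.tar.bz2
yum localinstall ~/rpmbuild/RPMS/x86_64/slurm-*.rpm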
Everything went fine, so I started configuring a new slurm.conf and
slurmdbd.conf. This time I also wanted to use backfill instead of FIFO,
and to enable accounting (just to know which person uses the most
resources). Since there was no existing database to migrate, I could
simply start slurmdbd and slurmctld without problems.
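In case it matters, the accounting side was then set up with sacctmgr,
roughly like this (reproduced from memory; the account name "zarm" is
just an example):

# register the cluster (ClusterName=cluster in slurm.conf),
# plus one account and one user for testing
sacctmgr add cluster cluster
sacctmgr add account zarm Description="ZARM users" Organization=zarm
sacctmgr add user tants Account=zarm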

Everything seemed fine with a simple MPI hello-world test on one and two
nodes.
Now I wanted to enhance the script a bit more and have it work in the
local directory of the nodes, which is /work.
To get everything up and running I used the script attached below (the
attachment also includes the output after running the script).
It should basically just copy all data to /work/tants/$SLURM_JOB_NAME
before running the MPI hello world.
But it seems that srun does not know $SLURM_JOB_NAME, even though it is
set.
/work/tants belongs to the correct user and has rwx permissions.
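If it helps with debugging, a minimal check I can run is something like
this (illustrative only, not part of the attached job):

# quick test whether the batch environment reaches an srun step
srun -N2 -n2 bash -c 'echo "$(hostname): SLURM_JOB_NAME=$SLURM_JOB_NAME"'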

So did I just configure something wrong, or what happened here? Nearly
the same example works on our cluster with
15.x. The script is only for testing purposes, that's why there are so
many echo commands in it.
If you see any mistakes or can recommend better configuration options,
I would gladly hear them.
Should you need any more information, I will provide it.
Thank you for your time!

Best regards,
Dennis

-- 
Dennis Tants
Apprentice: IT specialist for system integration (Fachinformatiker für Systemintegration)

ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity

Universität Bremen
Am Fallturm
28359 Bremen, Germany

Phone: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de

www.zarm.uni-bremen.de

#!/bin/bash
#SBATCH -J mpicopytest
#SBATCH -t 01:00:00
#SBATCH -N 2 -n 8
#SBATCH -o copy-test001-%j.out

# Preparation
echo "Script start"
echo "Job name: $SLURM_JOB_NAME with ID: $SLURM_JOB_ID"
echo "Allocated nodes: $SLURM_JOB_NODELIST"
echo "Amount of tasks to run: $SLURM_NTASKS"
echo "Changing to submit dir: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR
echo "Preparing data..."
echo "Copying files to /work on the nodes"
srun -N2 -n2 mkdir /work/tants/$SLURM_JOB_NAME
srun -N2 -n2 cp -r * /work/tants/$SLURM_JOB_NAME
echo "Finished prepation"

# Start of the actual computation
echo "Starting computation"
module load OpenMPI/1.10.2
cd /work/tants/$SLURM_JOB_NAME
time mpirun -np $SLURM_NTASKS ./mpi-hello-world
time srun -N2 -n8 ./mpi-hello-world
echo "Ending computation"

# Cleanup
echo "Starting cleanup process, copying back to the submit dir and deleting 
files in /work"
srun -N2 -n2 cp -r * $SLURM_SUBMIT_DIR
srun -N2 -n2 rm -rf /work/tants/$SLURM_JOB_NAME
echo "Script end"


# Actual output of the output file I specified (copy-test001-%j.out)
Script start
Job name: mpicopytest with ID: 54
Allocated nodes: node[1-2]
Amount of tasks to run: 8
Changing to submit dir: /home/tants/pbs
Preparing data...
Copying files to /work on the nodes
Script end
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=headnode
ControlAddr=192.168.80.254
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
EnforcePartLimits=YES
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
JobFileAppend=1
JobRequeue=0
#JobSubmitPlugins=1
#KillOnBadExit=0
LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/mail
MaxJobCount=5000
MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
TopologyPlugin=topology/none
TmpFS=/tmp
#TrackWCKey=no
TreeWidth=8
#UnkillableStepProgram=
UsePAM=0
#
#
# TIMERS
BatchStartTimeout=10
CompleteWait=8
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=6
#MessageTimeout=10
ResvOverRun=1
MinJobAge=300
OverTimeLimit=45
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
VSizeFactor=0
WaitTime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
PriorityFlags=ACCRUE_ALWAYS,INCR_ONLY
PriorityType=priority/multifactor
PriorityDecayHalfLife=2-12
PriorityCalcPeriod=120
PriorityFavorSmall=no
PriorityMaxAge=4-0
PriorityUsageResetPeriod=none
PriorityWeightAge=7000
PriorityWeightFairshare=1
PriorityWeightJobSize=3000
PriorityWeightPartition=1
PriorityWeightQOS=1
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=localhost
#AccountingStorageLoc=
#AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
#JobCompType=jobcomp/slurmdbd
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/scheduler.log
SlurmSchedLogLevel=1
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[1-2] NodeAddr=192.168.80.[171-172] CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3952 TmpDisk=30705 State=UNKNOWN
PartitionName=compute Nodes=node[1-2] Default=YES MaxTime=4-0 State=UP
