Hi list, I am a bit lost right now and would appreciate your help. We have a small cluster with 16 nodes running SLURM, and it does everything we want, except for a few little things I would like to improve.
That is why I wanted to upgrade our old SLURM 15.X (I don't know the exact version) to 17.02.4 on my test machine. I deleted the old version completely with 'yum erase slurm-*' (CentOS 7, by the way) and built the new version with rpmbuild. Everything went fine, so I started configuring a new slurm.conf and slurmdbd.conf. This time I also wanted to use backfill instead of FIFO and enable accounting (just to know which person uses the most resources). We had no databases yet, but I got slurmdbd and slurmctld started without problems. Everything seemed fine with a simple MPI hello-world test on one and two nodes.

Now I wanted to extend the script a bit and have it work in the nodes' local directory, which is /work. To get everything up and running I used the script attached below (it also includes the output produced by running it). It should basically just copy all data to /work/tants/$SLURM_JOB_NAME before running the MPI hello world. But it seems that srun does not know $SLURM_JOB_NAME, even though the variable is set (the echo at the top of the script prints it correctly). /work/tants belongs to the correct user and has rwx permissions. So did I just configure something wrong, or what happened here? Nearly the same example works on our cluster with 15.X. The script is only for testing purposes, that is why there are so many echo commands in it.

If you see any mistake or can recommend better configuration settings, I would gladly hear about them. Should you need any more information, I will provide it. Thank you for your time!

Best regards,
Dennis

--
Dennis Tants
Auszubildender: Fachinformatiker für Systemintegration

ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity

Universität Bremen
Am Fallturm
28359 Bremen, Germany

Telefon: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de
www.zarm.uni-bremen.de
#!/bin/bash
#SBATCH -J mpicopytest
#SBATCH -t 01:00:00
#SBATCH -N 2 -n 8
#SBATCH -o copy-test001-%j.out

# Preparation
echo "Script start"
echo "Job name: $SLURM_JOB_NAME with ID: $SLURM_JOB_ID"
echo "Allocated nodes: $SLURM_JOB_NODELIST"
echo "Amount of tasks to run: $SLURM_NTASKS"
echo "Changing to submit dir: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR
echo "Preparing data..."
echo "Copying files to /work on the nodes"
srun -N2 -n2 mkdir /work/tants/$SLURM_JOB_NAME
srun -N2 -n2 cp -r * /work/tants/$SLURM_JOB_NAME
echo "Finished preparation"

# Start of the actual work
echo "Starting computation"
module load OpenMPI/1.10.2
cd /work/tants/$SLURM_JOB_NAME
time mpirun -np $SLURM_NTASKS ./mpi-hello-world
time srun -N2 -n8 ./mpi-hello-world
echo "Ending computation"

# Cleanup
echo "Starting cleanup process, copying back to the submit dir and deleting files in /work"
srun -N2 -n2 cp -r * $SLURM_SUBMIT_DIR
srun -N2 -n2 rm -rf /work/tants/$SLURM_JOB_NAME
echo "Script end"

# Actual content of the output file I specified
Output of copy-test001-%j.out

Script start
Job name: mpicopytest with ID: 54
Allocated nodes: node[1-2]
Amount of tasks to run: 8
Changing to submit dir: /home/tants/pbs
Preparing data...
Copying files to /work on the nodes
Script end
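In case it helps narrow this down, here is a minimal check I put together (just a sketch, I have not run it on the test machine yet); the job name "varcheck" and the WORKDIR variable are only illustrative, the paths are the same as in the script above. My understanding is that the batch shell expands $SLURM_JOB_NAME itself before srun is even called, so printing the variable both locally and inside the step should show where it gets lost, and expanding the target path once into a variable should sidestep the question either way:

#!/bin/bash
#SBATCH -J varcheck
#SBATCH -t 00:05:00
#SBATCH -N 2 -n 2
#SBATCH -o varcheck-%j.out

# What the batch shell itself sees (this expansion happens before srun runs)
echo "Batch shell: SLURM_JOB_NAME=$SLURM_JOB_NAME"

# What each remote task sees; single quotes delay expansion to the remote shell
srun -N2 -n2 bash -c 'echo "$(hostname): SLURM_JOB_NAME=$SLURM_JOB_NAME"'

# Expand the target path once in the batch shell and pass it to the steps explicitly
WORKDIR="/work/tants/$SLURM_JOB_NAME"
srun -N2 -n2 mkdir -p "$WORKDIR"
srun -N2 -n2 ls -ld "$WORKDIR"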
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=headnode
ControlAddr=192.168.80.254
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
EnforcePartLimits=YES
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
JobFileAppend=1
JobRequeue=0
#JobSubmitPlugins=1
#KillOnBadExit=0
LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/mail
MaxJobCount=5000
MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
TopologyPlugin=topology/none
TmpFS=/tmp
#TrackWCKey=no
TreeWidth=8
#UnkillableStepProgram=
UsePAM=0
#
#
# TIMERS
BatchStartTimeout=10
CompleteWait=8
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=6
#MessageTimeout=10
ResvOverRun=1
MinJobAge=300
OverTimeLimit=45
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
VSizeFactor=0
WaitTime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
PriorityFlags=ACCRUE_ALWAYS,INCR_ONLY
PriorityType=priority/multifactor
PriorityDecayHalfLife=2-12
PriorityCalcPeriod=120
PriorityFavorSmall=no
PriorityMaxAge=4-0
PriorityUsageResetPeriod=none
PriorityWeightAge=7000
PriorityWeightFairshare=1
PriorityWeightJobSize=3000
PriorityWeightPartition=1
PriorityWeightQOS=1
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=localhost
#AccountingStorageLoc=
#AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
#JobCompType=jobcomp/slurmdbd
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/scheduler.log
SlurmSchedLogLevel=1
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[1-2] NodeAddr=192.168.80.[171-172] CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3952 TmpDisk=30705 State=UNKNOWN
PartitionName=compute Nodes=node[1-2] Default=YES MaxTime=4-0 State=UP
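For the accounting side, this is roughly what I was planning to run once slurmdbd is up, mostly to see per-user usage. It is only a sketch: the account name "zarm" and the date range are examples I made up, and only the cluster name matches the ClusterName above:

# Register the cluster with slurmdbd (name must match ClusterName in slurm.conf)
sacctmgr add cluster cluster

# Create an account and attach a user to it (names here are only examples)
sacctmgr add account zarm Description="ZARM users" Organization=zarm
sacctmgr add user tants DefaultAccount=zarm

# Per-user usage for a given period
sreport cluster AccountUtilizationByUser start=2017-06-01 end=2017-07-01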