[slurm-dev] srun can't use variables in a batch script after upgrade
Hi list,

I am a little bit lost right now and would appreciate your help. We have a small cluster with 16 nodes running SLURM and it is doing everything we want, except for a few little things I want to improve. That is why I wanted to upgrade our old SLURM 15.X (I don't know the exact version) to 17.02.4 on my test machine. I deleted the old version completely with 'yum erase slurm-*' (CentOS 7, by the way) and built the new version with rpmbuild. Everything went fine, so I started configuring a new slurm[dbd].conf. This time I also wanted to use backfill instead of FIFO and enable accounting (just to know which person uses the most resources). Because we had no databases yet, I started slurmdbd and slurmctld without problems. Everything seemed fine with a simple MPI hello world test on one and two nodes.

Now I wanted to enhance the script a bit more and include working in the local directory of the nodes, which is /work. To get everything up and running I used the script attached below (it also includes the output after running the script). It should basically just copy all data to /work/tants/$SLURM_JOB_NAME before doing the MPI hello world. But it seems that srun does not know $SLURM_JOB_NAME even though it is set. /work/tants belongs to the correct user and has rwx permissions. So did I just configure something wrong, or what happened here? Nearly the same example works on our cluster with 15.X. The script is only for testing purposes, that is why there are so many echo commands in it. If you see any mistake or can recommend better configurations I would gladly hear them. Should you need any more information, I will provide it.

Thank you for your time!

Best regards,
Dennis

--
Dennis Tants
Auszubildender: Fachinformatiker für Systemintegration

ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity
Universität Bremen
Am Fallturm
28359 Bremen, Germany

Telefon: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de
www.zarm.uni-bremen.de

#!/bin/bash
#SBATCH -J mpicopytest
#SBATCH -t 01:00:00
#SBATCH -N 2 -n 8
#SBATCH -o copy-test001-%j.out

# Preparation
echo "Script start"
echo "Job name: $SLURM_JOB_NAME with ID: $SLURM_JOB_ID"
echo "Allocated nodes: $SLURM_JOB_NODELIST"
echo "Amount of tasks to run: $SLURM_NTASKS"
echo "Changing to submit dir: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR

echo "Preparing data..."
echo "Copying files to /work on the nodes"
srun -N2 -n2 mkdir /work/tants/$SLURM_JOB_NAME
srun -N2 -n2 cp -r * /work/tants/$SLURM_JOB_NAME
echo "Finished preparation"

# Start of the actual work
echo "Starting computation"
module load OpenMPI/1.10.2
cd /work/tants/$SLURM_JOB_NAME
time mpirun -np $SLURM_NTASKS ./mpi-hello-world
time srun -N2 -n8 ./mpi-hello-world
echo "Ending computation"

# Cleanup
echo "Starting cleanup process, copying back to the submit dir and deleting files in /work"
srun -N2 -n2 cp -r * $SLURM_SUBMIT_DIR
srun -N2 -n2 rm -rf /work/tants/$SLURM_JOB_NAME
echo "Script end"

# Actual output of the output file I specified
Output of copy-test001-%j.out:
Script start
Job name: mpicopytest with ID: 54
Allocated nodes: node[1-2]
Amount of tasks to run: 8
Changing to submit dir: /home/tants/pbs
Preparing data...
Copying files to /work on the nodes
Script end

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=headnode
ControlAddr=192.168.80.254
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
EnforcePartLimits=YES
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
JobFileAppend=1
JobRequeue=0
#JobSubmitPlugins=1
#KillOnBadExit=0
LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/mail
MaxJobCount=5000
MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
TopologyPlugin=topology/none
TmpFS=/tmp
#TrackWCKey=no
TreeWidth=8
#UnkillableStepProgram=
UsePAM=0
#
#
# TIMERS
BatchStartTimeout=10
CompleteWait=8
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=6
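Not part of the original post, just a sketch of how one might narrow the problem down. Since the batch script runs under bash and /work/tants is stated to exist with the right permissions, one way to check whether $SLURM_JOB_NAME actually reaches the job steps, and to make an empty expansion fail loudly instead of silently copying into /work/tants/ itself, would be something like:

    # Print the variable from inside a job step (runs on the compute nodes):
    srun -N2 -n2 bash -c 'echo "$(hostname): SLURM_JOB_NAME=${SLURM_JOB_NAME}"'

    # Quote the expansion and abort if it is unset or empty, so a bad expansion
    # cannot turn the target into plain /work/tants/:
    srun -N2 -n2 mkdir -p "/work/tants/${SLURM_JOB_NAME:?is not set}"
    srun -N2 -n2 cp -r ./* "/work/tants/${SLURM_JOB_NAME}/"

It may also be worth noting that the posted slurm.conf sets MaxStepCount=4 while the script starts six srun job steps; whether that is related to the symptom is only a guess.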
[slurm-dev] Re: Length of possible SlurmDBD outage without HA
While I can't tell you how much memory is required, I can say that I've seen warnings in the logs when this has happened, including a more urgent warning that the cache is about to fill up.

Cheers,
Adam

On Thu, Jul 6, 2017 at 10:54 AM, Loris Bennett wrote:
>
> Hi,
>
> On the Slurm FAQ page
>
>     https://slurm.schedmd.com/faq.html
>
> it says the following:
>
>   52. How critical is configuring high availability for my database?
>
>   Consider if you really need mysql failover. Short outage of
>   slurmdbd is not a problem, because slurmctld will store all data
>   in memory and send it to slurmdbd when it's back operating. The
>   slurmctld daemon will also cache all user limits and fair share
>   information.
>
> I was wondering how long a "short outage" can be. Presumably this is
> determined by the amount of free memory on the server running slurmctld,
> the number of jobs, and the amount of memory required per job.
>
> So roughly how much memory will be required per job?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin     Email loris.benn...@fu-berlin.de
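Again only a sketch, not from the original thread. One way to watch how much accounting data slurmctld is buffering while slurmdbd is unreachable is the DBD agent queue that sdiag reports; the log path below is an assumption and the exact warning wording differs between Slurm versions:

    # Current number of records slurmctld is holding for slurmdbd:
    sdiag | grep -i 'dbd agent queue size'

    # Look for the queue warnings mentioned above (path and message text may differ):
    grep -i 'agent queue' /var/log/slurm/slurmctld.log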