Hi Danny,

Yes, of course... Here it is.
N

On Tue, Aug 2, 2011 at 6:52 PM, Danny Auble <[email protected]> wrote:
> Hey Nicolas, could you send your complete slurm.conf? It would be
> interesting to see the other plugins you are using that may be
> contributing to the problem.
>
> Danny
>
> On Tuesday August 02 2011 6:43:17 PM you wrote:
> > Hi all,
> >
> > I'm having issues with slurm 2.2.7 and specifying the nodes' CPU
> > information.
> >
> > If I set the number of sockets, cores per socket and threads per core
> > like this:
> >
> > > NodeName=node[2-4] RealMemory=23000 Sockets=2 CoresPerSocket=4
> > > ThreadsPerCore=2 State=UNKNOWN
> >
> > and submit a job, slurmctld crashes. The last section of slurmctld.log
> > is:
> >
> > > [2011-08-02T17:58:50] debug2: initial priority for job 49852 is 98
> > > [2011-08-02T17:58:50] debug2: found 3 usable nodes from config containing node[2-4]
> > > [2011-08-02T17:58:50] debug3: _pick_best_nodes: job 49852 idle_nodes 65 share_nodes 76
> > > [2011-08-02T17:58:50] debug2: sched: JobId=49852 allocated resources: NodeList=(null)
> > > [2011-08-02T17:58:50] _slurm_rpc_submit_batch_job JobId=49852 usec=1540
> > > [2011-08-02T17:58:50] debug: sched: Running job scheduler
> > > [2011-08-02T17:58:50] debug2: found 3 usable nodes from config containing node[2-4]
> > > [2011-08-02T17:58:50] debug3: _pick_best_nodes: job 49852 idle_nodes 65 share_nodes 76
> > > [2011-08-02T17:58:50] fatal: cons_res: sync loop not progressing
> >
> > I've also seen the error "cons_res: cpus computation error".
> >
> > There might be something wrong with my configuration, but slurm should
> > tell me so, not crash when a job is submitted...
> >
> > I'm playing with these options because a user reported that just using
> > Procs=16 would not spread his MPI processes across the allocated nodes.
> > I've fixed that by using --nodes=*-* and --ntasks-per-node=*, but the
> > crash is still relevant, I guess...
> >
> > Could it be a bug?
> >
> > Thanks
> >
> > Nicolas
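Since the thread keeps circling back to CPU topology, here is a quick sanity check, in Python, of the arithmetic the consumable-resources selection relies on. The helper name is mine, not a SLURM API; the point is only that the topology in the crashing NodeName line does multiply out to 16, the same CPU count given as Procs=16 elsewhere in the attached slurm.conf, so a simple Procs/topology mismatch is probably not the trigger here.

```python
# Hypothetical helper (not part of SLURM): total logical CPUs implied
# by a NodeName topology specification.
def topology_cpus(sockets, cores_per_socket, threads_per_core):
    return sockets * cores_per_socket * threads_per_core

# Topology from the crashing NodeName line quoted above:
cpus = topology_cpus(sockets=2, cores_per_socket=4, threads_per_core=2)
print(cpus)  # 16, matching the Procs=16 used for these nodes
```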
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=unicron
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=1
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
MaxJobCount=100000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/tmp/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/tmp/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
#
#
# JOB PRIORITY
PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
PriorityFavorSmall=YES
#PriorityMaxAge=
#PriorityUsageResetPeriod=
PriorityWeightAge=100
#PriorityWeightFairshare=
PriorityWeightJobSize=100
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
ClusterName=unicron
#DebugFlags=
# Disable /etc/slurm.conf checking:
#DebugFlags=NO_CONF_HASH
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
SlurmSchedLogFile=/var/log/slurmsched.log
SlurmSchedLogLevel=3
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[2-71,101-105] RealMemory=23000 Procs=16 State=UNKNOWN
NodeName=unicron RealMemory=12000 Procs=8 State=UNKNOWN
NodeName=nodeKVM RealMemory=494 Procs=1 State=UNKNOWN
PartitionName=test Nodes=node[67,70-71] MaxTime=INFINITE State=UP
PartitionName=unicron Nodes=unicron MaxTime=INFINITE State=UP
PartitionName=wholecluster Nodes=node[2-71,101-105] MaxTime=INFINITE State=UP
PartitionName=normalnodes Nodes=node[2-71] Default=YES MaxTime=INFINITE State=UP
PartitionName=supernodes Nodes=node[101-105] MaxTime=INFINITE State=UP
PartitionName=virtual Nodes=nodeKVM MaxTime=INFINITE State=UP
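For reference, the node definition that triggered the crash in my original mail was the explicit-topology variant of the first COMPUTE NODES line above, applied to nodes 2-4 only (sketch as tested on my side; note that 2 x 4 x 2 = 16 logical CPUs, consistent with the Procs=16 these nodes normally use):

```
# Explicit topology form from the original report (crashed slurmctld):
NodeName=node[2-4] RealMemory=23000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
```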
