Wow, thanks a lot Hongjia Cao! It's working now with CPUs=1 and Sockets=1. Physically (cat /proc/cpuinfo) there is only 1 CPU, so declaring 2 CPUs in the slurm.conf file did not work.
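For anyone hitting the same problem, the fix is to make the node definition match what the hardware actually reports. A minimal sketch, assuming single-CPU VMs like the ones in this thread:

    # On each compute node, count the processors the kernel sees:
    grep -c '^processor' /proc/cpuinfo    # prints 1 on these VMs

    # Matching node definition in slurm.conf:
    NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN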
Thanks again, bye.

----- Original Message -----
From: "Hongjia Cao" <[email protected]>
To: "slurm-dev" <[email protected]>
Sent: Friday, September 13, 2013 06:24:53
Subject: [slurm-dev] Re: nodes often down

I guess that you have the wrong number of CPUs per node configured. Please try changing the configuration file. Or you may try FastSchedule=1.

On Tue, 2013-09-10 at 23:54 -0700, Sivasangari Nandy wrote:

    root@VM-667:/omaha-beach/workflow# sinfo -R
    REASON    USER   TIMESTAMP            NODELIST
    Low CPUs  slurm  2013-09-10T19:38:37  VM-[669-671]

Then when I changed the nodes to idle and typed sinfo -R, there was nothing. I wanted to know how to keep the nodes permanently idle.

______________________________________________________________________
From: "曹宏嘉" <[email protected]>
To: "slurm-dev" <[email protected]>
Sent: Wednesday, September 11, 2013 02:26:54
Subject: [slurm-dev] Re: nodes often down

You may run "sinfo -R" to see the reason that the node is left down. ReturnToService=1 cannot recover all down nodes.

----- Original Message -----
From: "Sivasangari Nandy" <[email protected]>
Sent: 2013-09-10 20:38:43 (Tuesday)
To: slurm-dev <[email protected]>
Cc: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: nodes often down

No, it was like that at first, so by default. And yes, I've restarted slurm, but no change: after 10 minutes the nodes are all down again.

Here is my conf file if needed:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=VM-667
ControlAddr=192.168.2.26
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=VM-[669-671] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP
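A node that registers with fewer CPUs than its slurm.conf definition fails the validity check and is set DOWN with the reason "Low CPUs", which is exactly what the sinfo -R output above shows. To get values that match the hardware, slurmd can print the configuration it detects in slurm.conf syntax; a minimal sketch (the sample output line is illustrative, not captured from these VMs):

    # Run on a compute node: prints the detected hardware in slurm.conf syntax
    slurmd -C
    # e.g. NodeName=VM-669 CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 ...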
______________________________________________________________________
From: "Alan V. Cowles" <[email protected]>
To: "slurm-dev" <[email protected]>
Cc: "Sivasangari Nandy" <[email protected]>
Sent: Tuesday, September 10, 2013 13:17:36
Subject: Re: [slurm-dev] nodes often down

Siva,

Was that the default setting you had in place with your original config, or a change you made recently to combat the downed-nodes problem? And did you restart slurm or do a reconfigure to re-read the slurm.conf file? I've found some changes don't take effect with a reconfigure, and you have to restart.

AC

On 09/10/2013 04:01 AM, Sivasangari Nandy wrote:

Hello,

My nodes are often in the "down" state, so I have to open 'sview' and activate all the nodes manually to put them back to 'idle' in order to run jobs. I've seen in the FAQ that I can change the slurm.conf file:

    "The configuration parameter ReturnToService in slurm.conf controls how DOWN nodes are handled. Set its value to one in order for DOWN nodes to automatically be returned to service once the slurmd daemon registers with a valid node configuration."

However, in my file ReturnToService is already set to "1".

Thanks in advance,
Siva
--
Sivasangari NANDY - GenOuest platform
IRISA-INRIA, Campus de Beaulieu
263 Avenue du Général Leclerc
35042 Rennes cedex, France
Tel: +33 (0)2 99 84 25 69
Office: D152
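For anyone who lands on this thread with the same symptom, the diagnose-and-recover loop discussed above comes down to a few commands. A sketch using the node names from this thread:

    # Why are the nodes down?
    sinfo -R

    # Return downed nodes to service by hand (the command-line
    # equivalent of flipping them to idle in sview):
    scontrol update NodeName=VM-[669-671] State=RESUME

    # After editing slurm.conf, re-read it; as noted above, some
    # changes only take effect after restarting the daemons:
    scontrol reconfigure

As the thread shows, though, nodes will keep dropping to DOWN until the NodeName definition matches the real hardware; ReturnToService=1 only readmits a node once slurmd registers with a valid configuration.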
