Wow, thanks a lot Hongjia Cao! It's working now with CPUs=1 and Sockets=1. Physically (cat /proc/cpuinfo) there is only 1 CPU, so declaring 2 CPUs in the slurm.conf file did not work.
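For anyone hitting the same problem, the fix is to make the node definition match what the hardware actually reports. A minimal sketch, assuming single-CPU VMs like the ones in this thread:

    # On each compute node, count the processors the kernel sees:
    grep -c '^processor' /proc/cpuinfo    # prints 1 on these VMs

    # Matching node definition in slurm.conf:
    NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN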
Thanks again, bye.

----- Original Message -----
From: "Hongjia Cao" <[email protected]>
To: "slurm-dev" <[email protected]>
Sent: Friday, September 13, 2013 06:24:53
Subject: [slurm-dev] Re: nodes often down

I guess that you have the wrong number of CPUs per node configured. Please try changing the configuration file. Or you may try FastSchedule=1.

On Tue, 2013-09-10 at 23:54 -0700, Sivasangari Nandy wrote:

    root@VM-667:/omaha-beach/workflow# sinfo -R
    REASON    USER   TIMESTAMP            NODELIST
    Low CPUs  slurm  2013-09-10T19:38:37  VM-[669-671]

Then when I changed the nodes to idle and typed sinfo -R, there was nothing. I wanted to know how to keep the nodes permanently idle.

______________________________________________________________________
From: "曹宏嘉" <[email protected]>
To: "slurm-dev" <[email protected]>
Sent: Wednesday, September 11, 2013 02:26:54
Subject: [slurm-dev] Re: nodes often down

You may run "sinfo -R" to see the reason that the node is left down. ReturnToService=1 cannot recover all down nodes.

----- Original Message -----
From: "Sivasangari Nandy" <[email protected]>
Sent: 2013-09-10 20:38:43 (Tuesday)
To: slurm-dev <[email protected]>
Cc: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: nodes often down

No, it was like that at first, so by default. And yes, I've restarted slurm, but no change: after 10 minutes the nodes are all down again.

Here is my conf file if needed:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=VM-667
ControlAddr=192.168.2.26
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=VM-[669-671] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP
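A node that registers with fewer CPUs than its slurm.conf definition fails the validity check and is set DOWN with the reason "Low CPUs", which is exactly what the sinfo -R output above shows. To get values that match the hardware, slurmd can print the configuration it detects in slurm.conf syntax; a minimal sketch (the sample output line is illustrative, not captured from these VMs):

    # Run on a compute node: prints the detected hardware in slurm.conf syntax
    slurmd -C
    # e.g. NodeName=VM-669 CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 ...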
______________________________________________________________________
From: "Alan V. Cowles" <[email protected]>
To: "slurm-dev" <[email protected]>
Cc: "Sivasangari Nandy" <[email protected]>
Sent: Tuesday, September 10, 2013 13:17:36
Subject: Re: [slurm-dev] nodes often down

Siva,

Was that the default setting you had in place with your original config, or a change you made recently to combat the downed-nodes problem? And did you restart slurm or do a reconfigure to re-read the slurm.conf file? I've found some changes don't take effect with a reconfigure, and you have to restart.

AC

On 09/10/2013 04:01 AM, Sivasangari Nandy wrote:

Hello,

My nodes are often in the "down" state, so I have to open 'sview' and activate all the nodes manually to put them back to 'idle' in order to run jobs. I've seen in the FAQ that I can change the slurm.conf file:

    "The configuration parameter ReturnToService in slurm.conf controls how DOWN nodes are handled. Set its value to one in order for DOWN nodes to automatically be returned to service once the slurmd daemon registers with a valid node configuration."

However, in my file ReturnToService is already set to "1".

Thanks in advance,
Siva
--
Sivasangari NANDY - GenOuest platform
IRISA-INRIA, Campus de Beaulieu
263 Avenue du Général Leclerc
35042 Rennes cedex, France
Tel: +33 (0)2 99 84 25 69
Office: D152
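For anyone who lands on this thread with the same symptom, the diagnose-and-recover loop discussed above comes down to a few commands. A sketch using the node names from this thread:

    # Why are the nodes down?
    sinfo -R

    # Return downed nodes to service by hand (the command-line
    # equivalent of flipping them to idle in sview):
    scontrol update NodeName=VM-[669-671] State=RESUME

    # After editing slurm.conf, re-read it; as noted above, some
    # changes only take effect after restarting the daemons:
    scontrol reconfigure

As the thread shows, though, nodes will keep dropping to DOWN until the NodeName definition matches the real hardware; ReturnToService=1 only readmits a node once slurmd registers with a valid configuration.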
