Damien, this claims you have the entire cluster allocated and also most of it down/drained at the same time. Is that the case? Did you have any reservations running?

When did the errors start? If you find jobs that are not running but have no time_end in the lemaitre2_job_table, it would be interesting if you could post the relevant lines for one of those jobs from the slurmctld.log (hopefully it wasn't too long ago and you still have it). It would also be interesting to see the columns for the jobs that have a time_end of 0 but are not running...

select id_job, from_unixtime(time_eligible), from_unixtime(time_start), from_unixtime(time_end), job_name, state from lemaitre2_job_table where time_end=0;
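If the query returns many rows, a small post-processing sketch can separate the interesting cases (jobs with no end time that are *not* running) from the expected ones. The pipe-delimited layout and field order below are my own assumption for illustration, not actual Slurm or MySQL output:

```python
# Hypothetical helper: given rows formatted as "id_job|state|end",
# flag entries whose end time is unset, which correspond to
# time_end = 0 rows in the job table.
def jobs_missing_end(lines):
    flagged = []
    for line in lines:
        job_id, state, end = line.split("|")
        if end in ("Unknown", "0"):
            flagged.append((job_id, state))
    return flagged

sample = [
    "523001|COMPLETED|2012-11-27T22:15:09",
    "523002|RUNNING|Unknown",     # expected: still running
    "523003|FAILED|Unknown",      # suspicious: finished but never got an end time
]
print(jobs_missing_end(sample))   # [('523002', 'RUNNING'), ('523003', 'FAILED')]
```

Rows in the second category (non-running state, no end time) are the ones worth tracing back through slurmctld.log.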

Thanks,
Danny

On 11/28/2012 12:43 AM, Damien François wrote:
Hello,

we have a cluster running Slurm 2.3.3 with slurmdbd and MySQL for accounting. We are very happy with how Slurm is doing the job, but we have noticed that 'sreport cluster utilization' and 'sreport cluster accountutilizationbyuser' report different values for the total allocated/used time over some time windows.

In the slurmdbd logs, we see a bunch of errors: "We have more allocated time than is possible" and "We have more time than is possible" (see below).

What could be the cause of such errors? What can we do to pinpoint the problem?

Any suggestion would be appreciated. Thanks in advance.

damien francois


slurmdbd.log:
[2012-11-28T00:00:00] error: We have more allocated time than is possible (5445695 > 4968000) for cluster lemaitre2(1380) from 2012-11-27T23:00:00 - 2012-11-28T00:00:00
[2012-11-28T00:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-27T23:00:00 - 2012-11-28T00:00:00
[2012-11-28T01:00:00] error: We have more allocated time than is possible (5439457 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T00:00:00 - 2012-11-28T01:00:00
[2012-11-28T01:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T00:00:00 - 2012-11-28T01:00:00
[2012-11-28T02:00:01] error: We have more allocated time than is possible (5429731 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T01:00:00 - 2012-11-28T02:00:00
[2012-11-28T02:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T01:00:00 - 2012-11-28T02:00:00
[2012-11-28T03:00:01] error: We have more allocated time than is possible (5340229 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T02:00:00 - 2012-11-28T03:00:00
[2012-11-28T03:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T02:00:00 - 2012-11-28T03:00:00
[2012-11-28T04:00:00] error: We have more allocated time than is possible (5386768 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T03:00:00 - 2012-11-28T04:00:00
[2012-11-28T04:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T03:00:00 - 2012-11-28T04:00:00
[2012-11-28T05:00:01] error: We have more allocated time than is possible (5372727 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T04:00:00 - 2012-11-28T05:00:00
[2012-11-28T05:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T04:00:00 - 2012-11-28T05:00:00
[2012-11-28T06:00:01] error: We have more allocated time than is possible (5329341 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T05:00:00 - 2012-11-28T06:00:00
[2012-11-28T06:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T05:00:00 - 2012-11-28T06:00:00
[2012-11-28T07:00:00] error: We have more allocated time than is possible (5344040 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T06:00:00 - 2012-11-28T07:00:00
[2012-11-28T07:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T06:00:00 - 2012-11-28T07:00:00
[2012-11-28T08:00:01] error: We have more allocated time than is possible (5424416 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T07:00:00 - 2012-11-28T08:00:00
[2012-11-28T08:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T07:00:00 - 2012-11-28T08:00:00
[2012-11-28T09:00:01] error: We have more allocated time than is possible (5459590 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T08:00:00 - 2012-11-28T09:00:00
[2012-11-28T09:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T08:00:00 - 2012-11-28T09:00:00
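The figures in these messages line up with the cluster size. A quick arithmetic sanity check (my assumption: the "(1380)" suffix after the cluster name is the cluster's CPU count, and each reporting window is one hour):

```python
# Sanity-check the numbers in the slurmdbd errors, assuming 1380 CPUs
# and one-hour reporting windows.
cpus = 1380
window_seconds = 3600

possible = cpus * window_seconds   # CPU-seconds available per hour
print(possible)                    # 4968000 -- matches the limit in the log

down = 4881600                     # the second term in (4968000+4881600+0)
print(down / window_seconds)       # 1356.0 -- i.e. 1356 CPUs' worth down/drained

allocated = 5445695                # from the 00:00:00 error line
print(allocated + down > possible) # True -- hence "more time than is possible"
```

So the accounting rollup believes nearly the whole cluster was allocated *and* 1356 of the 1380 CPUs were down in the same hour, which is exactly the contradiction Danny's first question is probing.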

Configuration data as of 2012-11-28T09:25:02
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost   = lmMGT01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = YES
AuthType                = auth/munge
BackupAddr              = lemaitre2
BackupController        = lemaitre2
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2012-11-08T08:27:26
CacheGroups             = 0
CheckpointType          = checkpoint/none
ClusterName             = lemaitre2
CompleteWait            = 3 sec
ControlAddr             = lmMGT01
ControlMachine          = lmMGT01
CryptoType              = crypto/munge
DebugFlags              = Wiki
DefMemPerCPU            = 4000
DisableRootJobs         = NO
EnforcePartLimits       = YES
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 10 sec
JobAcctGatherType       = jobacct_gather/linux
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KillOnBadExit           = 1
KillWait                = 10 sec
Licenses                = (null)
MailProg                = /usr/local/slurm/bin/sendmail.sh
MaxJobCount             = 10000
MaxJobId                = 4294901760
MaxMemPerNode           = 48000
MaxStepCount            = 10000
MaxTasksPerNode         = 24
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 523805
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = 0
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 500
PriorityWeightFairShare = 1000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 0
PriorityWeightQOS       = 0
PrivateData             = none
ProctrackType           = proctrack/linuxproc
Prolog                  = (null)
PrologSlurmctld         = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvOverRun             = 0 min
ReturnToService         = 1
SallocDefaultCommand    = (null)
SchedulerParameters     = (null)
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(102)
SlurmctldDebug          = 3
SlurmctldLogFile        = (null)
SlurmSchedLogFile       = (null)
SlurmctldPort           = 6817
SlurmctldTimeout        = 120 sec
SlurmdDebug             = 3
SlurmdLogFile           = (null)
SlurmdPidFile           = /var/run/slurm/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurm/slurmctld.pid
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 2.3.3
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /usr/local/slurm/data
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /scratch
TopologyPlugin          = topology/tree
TrackWCKey              = 1
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 20 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at lmMGT01/lemaitre2 are UP/UP



--
Damien François
Computer scientist & HPC sysadmin
Université catholique de Louvain -- http://www.uclouvain.be/damien.francois
