Damien, this claims you have the entire cluster allocated and also most of it down/drained at the same time. Is that the case? Did you have any reservations running?

When did the errors start? If you find jobs that are not running but have no time_end in the lemaitre2_job_table, it would be interesting if you could post the relevant lines for one of those jobs from the slurmctld.log (hopefully it wasn't too long ago and you still have it). It would also be interesting to see the columns for the jobs that have a time_end of 0 but are not running...

select id_job, from_unixtime(time_eligible), from_unixtime(time_start), from_unixtime(time_end), job_name, state from lemaitre2_job_table where time_end=0;
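If the query returns many rows, a small post-processing sketch can separate the interesting cases (jobs with no end time that are *not* running) from the expected ones. The pipe-delimited layout and field order below are my own assumption for illustration, not actual Slurm or MySQL output:

```python
# Hypothetical helper: given rows formatted as "id_job|state|end",
# flag entries whose end time is unset, which correspond to
# time_end = 0 rows in the job table.
def jobs_missing_end(lines):
    flagged = []
    for line in lines:
        job_id, state, end = line.split("|")
        if end in ("Unknown", "0"):
            flagged.append((job_id, state))
    return flagged

sample = [
    "523001|COMPLETED|2012-11-27T22:15:09",
    "523002|RUNNING|Unknown",     # expected: still running
    "523003|FAILED|Unknown",      # suspicious: finished but never got an end time
]
print(jobs_missing_end(sample))   # [('523002', 'RUNNING'), ('523003', 'FAILED')]
```

Rows in the second category (non-running state, no end time) are the ones worth tracing back through slurmctld.log.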

Thanks,
Danny

On 11/28/2012 12:43 AM, Damien François wrote:
Hello,

we have a cluster running Slurm 2.3.3 with slurmdbd and MySQL for accounting. We are very happy with how Slurm is doing the job, but we have noticed that 'sreport cluster utilization' and 'sreport cluster accountutilizationbyuser' report different values for the total allocated/used time over some time windows.

In the slurmdbd logs, we see a bunch of errors: "We have more allocated time than is possible" and "We have more time than is possible" (see below).

What could be the cause of such errors? What can we do to pinpoint the problem?

Any suggestion would be appreciated. Thanks in advance.

damien francois


slurmdbd.log:
[2012-11-28T00:00:00] error: We have more allocated time than is possible (5445695 > 4968000) for cluster lemaitre2(1380) from 2012-11-27T23:00:00 - 2012-11-28T00:00:00
[2012-11-28T00:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-27T23:00:00 - 2012-11-28T00:00:00
[2012-11-28T01:00:00] error: We have more allocated time than is possible (5439457 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T00:00:00 - 2012-11-28T01:00:00
[2012-11-28T01:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T00:00:00 - 2012-11-28T01:00:00
[2012-11-28T02:00:01] error: We have more allocated time than is possible (5429731 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T01:00:00 - 2012-11-28T02:00:00
[2012-11-28T02:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T01:00:00 - 2012-11-28T02:00:00
[2012-11-28T03:00:01] error: We have more allocated time than is possible (5340229 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T02:00:00 - 2012-11-28T03:00:00
[2012-11-28T03:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T02:00:00 - 2012-11-28T03:00:00
[2012-11-28T04:00:00] error: We have more allocated time than is possible (5386768 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T03:00:00 - 2012-11-28T04:00:00
[2012-11-28T04:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T03:00:00 - 2012-11-28T04:00:00
[2012-11-28T05:00:01] error: We have more allocated time than is possible (5372727 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T04:00:00 - 2012-11-28T05:00:00
[2012-11-28T05:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T04:00:00 - 2012-11-28T05:00:00
[2012-11-28T06:00:01] error: We have more allocated time than is possible (5329341 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T05:00:00 - 2012-11-28T06:00:00
[2012-11-28T06:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T05:00:00 - 2012-11-28T06:00:00
[2012-11-28T07:00:00] error: We have more allocated time than is possible (5344040 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T06:00:00 - 2012-11-28T07:00:00
[2012-11-28T07:00:00] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T06:00:00 - 2012-11-28T07:00:00
[2012-11-28T08:00:01] error: We have more allocated time than is possible (5424416 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T07:00:00 - 2012-11-28T08:00:00
[2012-11-28T08:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T07:00:00 - 2012-11-28T08:00:00
[2012-11-28T09:00:01] error: We have more allocated time than is possible (5459590 > 4968000) for cluster lemaitre2(1380) from 2012-11-28T08:00:00 - 2012-11-28T09:00:00
[2012-11-28T09:00:01] error: We have more time than is possible (4968000+4881600+0)(9849600) > 4968000 for cluster lemaitre2(1380) from 2012-11-28T08:00:00 - 2012-11-28T09:00:00
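The figures in these messages line up with the cluster size. A quick arithmetic sanity check (my assumption: the "(1380)" suffix after the cluster name is the cluster's CPU count, and each reporting window is one hour):

```python
# Sanity-check the numbers in the slurmdbd errors, assuming 1380 CPUs
# and one-hour reporting windows.
cpus = 1380
window_seconds = 3600

possible = cpus * window_seconds   # CPU-seconds available per hour
print(possible)                    # 4968000 -- matches the limit in the log

down = 4881600                     # the second term in (4968000+4881600+0)
print(down / window_seconds)       # 1356.0 -- i.e. 1356 CPUs' worth down/drained

allocated = 5445695                # from the 00:00:00 error line
print(allocated + down > possible) # True -- hence "more time than is possible"
```

So the accounting rollup believes nearly the whole cluster was allocated *and* 1356 of the 1380 CPUs were down in the same hour, which is exactly the contradiction Danny's first question is probing.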

Configuration data as of 2012-11-28T09:25:02
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost   = lmMGT01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = YES
AuthType                = auth/munge
BackupAddr              = lemaitre2
BackupController        = lemaitre2
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2012-11-08T08:27:26
CacheGroups             = 0
CheckpointType          = checkpoint/none
ClusterName             = lemaitre2
CompleteWait            = 3 sec
ControlAddr             = lmMGT01
ControlMachine          = lmMGT01
CryptoType              = crypto/munge
DebugFlags              = Wiki
DefMemPerCPU            = 4000
DisableRootJobs         = NO
EnforcePartLimits       = YES
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 10 sec
JobAcctGatherType       = jobacct_gather/linux
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KillOnBadExit           = 1
KillWait                = 10 sec
Licenses                = (null)
MailProg                = /usr/local/slurm/bin/sendmail.sh
MaxJobCount             = 10000
MaxJobId                = 4294901760
MaxMemPerNode           = 48000
MaxStepCount            = 10000
MaxTasksPerNode         = 24
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 523805
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = 0
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 500
PriorityWeightFairShare = 1000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 0
PriorityWeightQOS       = 0
PrivateData             = none
ProctrackType           = proctrack/linuxproc
Prolog                  = (null)
PrologSlurmctld         = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvOverRun             = 0 min
ReturnToService         = 1
SallocDefaultCommand    = (null)
SchedulerParameters     = (null)
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(102)
SlurmctldDebug          = 3
SlurmctldLogFile        = (null)
SlurmSchedLogFile       = (null)
SlurmctldPort           = 6817
SlurmctldTimeout        = 120 sec
SlurmdDebug             = 3
SlurmdLogFile           = (null)
SlurmdPidFile           = /var/run/slurm/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurm/slurmctld.pid
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 2.3.3
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /usr/local/slurm/data
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /scratch
TopologyPlugin          = topology/tree
TrackWCKey              = 1
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 20 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at lmMGT01/lemaitre2 are UP/UP



--
Damien François
Computer scientist & HPC sysadmin
Université catholique de Louvain -- http://www.uclouvain.be/damien.francois
