[slurm-dev] Fwd: Delivery Status Notification (Failure)

Manal Bouabida Tue, 24 Sep 2013 10:33:21 -0700

Hi slurm-dev,

 We have a CPU cluster with 32 compute nodes. Recently we ran into a
problem when submitting jobs with Slurm's sbatch command: we cannot
allocate the nodes requested in the job script.


Some jobs manage to allocate 1 or 2 nodes, but usually the jobs fail to
allocate any node at all.
First, our slurm.conf is included below.

Second, we checked the slurmctld log with the command:

    tail -f /var/log/slurm/slurmctld.log

Whenever we try to run a job, we get the following messages:

[2013-09-17T12:34:33] slurmdbd: agent queue size 10000

[2013-09-17T12:34:33] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW

[2013-09-17T12:34:33] error: slurmdbd: agent queue is full, discarding request

[2013-09-17T12:34:33] error: slurmdbd: agent queue is full, discarding request

[2013-09-17T12:34:33] completing job 2257

[2013-09-17T12:34:33] error: slurmdbd: agent queue is full, discarding request

[2013-09-17T12:34:33] sched: job_complete for JobId=2257 successful

[2013-09-17T12:34:43] slurmdbd: agent queue size 10000

[2013-09-17T12:34:53] slurmdbd: agent queue size 10000

[2013-09-17T12:35:03] slurmdbd: agent queue size 10000

[2013-09-17T12:35:13] slurmdbd: agent queue size 10000

[2013-09-17T12:35:20] _slurm_rpc_submit_batch_job JobId=2258 usec=220

[2013-09-17T12:35:20] sched: Allocate JobId=2258 NodeList=haytham[25-32] #CPUs=96

[2013-09-17T12:35:21] error: slurmdbd: agent queue is full, discarding request

[2013-09-17T12:35:21] error: slurmdbd: agent queue is full, discarding request

[2013-09-17T12:35:21] completing job 2258

[2013-09-17T12:35:21] error: slurmdbd: agent queue is full, discarding request

[2013-09-17T12:35:21] sched: job_complete for JobId=2258 successful

[2013-09-17T12:35:31] slurmdbd: agent queue size 10000

[2013-09-17T12:35:41] slurmdbd: agent queue size 10000

[2013-09-17T12:35:51] slurmdbd: agent queue size 10000
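
For anyone reading along, a log like the one above can be condensed with a short script. This is only a minimal sketch: the function name `summarize` and the embedded sample are illustrative, the sample lines are copied from the excerpt above, and it assumes one log entry per line (that is, with the email line wrapping undone).

```python
import re

# Sample lines copied from the slurmctld.log excerpt in this message.
SAMPLE_LOG = """\
[2013-09-17T12:34:33] slurmdbd: agent queue size 10000
[2013-09-17T12:34:33] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2013-09-17T12:34:33] error: slurmdbd: agent queue is full, discarding request
[2013-09-17T12:34:33] completing job 2257
[2013-09-17T12:35:21] error: slurmdbd: agent queue is full, discarding request
"""

# "[timestamp] message" as seen in the excerpt.
LINE_RE = re.compile(r"^\[([^\]]+)\]\s+(.*)$")
QUEUE_RE = re.compile(r"agent queue size (\d+)")


def summarize(log_text):
    """Count slurmdbd agent-queue messages (hypothetical helper)."""
    discarded = 0   # "agent queue is full, discarding request"
    filling = 0     # "agent queue filling, RESTART SLURMDBD NOW"
    max_queue = 0   # largest reported "agent queue size"
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        message = match.group(2)
        if "agent queue is full, discarding request" in message:
            discarded += 1
        elif "agent queue filling" in message:
            filling += 1
        size = QUEUE_RE.search(message)
        if size:
            max_queue = max(max_queue, int(size.group(1)))
    return {"discarded": discarded, "filling": filling, "max_queue": max_queue}


summary = summarize(SAMPLE_LOG)
print(summary)
```

To run it against the real file, the same function could be given the contents of /var/log/slurm/slurmctld.log instead of the sample string.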


I tried restarting slurmdbd, but the problem remains the same.

This is our slurm.conf:


ClusterName=haytham
ControlMachine=haytham0
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/log/slurm/log_slurmctld
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
MpiParams=ports=13000-14000
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
CacheGroups=0
ReturnToService=0
Prolog=/etc/slurm/slurm_prolog
Epilog=/etc/slurm/slurm_epilog
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
FastSchedule=1
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log.%h
JobCompType=jobcomp/none
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=limits
AccountingStorageHost=haytham0
AccountingStorageLoc=slurm_acct_db
NodeName=haytham[10] Procs=12 RealMemory=48266 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
NodeName=haytham[11-42] Procs=12 RealMemory=24026 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
PartitionName=prod Nodes=haytham[11-42] Default=YES MaxTime=INFINITE State=UP
PartitionName=visu Nodes=haytham[10] MaxTime=INFINITE State=UP
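
As a quick sanity check of the node definitions above against the log, the numbers can be cross-checked with a few lines of Python. This is a rough sketch: `expand_nodes` is a hypothetical helper that handles only the simple `name[a-b]` and `name[a]` forms used in this file, not Slurm's full hostlist syntax.

```python
import re


def expand_nodes(expr):
    """Expand 'name[a-b]' or 'name[a]' into a list of host names.

    Hypothetical, simplified helper: comma lists and nested ranges
    are not handled.
    """
    match = re.fullmatch(r"([A-Za-z_]+)\[(\d+)(?:-(\d+))?\]", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix = match.group(1)
    first = int(match.group(2))
    last = int(match.group(3)) if match.group(3) else first
    return [f"{prefix}{i}" for i in range(first, last + 1)]


# Values taken from the NodeName lines in the slurm.conf above.
SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE, PROCS = 2, 6, 1, 12
cpus_per_node = SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE

prod_nodes = expand_nodes("haytham[11-42]")  # partition "prod"
visu_nodes = expand_nodes("haytham[10]")     # partition "visu"
job_nodes = expand_nodes("haytham[25-32]")   # NodeList of JobId=2258 in the log

print(len(prod_nodes))                   # nodes in prod
print(cpus_per_node == PROCS)            # topology agrees with Procs=12
print(len(job_nodes) * cpus_per_node)    # CPUs allocated to job 2258
print(set(job_nodes) <= set(prod_nodes)) # job ran inside prod
```

With these values, prod expands to 32 nodes, and the 8 nodes given to job 2258 account for 96 CPUs, which matches the `#CPUs=96` reported in the log.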
