On 09/02/2011 03:30 PM, Andy Riebs wrote:
Sam,
What version of slurm are you running?
It will help considerably if you forward a copy of your slurm.conf
file so that people can see what options you are using.
This is version 2.3.0-0.rc1. The slurm.conf is:
AuthType=auth/munge
ControlMachine=test01
NodeName=test[02-10] Sockets=2 CoresPerSocket=4 ThreadsPerCore=1
MailProg=/usr/bin/mail
PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
PartitionName=fast Nodes=test[02-07] MaxTime=60 Default=YES State=UP
PartitionName=batch Nodes=test[04-08,10] MaxTime=1440 State=UP
#PartitionName=md Nodes=test[02-03] MaxTime=1440 State=UP
SchedulerPort=7004
SlurmctldPort=7002
SlurmdPort=7003
SlurmdLogFile=/opt/slurm-2.3.0-0.rc1/log/slurmd.log
SlurmctldLogFile=/opt/slurm-2.3.0-0.rc1/log/slurmctld.log
SlurmSchedLogLevel=2
SlurmSchedLogFile=/opt/slurm-2.3.0-0.rc1/log/slurmsched.log
ClusterName=nfs
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=test01
AccountingStoragePort=7005
#AccountingStorageType=accounting_storage/filetxt
#AccountingStorageLoc=/opt/slurm-2.3.0-0.rc1/acct
#AccountingStorageType=accounting_storage/mysql
#AccountingStorageHost=test01
#AccountingStoragePort=3306
#AccountingStorageUser=root
#AccountingStoragePass=slurm123
#AccountingStorageLoc=slurm_acct_db
SelectType=select/cons_res
StateSaveLocation=/opt/slurm/tmp
MaxJobCount=10000
# Jack up timeouts since grid is slow
MessageTimeout=60
BatchStartTimeout=60
# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor
# 1 week half-life
PriorityDecayHalfLife=7-0
# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO
# The job's age factor reaches 1.0 after waiting in the
# queue for 1 day.
PriorityMaxAge=1-0
# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
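
As a rough sketch of how these weights combine: the multifactor plugin's documented formula is a weighted sum, priority = sum(weight_i * factor_i), where each factor is normalized to [0.0, 1.0] (age saturating at PriorityMaxAge, job size scaled per PriorityFavorSmall, and so on). The factor values in the example below are made up purely to show the arithmetic with the weights configured above:

```python
# Sketch of SLURM's multifactor priority formula:
#   priority = sum(weight_i * factor_i), each factor in [0.0, 1.0].
# The weights mirror the slurm.conf above; the factor values passed in
# are hypothetical, just to illustrate the arithmetic.

WEIGHTS = {
    "age": 1000,         # PriorityWeightAge
    "fairshare": 10000,  # PriorityWeightFairshare
    "jobsize": 1000,     # PriorityWeightJobSize
    "partition": 1000,   # PriorityWeightPartition
    "qos": 0,            # PriorityWeightQOS (disabled above)
}

def job_priority(factors):
    """Combine normalized factors (0.0-1.0) with the configured weights."""
    return int(sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS))

# A job that has waited a full day (age factor saturated at 1.0), has
# used half its fair share, and is mid-sized:
example = {"age": 1.0, "fairshare": 0.5, "jobsize": 0.4,
           "partition": 1.0, "qos": 0.0}
print(job_priority(example))  # 1000 + 5000 + 400 + 1000 + 0 = 7400
```

With PriorityWeightFairshare an order of magnitude above the others, fair share dominates ordering and the remaining factors mostly break ties.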
Andy
On 09/02/2011 04:17 PM, Sam Lang wrote:
Hello,
I have a case where a step of a completed job is still reported as
RUNNING by sacct:
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
124794          0.batch       fast                     1  COMPLETED      0:0
124794.batch      batch                                1  COMPLETED      0:0
124794.0           bash                                1    RUNNING      0:0
On the node where the job ran, there are no processes still running
for that job, and squeue does not show the job. Any ideas what
happened here, or how I can set the state to completed?
Thanks,
-sam