On 09/02/2011 03:30 PM, Andy Riebs wrote:
Sam,

What version of slurm are you running?

It will help considerably if you forward a copy of your slurm.conf file so that people can see what options you are using.

This is version 2.3.0-0.rc1. The slurm.conf is:

AuthType=auth/munge
ControlMachine=test01
NodeName=test[02-10] Sockets=2 CoresPerSocket=4 ThreadsPerCore=1
MailProg=/usr/bin/mail
PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
PartitionName=fast Nodes=test[02-07] MaxTime=60 Default=YES State=UP
PartitionName=batch Nodes=test[04-08,10] MaxTime=1440 State=UP
#PartitionName=md Nodes=test[02-03] MaxTime=1440 State=UP
SchedulerPort=7004
SlurmctldPort=7002
SlurmdPort=7003
SlurmdLogFile=/opt/slurm-2.3.0-0.rc1/log/slurmd.log
SlurmctldLogFile=/opt/slurm-2.3.0-0.rc1/log/slurmctld.log
SlurmSchedLogLevel=2
SlurmSchedLogFile=/opt/slurm-2.3.0-0.rc1/log/slurmsched.log
ClusterName=nfs

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=test01
AccountingStoragePort=7005

#AccountingStorageType=accounting_storage/filetxt
#AccountingStorageLoc=/opt/slurm-2.3.0-0.rc1/acct

#AccountingStorageType=accounting_storage/mysql
#AccountingStorageHost=test01
#AccountingStoragePort=3306
#AccountingStorageUser=root
#AccountingStoragePass=slurm123
#AccountingStorageLoc=slurm_acct_db

SelectType=select/cons_res
StateSaveLocation=/opt/slurm/tmp
MaxJobCount=10000

# Jack up timeouts since grid is slow
MessageTimeout=60
BatchStartTimeout=60

# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor

# 1 week half-life
PriorityDecayHalfLife=7-0
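(For anyone following along: the decay documented for this plugin is a simple exponential, so with a 7-day half-life a usage record from a week ago counts half. A rough sketch of that arithmetic, not Slurm source code:)

```python
# Sketch of the documented PriorityDecayHalfLife behavior (not Slurm source).
# With PriorityDecayHalfLife=7-0, usage accrued t days ago is weighted by
# 0.5 ** (t / 7): 0 days -> 1.0, 7 days -> 0.5, 14 days -> 0.25, etc.
def decay_factor(age_days: float, half_life_days: float = 7.0) -> float:
    return 0.5 ** (age_days / half_life_days)
```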

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the
# queue for 1 day.
PriorityMaxAge=1-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
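(Per the multifactor plugin documentation, each factor is normalized to the range 0.0-1.0 and multiplied by its weight; with the weights above, fairshare dominates by a factor of ten and QOS is ignored. A rough sketch of the weighted sum, not Slurm source code:)

```python
# Sketch of the documented multifactor priority sum (not Slurm source),
# using the weights from the slurm.conf above. Each factor is assumed
# already normalized to 0.0-1.0 by the plugin.
def age_factor(days_waiting: float) -> float:
    # PriorityMaxAge=1-0: the age factor saturates at 1.0 after one day.
    return min(days_waiting / 1.0, 1.0)

def job_priority(age_f, fairshare_f, jobsize_f, partition_f, qos_f):
    return (1000  * age_f        # PriorityWeightAge
          + 10000 * fairshare_f  # PriorityWeightFairshare
          + 1000  * jobsize_f    # PriorityWeightJobSize
          + 1000  * partition_f  # PriorityWeightPartition
          + 0     * qos_f)       # PriorityWeightQOS (disabled above)
```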



Andy

On 09/02/2011 04:17 PM, Sam Lang wrote:
Hello,

I have a case where a completed job is reported as still running by
sacct:


       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
124794          0.batch       fast                     1  COMPLETED      0:0
124794.batch      batch                                1  COMPLETED      0:0
124794.0           bash                                1    RUNNING      0:0

On the node where the job ran, there are no processes still running
for that job, and squeue does not show the job.  Any ideas what
happened here, or how I can set the state to completed?

Thanks,
-sam
