Can you try this with 2.3?

On 11/28/11 07:29, Ralph Bean wrote:
Hello all,

I have set up two test QOSes, lopri and hipri, the only difference being
that hipri can preempt lopri. One user submits enough jobs to max out the
cluster using the lopri QOS; the other user then submits one hipri job.
I expect the hipri job to preempt one of the lopri jobs and begin running.
Instead, slurmctld segfaults.

I am running the following version of slurm:

    $ squeue --version
    slurm 2.4.0-pre1

Below are:
1) A gdb backtrace of slurmctld
2) Commands used to set up my slurm database
3) /etc/slurm/slurm.conf

Any suggestions here would be much appreciated.

-Ralph Bean
Rochester Institute of Technology

------------- GDB backtrace ----------------
slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=6823
slurmctld: debug2: initial priority for job 1452 is 253
slurmctld: debug2: found 1 usable nodes from config containing einstein
slurmctld: bitstring.c:175: bit_test: Assertion `(bit) < ((b)[1])' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x42524940 (LWP 25815)]
0x00002aaaab118265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab118265 in raise () from /lib64/libc.so.6
#1  0x00002aaaab119d10 in abort () from /lib64/libc.so.6
#2  0x00002aaaab1116e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000048d484 in bit_test (b=<value optimized out>, bit=<value optimized out>) at bitstring.c:175
#4  0x00002aaaabe72aed in _qos_preemptable (preemptee=<value optimized out>, preemptor=<value optimized out>) at preempt_qos.c:145
#5  0x00002aaaabe72c38 in find_preemptable_jobs (job_ptr=0x432b848) at preempt_qos.c:114
#6  0x000000000045066a in _get_req_features (node_set_ptr=0x433d0c8, node_set_size=1, select_bitmap=0x42523750, job_ptr=0x432b848, part_ptr=0x412dc78, min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_job_list=0x42523740) at node_scheduler.c:474
#7  0x00000000004527b7 in select_nodes (job_ptr=0x432b848, test_only=true, select_node_bitmap=0x0) at node_scheduler.c:1266
#8  0x000000000043b8ef in
job_allocate (job_specs=0x432b848, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=0, job_pptr=0x42523be8) at job_mgr.c:2852
#9  0x000000000045bd63 in _slurm_rpc_submit_batch_job (msg=0x4334e58) at proc_req.c:2610
#10 0x000000000045ef67 in slurmctld_req (msg=0x4334e58) at proc_req.c:288
#11 0x000000000042d70b in _service_connection (arg=0x4164268) at controller.c:1020
#12 0x00002aaaaaed373d in start_thread () from /lib64/libpthread.so.0
#13 0x00002aaaab1bc4bd in clone () from /lib64/libc.so.6
(gdb)

------------- Database setup commands ----------------
sacctmgr -i add cluster Cluster=tropos
sacctmgr -i add QOS name=lopri
sacctmgr -i add QOS name=hipri preempt=lopri
sacctmgr -i add account rxbrc
sacctmgr -i add user rxbrc Cluster=tropos Partition=normal Account=rxbrc QOS=hipri
sacctmgr -i add account rjbpop
sacctmgr -i add user rjbpop Cluster=tropos Partition=normal Account=rjbpop QOS=lopri

------------- /etc/slurm/slurm.conf ----------------
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=tropos
#ControlAddr=
BackupController=stratos
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=YES
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.%h.%n.pid
#SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd.%h.%n.pid
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UnkillableStepTimeout=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=2
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=300
SlurmdTimeout=300
#UnkillableStepProgram=
#UnkillableStepTimeout=60
Waittime=0
#
#
# Preemption
PreemptType=preempt/qos
PreemptMode=REQUEUE
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=0
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
PriorityType=priority/multifactor
# 5 minute half-life
PriorityDecayHalfLife=00:05:00
# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO
# The job's age factor reaches 1.0 after waiting in the
# queue for 1 day
PriorityMaxAge=1-0
# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=100
PriorityWeightFairshare=100
PriorityWeightJobSize=100
PriorityWeightPartition=100
PriorityWeightQOS=1000
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations,qos
AccountingStorageHost=localhost
AccountingStoragePort=7031
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=tropos
#DebugFlags=Gang,Priority,SelectType,Triggers
#JobCompHost=
JobCompLoc=/shared/slurm/admin/jobs.txt
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=9
#SlurmctldLogFile=
SlurmdDebug=9
#SlurmdLogFile=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=DEFAULT Procs=32 Sockets=8 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=30000
NodeName=einstein NodeHostName=einstein
#NodeName=pauli NodeHostName=pauli
#PartitionName=batch Nodes=b[1-8],c[1-8],planck,pauli MaxTime=1440
PartitionName=DEFAULT Nodes=einstein
PartitionName=normal Default=Yes
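For what it's worth, the assertion that fires -- `(bit) < ((b)[1])` at bitstring.c:175 -- is a bounds check: the bitstring stores its allocated size in one of its header words, and `_qos_preemptable` appears to be testing a QOS id that lies past the end of the preempt bitmap. Below is a minimal standalone sketch of that check; the layout and names are illustrative only, not Slurm's actual bitstring.c:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy miniature of a header-carrying bitstring (illustrative, not
 * Slurm's real implementation): word 0 is reserved for a magic tag,
 * word 1 holds the allocated size in bits, the bits start at word 2. */
typedef uint64_t bitstr_t;
#define BITSTR_WORD_BITS 64
#define BIT_SIZE(b) ((b)[1])

bitstr_t *bit_alloc(uint64_t nbits)
{
    size_t words = 2 + (nbits + BITSTR_WORD_BITS - 1) / BITSTR_WORD_BITS;
    bitstr_t *b = calloc(words, sizeof(bitstr_t));
    assert(b != NULL);
    b[1] = nbits;               /* record the allocated size */
    return b;
}

void bit_set(bitstr_t *b, uint64_t bit)
{
    assert(bit < BIT_SIZE(b));  /* bounds check before touching bits */
    b[2 + bit / BITSTR_WORD_BITS] |= (bitstr_t)1 << (bit % BITSTR_WORD_BITS);
}

int bit_test(const bitstr_t *b, uint64_t bit)
{
    /* The analogue of the failing assertion: asking for a bit at or
     * past the allocated size aborts the process with SIGABRT. */
    assert(bit < BIT_SIZE(b));
    return (int)((b[2 + bit / BITSTR_WORD_BITS] >> (bit % BITSTR_WORD_BITS)) & 1);
}
```

In a sketch like this, testing any bit >= the recorded size -- for example a QOS id assigned after the preempt bitmap was sized -- trips the assert and aborts the daemon, which is consistent with the SIGABRT in the backtrace above.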
