It worked with 2.3.1 as expected. Thanks much!
Excerpts from Danny Auble's message of Mon Nov 28 11:16:01 -0500 2011:
> Can you try this with 2.3?
>
> On 11/28/11 07:29, Ralph Bean wrote:
> > Hello all,
> >
> > I have set up two test QOSes, lopri and hipri, the only difference
> > being that hipri can preempt lopri. One user will submit enough jobs
> > to max out the cluster using the lopri QOS. The other user will submit
> > one hipri job. I expect it to preempt one of the lopri jobs and begin
> > running. Instead, slurmctld segfaults.
> >
> > I am running the following version of slurm:
> >
> > $ squeue --version
> > slurm 2.4.0-pre1
> >
> > Below are:
> > 1) A gdb backtrace of slurmctld
> > 2) Commands used to set up my slurm database.
> > 3) /etc/slurm/slurm.conf
> >
> > Any suggestions here would be much appreciated.
> >
> > -Ralph Bean
> > Rochester Institute of Technology
> >
> > ------------- GDB backtrace ----------------
> > slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=6823
> > slurmctld: debug2: initial priority for job 1452 is 253
> > slurmctld: debug2: found 1 usable nodes from config containing einstein
> > slurmctld: bitstring.c:175: bit_test: Assertion `(bit)< ((b)[1])' failed.
> >
> > Program received signal SIGABRT, Aborted.
> > [Switching to Thread 0x42524940 (LWP 25815)]
> > 0x00002aaaab118265 in raise () from /lib64/libc.so.6
> > (gdb) bt
> > #0  0x00002aaaab118265 in raise () from /lib64/libc.so.6
> > #1  0x00002aaaab119d10 in abort () from /lib64/libc.so.6
> > #2  0x00002aaaab1116e6 in __assert_fail () from /lib64/libc.so.6
> > #3  0x000000000048d484 in bit_test (b=<value optimized out>,
> >     bit=<value optimized out>) at bitstring.c:175
> > #4  0x00002aaaabe72aed in _qos_preemptable (preemptee=<value optimized out>,
> >     preemptor=<value optimized out>) at preempt_qos.c:145
> > #5  0x00002aaaabe72c38 in find_preemptable_jobs (job_ptr=0x432b848)
> >     at preempt_qos.c:114
> > #6  0x000000000045066a in _get_req_features (node_set_ptr=0x433d0c8,
> >     node_set_size=1, select_bitmap=0x42523750, job_ptr=0x432b848,
> >     part_ptr=0x412dc78, min_nodes=1, max_nodes=500000, req_nodes=1,
> >     test_only=true, preemptee_job_list=0x42523740) at node_scheduler.c:474
> > #7  0x00000000004527b7 in select_nodes (job_ptr=0x432b848, test_only=true,
> >     select_node_bitmap=0x0) at node_scheduler.c:1266
> > #8  0x000000000043b8ef in job_allocate (job_specs=0x432b848, immediate=0,
> >     will_run=0, resp=0x0, allocate=0, submit_uid=0, job_pptr=0x42523be8)
> >     at job_mgr.c:2852
> > #9  0x000000000045bd63 in _slurm_rpc_submit_batch_job (msg=0x4334e58)
> >     at proc_req.c:2610
> > #10 0x000000000045ef67 in slurmctld_req (msg=0x4334e58) at proc_req.c:288
> > #11 0x000000000042d70b in _service_connection (arg=0x4164268)
> >     at controller.c:1020
> > #12 0x00002aaaaaed373d in start_thread () from /lib64/libpthread.so.0
> > #13 0x00002aaaab1bc4bd in clone () from /lib64/libc.so.6
> > (gdb)
> >
> > ------------- Database setup commands ----------------
> >
> > sacctmgr -i add cluster Cluster=tropos
> >
> > sacctmgr -i add QOS name=lopri
> > sacctmgr -i add QOS name=hipri preempt=lopri
> >
> > sacctmgr -i add account rxbrc
> > sacctmgr -i add user rxbrc Cluster=tropos Partition=normal Account=rxbrc QOS=hipri
> >
> > sacctmgr -i add account rjbpop
> > sacctmgr -i add user rjbpop Cluster=tropos Partition=normal Account=rjbpop QOS=lopri
> >
> > ------------- /etc/slurm/slurm.conf ----------------
> >
> > # slurm.conf file generated by configurator.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > ControlMachine=tropos
> > #ControlAddr=
> > BackupController=stratos
> > #BackupAddr=
> > #
> > AuthType=auth/munge
> > CacheGroups=0
> > #CheckpointType=checkpoint/none
> > CryptoType=crypto/munge
> > DisableRootJobs=YES
> > EnforcePartLimits=YES
> > #Epilog=
> > #PrologSlurmctld=
> > #FirstJobId=1
> > #JobCheckpointDir=/var/slurm/checkpoint
> > #JobCredentialPrivateKey=
> > #JobCredentialPublicCertificate=
> > #JobFileAppend=0
> > #JobRequeue=1
> > #KillOnBadExit=0
> > #Licenses=foo*4,bar
> > #MailProg=/bin/mail
> > #MaxJobCount=5000
> > #MaxTasksPerNode=128
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > #PluginDir=
> > #PlugStackConfig=
> > #PrivateData=jobs
> > ProctrackType=proctrack/linuxproc
> > #Prolog=
> > #PrologSlurmctld=
> > #PropagatePrioProcess=0
> > #PropagateResourceLimits=
> > #PropagateResourceLimitsExcept=
> > ReturnToService=1
> > #SallocDefaultCommand=
> > SlurmctldPidFile=/var/run/slurmctld.pid
> > SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurmd.%h.%n.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/tmp/slurmd.%h.%n.pid
> > SlurmUser=slurm
> > #SrunEpilog=
> > #SrunProlog=
> > StateSaveLocation=/tmp
> > SwitchType=switch/none
> > #TaskEpilog=
> > TaskPlugin=task/none
> > #TaskPluginParam=Sched
> > #TaskProlog=
> > #TopologyPlugin=topology/tree
> > #TmpFs=/tmp
> > #TrackWCKey=no
> > #TreeWidth=
> > #UnkillableStepProgram=
> > #UnkillableStepTimeout=
> > #UsePAM=0
> > #
> > #
> > # TIMERS
> > #BatchStartTimeout=10
> > #CompleteWait=0
> > #EpilogMsgTime=2000
> > #GetEnvTimeout=2
> > #HealthCheckInterval=0
> > #HealthCheckProgram=
> > InactiveLimit=0
> > KillWait=2
> > #MessageTimeout=10
> > #ResvOverRun=0
> > MinJobAge=300
> > #OverTimeLimit=0
> > SlurmctldTimeout=300
> > SlurmdTimeout=300
> > #UnkillableStepProgram=
> > #UnkillableStepTimeout=60
> > Waittime=0
> > #
> > #
> >
> > # Preemption
> > PreemptType=preempt/qos
> > PreemptMode=REQUEUE
> >
> > # SCHEDULING
> > #DefMemPerCPU=0
> > FastSchedule=0
> > #MaxMemPerCPU=0
> > #SchedulerRootFilter=1
> > #SchedulerTimeSlice=30
> > SchedulerType=sched/backfill
> > SchedulerPort=7321
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Core_Memory
> >
> > #
> > #
> > # JOB PRIORITY
> > PriorityType=priority/multifactor
> >
> > # 5 minute half-life
> > PriorityDecayHalfLife=00:05:00
> >
> > # The larger the job, the greater its job size priority.
> > PriorityFavorSmall=NO
> >
> > # The job's age factor reaches 1.0 after waiting in the
> > # queue for 1 day
> > PriorityMaxAge=1-0
> >
> > # This next group determines the weighting of each of the
> > # components of the Multi-factor Job Priority Plugin.
> > # The default value for each of the following is 1.
> > PriorityWeightAge=100
> > PriorityWeightFairshare=100
> > PriorityWeightJobSize=100
> > PriorityWeightPartition=100
> > PriorityWeightQOS=1000
> > #PriorityDecayHalfLife=
> > #PriorityCalcPeriod=
> > #PriorityFavorSmall=
> > #PriorityMaxAge=
> > #PriorityUsageResetPeriod=
> > #PriorityWeightAge=
> > #PriorityWeightFairshare=
> > #PriorityWeightJobSize=
> > #PriorityWeightPartition=
> > #PriorityWeightQOS=
> > #
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageEnforce=associations,qos
> > AccountingStorageHost=localhost
> > AccountingStoragePort=7031
> > AccountingStorageType=accounting_storage/slurmdbd
> >
> > ClusterName=tropos
> > #DebugFlags=Gang,Priority,SelectType,Triggers
> > #JobCompHost=
> > JobCompLoc=/shared/slurm/admin/jobs.txt
> > #JobCompPass=
> > #JobCompPort=
> > JobCompType=jobcomp/filetxt
> > #JobCompUser=
> > JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/linux
> > SlurmctldDebug=9
> > #SlurmctldLogFile=
> > SlurmdDebug=9
> > #SlurmdLogFile=
> > #
> > #
> > # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> > #SuspendProgram=
> > #ResumeProgram=
> > #SuspendTimeout=
> > #ResumeTimeout=
> > #ResumeRate=
> > #SuspendExcNodes=
> > #SuspendExcParts=
> > #SuspendRate=
> > #SuspendTime=
> > #
> > #
> > # COMPUTE NODES
> >
> > NodeName=DEFAULT Procs=32 Sockets=8 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=30000
> > NodeName=einstein NodeHostName=einstein
> > #NodeName=pauli NodeHostName=pauli
> >
> > #PartitionName=batch Nodes=b[1-8],c[1-8],planck,pauli MaxTime=1440
> > PartitionName=DEFAULT Nodes=einstein
> > PartitionName=normal Default=Yes
