It worked with 2.3.1 as expected. Thanks much!
Excerpts from Danny Auble's message of Mon Nov 28 11:16:01 -0500 2011:
> Can you try this with 2.3?
>
> On 11/28/11 07:29, Ralph Bean wrote:
> > Hello all,
> >
> > I have set up two test QOSes, lopri and hipri, the only difference
> > being that hipri can preempt lopri. One user will submit enough jobs
> > to max out the cluster using the lopri QOS. The other user will submit
> > one hipri job. I expect it to preempt one of the lopri jobs and begin
> > running. Instead, slurmctld segfaults.
> >
> > I am running the following version of slurm:
> >
> > $ squeue --version
> > slurm 2.4.0-pre1
> >
> > Below are:
> > 1) A gdb backtrace of slurmctld
> > 2) Commands used to set up my slurm database.
> > 3) /etc/slurm/slurm.conf
> >
> > Any suggestions here would be much appreciated.
> >
> > -Ralph Bean
> > Rochester Institute of Technology
> >
> > ------------- GDB backtrace ----------------
> > slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=6823
> > slurmctld: debug2: initial priority for job 1452 is 253
> > slurmctld: debug2: found 1 usable nodes from config containing einstein
> > slurmctld: bitstring.c:175: bit_test: Assertion `(bit)< ((b)[1])' failed.
> >
> > Program received signal SIGABRT, Aborted.
> > [Switching to Thread 0x42524940 (LWP 25815)]
> > 0x00002aaaab118265 in raise () from /lib64/libc.so.6
> > (gdb) bt
> > #0  0x00002aaaab118265 in raise () from /lib64/libc.so.6
> > #1  0x00002aaaab119d10 in abort () from /lib64/libc.so.6
> > #2  0x00002aaaab1116e6 in __assert_fail () from /lib64/libc.so.6
> > #3  0x000000000048d484 in bit_test (b=<value optimized out>,
> >     bit=<value optimized out>) at bitstring.c:175
> > #4  0x00002aaaabe72aed in _qos_preemptable (preemptee=<value optimized out>,
> >     preemptor=<value optimized out>) at preempt_qos.c:145
> > #5  0x00002aaaabe72c38 in find_preemptable_jobs (job_ptr=0x432b848)
> >     at preempt_qos.c:114
> > #6  0x000000000045066a in _get_req_features (node_set_ptr=0x433d0c8,
> >     node_set_size=1, select_bitmap=0x42523750, job_ptr=0x432b848,
> >     part_ptr=0x412dc78, min_nodes=1, max_nodes=500000, req_nodes=1,
> >     test_only=true, preemptee_job_list=0x42523740) at node_scheduler.c:474
> > #7  0x00000000004527b7 in select_nodes (job_ptr=0x432b848, test_only=true,
> >     select_node_bitmap=0x0) at node_scheduler.c:1266
> > #8  0x000000000043b8ef in job_allocate (job_specs=0x432b848, immediate=0,
> >     will_run=0, resp=0x0, allocate=0, submit_uid=0, job_pptr=0x42523be8)
> >     at job_mgr.c:2852
> > #9  0x000000000045bd63 in _slurm_rpc_submit_batch_job (msg=0x4334e58)
> >     at proc_req.c:2610
> > #10 0x000000000045ef67 in slurmctld_req (msg=0x4334e58) at proc_req.c:288
> > #11 0x000000000042d70b in _service_connection (arg=0x4164268)
> >     at controller.c:1020
> > #12 0x00002aaaaaed373d in start_thread () from /lib64/libpthread.so.0
> > #13 0x00002aaaab1bc4bd in clone () from /lib64/libc.so.6
> > (gdb)
> >
> > ------------- Database setup commands ----------------
> >
> > sacctmgr -i add cluster Cluster=tropos
> >
> > sacctmgr -i add QOS name=lopri
> > sacctmgr -i add QOS name=hipri preempt=lopri
> >
> > sacctmgr -i add account rxbrc
> > sacctmgr -i add user rxbrc Cluster=tropos Partition=normal Account=rxbrc QOS=hipri
> >
> > sacctmgr -i add account rjbpop
> > sacctmgr -i add user rjbpop Cluster=tropos Partition=normal Account=rjbpop QOS=lopri
> >
> > ------------- /etc/slurm/slurm.conf ----------------
> >
> > # slurm.conf file generated by configurator.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > ControlMachine=tropos
> > #ControlAddr=
> > BackupController=stratos
> > #BackupAddr=
> > #
> > AuthType=auth/munge
> > CacheGroups=0
> > #CheckpointType=checkpoint/none
> > CryptoType=crypto/munge
> > DisableRootJobs=YES
> > EnforcePartLimits=YES
> > #Epilog=
> > #PrologSlurmctld=
> > #FirstJobId=1
> > #JobCheckpointDir=/var/slurm/checkpoint
> > #JobCredentialPrivateKey=
> > #JobCredentialPublicCertificate=
> > #JobFileAppend=0
> > #JobRequeue=1
> > #KillOnBadExit=0
> > #Licenses=foo*4,bar
> > #MailProg=/bin/mail
> > #MaxJobCount=5000
> > #MaxTasksPerNode=128
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > #PluginDir=
> > #PlugStackConfig=
> > #PrivateData=jobs
> > ProctrackType=proctrack/linuxproc
> > #Prolog=
> > #PrologSlurmctld=
> > #PropagatePrioProcess=0
> > #PropagateResourceLimits=
> > #PropagateResourceLimitsExcept=
> > ReturnToService=1
> > #SallocDefaultCommand=
> > SlurmctldPidFile=/var/run/slurmctld.pid
> > SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurmd.%h.%n.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/tmp/slurmd.%h.%n.pid
> > SlurmUser=slurm
> > #SrunEpilog=
> > #SrunProlog=
> > StateSaveLocation=/tmp
> > SwitchType=switch/none
> > #TaskEpilog=
> > TaskPlugin=task/none
> > #TaskPluginParam=Sched
> > #TaskProlog=
> > #TopologyPlugin=topology/tree
> > #TmpFs=/tmp
> > #TrackWCKey=no
> > #TreeWidth=
> > #UnkillableStepProgram=
> > #UnkillableStepTimeout=
> > #UsePAM=0
> > #
> > #
> > # TIMERS
> > #BatchStartTimeout=10
> > #CompleteWait=0
> > #EpilogMsgTime=2000
> > #GetEnvTimeout=2
> > #HealthCheckInterval=0
> > #HealthCheckProgram=
> > InactiveLimit=0
> > KillWait=2
> > #MessageTimeout=10
> > #ResvOverRun=0
> > MinJobAge=300
> > #OverTimeLimit=0
> > SlurmctldTimeout=300
> > SlurmdTimeout=300
> > #UnkillableStepProgram=
> > #UnkillableStepTimeout=60
> > Waittime=0
> > #
> > #
> >
> > # Preemption
> > PreemptType=preempt/qos
> > PreemptMode=REQUEUE
> >
> > # SCHEDULING
> > #DefMemPerCPU=0
> > FastSchedule=0
> > #MaxMemPerCPU=0
> > #SchedulerRootFilter=1
> > #SchedulerTimeSlice=30
> > SchedulerType=sched/backfill
> > SchedulerPort=7321
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Core_Memory
> >
> > #
> > #
> > # JOB PRIORITY
> > PriorityType=priority/multifactor
> >
> > # 5 minute half-life
> > PriorityDecayHalfLife=00:05:00
> >
> > # The larger the job, the greater its job size priority.
> > PriorityFavorSmall=NO
> >
> > # The job's age factor reaches 1.0 after waiting in the
> > # queue for 1 day
> > PriorityMaxAge=1-0
> >
> > # This next group determines the weighting of each of the
> > # components of the Multi-factor Job Priority Plugin.
> > # The default value for each of the following is 1.
> > PriorityWeightAge=100
> > PriorityWeightFairshare=100
> > PriorityWeightJobSize=100
> > PriorityWeightPartition=100
> > PriorityWeightQOS=1000
> > #PriorityDecayHalfLife=
> > #PriorityCalcPeriod=
> > #PriorityFavorSmall=
> > #PriorityMaxAge=
> > #PriorityUsageResetPeriod=
> > #PriorityWeightAge=
> > #PriorityWeightFairshare=
> > #PriorityWeightJobSize=
> > #PriorityWeightPartition=
> > #PriorityWeightQOS=
> > #
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageEnforce=associations,qos
> > AccountingStorageHost=localhost
> > AccountingStoragePort=7031
> > AccountingStorageType=accounting_storage/slurmdbd
> >
> > ClusterName=tropos
> > #DebugFlags=Gang,Priority,SelectType,Triggers
> > #JobCompHost=
> > JobCompLoc=/shared/slurm/admin/jobs.txt
> > #JobCompPass=
> > #JobCompPort=
> > JobCompType=jobcomp/filetxt
> > #JobCompUser=
> > JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/linux
> > SlurmctldDebug=9
> > #SlurmctldLogFile=
> > SlurmdDebug=9
> > #SlurmdLogFile=
> > #
> > #
> > # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> > #SuspendProgram=
> > #ResumeProgram=
> > #SuspendTimeout=
> > #ResumeTimeout=
> > #ResumeRate=
> > #SuspendExcNodes=
> > #SuspendExcParts=
> > #SuspendRate=
> > #SuspendTime=
> > #
> > #
> > # COMPUTE NODES
> >
> > NodeName=DEFAULT Procs=32 Sockets=8 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=30000
> > NodeName=einstein NodeHostName=einstein
> > #NodeName=pauli NodeHostName=pauli
> >
> > #PartitionName=batch Nodes=b[1-8],c[1-8],planck,pauli MaxTime=1440
> > PartitionName=DEFAULT Nodes=einstein
> > PartitionName=normal Default=Yes
