Can you try this with 2.3?

On 11/28/11 07:29, Ralph Bean wrote:
Hello all,

I have set up two test QOSes, lopri and hipri, the only difference being
that hipri can preempt lopri. One user submits enough jobs to max out the
cluster using the lopri QOS; the other user then submits one hipri job.
I expect the hipri job to preempt one of the lopri jobs and begin running.
Instead, slurmctld segfaults.

I am running the following version of slurm:

    $ squeue --version
    slurm 2.4.0-pre1

Below are:
1) A gdb backtrace of slurmctld
2) Commands used to set up my slurm database
3) /etc/slurm/slurm.conf

Any suggestions here would be much appreciated.

-Ralph Bean
Rochester Institute of Technology

------------- GDB backtrace ----------------
slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=6823
slurmctld: debug2: initial priority for job 1452 is 253
slurmctld: debug2: found 1 usable nodes from config containing einstein
slurmctld: bitstring.c:175: bit_test: Assertion `(bit) < ((b)[1])' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x42524940 (LWP 25815)]
0x00002aaaab118265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab118265 in raise () from /lib64/libc.so.6
#1  0x00002aaaab119d10 in abort () from /lib64/libc.so.6
#2  0x00002aaaab1116e6 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000048d484 in bit_test (b=<value optimized out>, bit=<value optimized out>) at bitstring.c:175
#4  0x00002aaaabe72aed in _qos_preemptable (preemptee=<value optimized out>, preemptor=<value optimized out>) at preempt_qos.c:145
#5  0x00002aaaabe72c38 in find_preemptable_jobs (job_ptr=0x432b848) at preempt_qos.c:114
#6  0x000000000045066a in _get_req_features (node_set_ptr=0x433d0c8, node_set_size=1, select_bitmap=0x42523750, job_ptr=0x432b848, part_ptr=0x412dc78, min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_job_list=0x42523740) at node_scheduler.c:474
#7  0x00000000004527b7 in select_nodes (job_ptr=0x432b848, test_only=true, select_node_bitmap=0x0) at node_scheduler.c:1266
#8  0x000000000043b8ef in
job_allocate (job_specs=0x432b848, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=0, job_pptr=0x42523be8) at job_mgr.c:2852
#9  0x000000000045bd63 in _slurm_rpc_submit_batch_job (msg=0x4334e58) at proc_req.c:2610
#10 0x000000000045ef67 in slurmctld_req (msg=0x4334e58) at proc_req.c:288
#11 0x000000000042d70b in _service_connection (arg=0x4164268) at controller.c:1020
#12 0x00002aaaaaed373d in start_thread () from /lib64/libpthread.so.0
#13 0x00002aaaab1bc4bd in clone () from /lib64/libc.so.6
(gdb)

------------- Database setup commands ----------------
sacctmgr -i add cluster Cluster=tropos
sacctmgr -i add QOS name=lopri
sacctmgr -i add QOS name=hipri preempt=lopri
sacctmgr -i add account rxbrc
sacctmgr -i add user rxbrc Cluster=tropos Partition=normal Account=rxbrc QOS=hipri
sacctmgr -i add account rjbpop
sacctmgr -i add user rjbpop Cluster=tropos Partition=normal Account=rjbpop QOS=lopri

------------- /etc/slurm/slurm.conf ----------------
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=tropos
#ControlAddr=
BackupController=stratos
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=YES
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.%h.%n.pid
#SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd.%h.%n.pid
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UnkillableStepTimeout=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=2
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=300
SlurmdTimeout=300
#UnkillableStepProgram=
#UnkillableStepTimeout=60
Waittime=0
#
#
# Preemption
PreemptType=preempt/qos
PreemptMode=REQUEUE
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=0
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
PriorityType=priority/multifactor
# 5 minute half-life
PriorityDecayHalfLife=00:05:00
# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO
# The job's age factor reaches 1.0 after waiting in the
# queue for 1 day
PriorityMaxAge=1-0
# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=100
PriorityWeightFairshare=100
PriorityWeightJobSize=100
PriorityWeightPartition=100
PriorityWeightQOS=1000
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations,qos
AccountingStorageHost=localhost
AccountingStoragePort=7031
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=tropos
#DebugFlags=Gang,Priority,SelectType,Triggers
#JobCompHost=
JobCompLoc=/shared/slurm/admin/jobs.txt
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=9
#SlurmctldLogFile=
SlurmdDebug=9
#SlurmdLogFile=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=DEFAULT Procs=32 Sockets=8 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=30000
NodeName=einstein NodeHostName=einstein
#NodeName=pauli NodeHostName=pauli
#PartitionName=batch Nodes=b[1-8],c[1-8],planck,pauli MaxTime=1440
PartitionName=DEFAULT Nodes=einstein
PartitionName=normal Default=Yes
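For what it's worth, the assertion that fires -- `(bit) < ((b)[1])` at bitstring.c:175 -- is a bounds check: the bitstring stores its allocated size in one of its header words, and `_qos_preemptable` appears to be testing a QOS id that lies past the end of the preempt bitmap. Below is a minimal standalone sketch of that check; the layout and names are illustrative only, not Slurm's actual bitstring.c:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy miniature of a header-carrying bitstring (illustrative, not
 * Slurm's real implementation): word 0 is reserved for a magic tag,
 * word 1 holds the allocated size in bits, the bits start at word 2. */
typedef uint64_t bitstr_t;
#define BITSTR_WORD_BITS 64
#define BIT_SIZE(b) ((b)[1])

bitstr_t *bit_alloc(uint64_t nbits)
{
    size_t words = 2 + (nbits + BITSTR_WORD_BITS - 1) / BITSTR_WORD_BITS;
    bitstr_t *b = calloc(words, sizeof(bitstr_t));
    assert(b != NULL);
    b[1] = nbits;               /* record the allocated size */
    return b;
}

void bit_set(bitstr_t *b, uint64_t bit)
{
    assert(bit < BIT_SIZE(b));  /* bounds check before touching bits */
    b[2 + bit / BITSTR_WORD_BITS] |= (bitstr_t)1 << (bit % BITSTR_WORD_BITS);
}

int bit_test(const bitstr_t *b, uint64_t bit)
{
    /* The analogue of the failing assertion: asking for a bit at or
     * past the allocated size aborts the process with SIGABRT. */
    assert(bit < BIT_SIZE(b));
    return (int)((b[2 + bit / BITSTR_WORD_BITS] >> (bit % BITSTR_WORD_BITS)) & 1);
}
```

In a sketch like this, testing any bit >= the recorded size -- for example a QOS id assigned after the preempt bitmap was sized -- trips the assert and aborts the daemon, which is consistent with the SIGABRT in the backtrace above.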
