Brian, I've never run into that message with SLURM yet.
Have you tried releasing the jobs with scontrol, e.g. "scontrol release ID", where "ID" is the job number?

We do not automatically requeue jobs due to a bug (fixed!) which caused the controller to crash because of an empty task_id_bitmap.

John DeSantis

2016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor <[email protected]>:

> John,
>
> Thanks. That seemed to help; a job started on a node that had a job on it once the job that had been on it (‘using’ all the memory) completed.
>
> But now all my jobs won’t start and have a status of ‘JobHoldMaxRequeue’
>
> From the docs, it seems that is because MAX_BATCH_REQUEUE is too low, but I don’t see where to change that.
>
> Even worse, I cannot seem to scancel any of those jobs just to clean things up and test stuff.
>
> Anyone know how to get rid of jobs with a status of ‘JobHoldMaxRequeue’?
>
> Brian Andrus
>
>
> From: John Desantis [mailto:[email protected]]
> Sent: Tuesday, January 26, 2016 12:37 PM
> To: slurm-dev <[email protected]>
> Subject: [slurm-dev] Re: Update job and partition for shared jobs
>
> Brian,
>
> Try setting a default memory per CPU in the partition definition. Later versions of SLURM (>= 14.11.6?) require this value to be set, otherwise all memory per node is scheduled.
>
> HTH,
>
> John DeSantis
>
> 2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor <[email protected]>:
>
> > All,
> >
> > I am in the process of transitioning from Torque to Slurm.
> >
> > So far it is doing very well, especially handling arrays.
> >
> > Now I have one array job that is running across several nodes, but only using some of the node resources. I would like to have slurm start sharing the nodes so some of the array jobs will start where there are unused resources.
> >
> > I ran a scontrol update to force sharing and see the partition did change:
> >
> > #scontrol show partitions
> > PartitionName=debug
> >    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> >    AllocNodes=ALL Default=YES QoS=N/A
> >    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
> >    MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
> >    Nodes=compute[45-49]
> >    Priority=1 RootOnly=NO ReqResv=NO Shared=FORCE:4 PreemptMode=OFF
> >    State=UP TotalCPUs=280 TotalNodes=5 SelectTypeParameters=N/A
> >    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
> >
> > But it is not starting job 416_37 on any node as I would expect.
> >
> > #squeue
> >              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >    416_[37-1013%6]     debug slurm_ar    user1 PD       0:00      1 (Resources)
> >             416_36     debug slurm_ar    user1  R      35:46      1 compute49
> >             416_35     debug slurm_ar    user1  R    1:47:25      1 compute46
> >             416_33     debug slurm_ar    user1  R    7:30:50      1 compute45
> >             416_32     debug slurm_ar    user1  R    7:38:39      1 compute47
> >             416_31     debug slurm_ar    user1  R    8:53:26      1 compute48
> >
> > In my config, I have:
> >
> > SelectType = select/cons_res
> > SelectTypeParameters = CR_CORE_MEMORY
> >
> > What am I missing to get more than one job to run on a node?
> >
> > Thanks in advance,
> >
> > Brian Andrus
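
For the "scontrol release ID" suggestion at the top of this reply, here is a rough sketch of how several held jobs might be released in one pass. It assumes the held jobs show up as pending in squeue with the reason string JobHoldMaxRequeue, as in Brian's report. The function names held_requeue_jobs and release_held are made up for illustration, and I have not checked how scontrol treats an array range such as 416_[37-1013%6] as printed by squeue, so it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: held_requeue_jobs and release_held are hypothetical helper names.

# Read "<jobid> <reason>" lines on stdin and print the job IDs whose
# pending reason is JobHoldMaxRequeue.
held_requeue_jobs() {
    awk '$2 == "JobHoldMaxRequeue" { print $1 }'
}

# List pending jobs (-t PD) without a header (-h) as "<jobid> <reason>",
# then release each matching job.
# Assumption: the held jobs are in the pending state.
# Dry run by default; set DRY_RUN=0 to actually call scontrol.
release_held() {
    squeue -h -t PD -o "%i %r" | held_requeue_jobs | while read -r id; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "scontrol release $id"
        else
            scontrol release "$id"
        fi
    done
}
```

After sourcing the file, running release_held prints the commands it would issue; DRY_RUN=0 release_held runs them. Check the printed list first, since releasing a job does not address whatever made it requeue in the first place.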
