I am trying sending jobs using the user's account. There is no existing allocation. I am canceling all reservations and allocations of of the user before sending jobs.
I just checked the squeue.. There is 10 pending jobs due to "Priority", and 8 pending jobs due to "Resources" although %60 of resources are IDLE. I am thinking that, if there is idle resources, all pending jobs should be started (if there is no AssociationResourceLimit for owner of the jobs), am I wrong? On 11/05/2012 06:06 PM, Moe Jette wrote: > The user is probably running within an existing salloc allocation. In > that case, srun would ru na step within the existing job allocation > and sbatch would create a new job allocation. > > Moe Jette > SchedMD > > Quoting Sefa Arslan <[email protected]>: > >> >> One of our user has such a porblem. >> He sends his batch file via sbatch..The batch file is : >> >> #!/bin/bash >> >> #SBATCH -p partition-1 >> >> #SBATCH -J test >> >> #SBATCH -N 2 >> >> #SBATCH -n 48 >> >> #SBATCH --output=job.%j.out >> >> #SBATCH --error=job.%j.err >> >> srun hostname >> >> >> >> The job is waiting due to the resources (sometime due to priority >> although only %8-%10 of the partition is utilized): >> >> 138791 partition-1 test mesahin PD 0:00 2 (Resources) >> >> >> sprio: >> >> JOBID PRIORITY AGE FAIRSHARE JOBSIZE PARTITION >> >> 138791 1990 0 0 991 1000 >> >> >> But he is able to run the same job via srun >> >> srun -N 2 -n 48 -p partition-1 -J test hostname >> >> >> >> I could not understand how a user can not run a job via sbatch while he >> could run the same job via srun. Do I have something wrong in my >> configuration? >> the slurm version is slurm-2.4.1-1. >> >> PS: Only a few user have this problem. The others are able to run jobs >> both way. And only %10 of the partition is utilized. >> >> >> scontrol show partition: >> >> PartitionName=partition-1 >> AllocNodes=ALL AllowGroups=ALL Default=NO >> DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO >> MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 >> Nodes=mercan[7-192] >> Priority=1500 RootOnly=NO ReqResv=NO Shared=NO >> PreemptMode=GANG,SUSPEND >> State=UP TotalCPUs=4464 TotalNodes=186 DefMemPerCPU=5000 >> MaxMemPerNode=124800 >> >> >> scontrol show config: >> Configuration data as of 2012-11-05T16:22:17 >> AccountingStorageBackupHost = (null) >> AccountingStorageEnforce = associations,limits >> AccountingStorageHost = mercan5 >> AccountingStorageLoc = N/A >> AccountingStoragePort = 6819 >> AccountingStorageType = accounting_storage/slurmdbd >> AccountingStorageUser = N/A >> AccountingStoreJobComment = YES >> AuthType = auth/munge >> BackupAddr = mercan6 >> BackupController = mercan6 >> BatchStartTimeout = 10 sec >> BOOT_TIME = 2012-10-24T16:20:38 >> CacheGroups = 1 >> CheckpointType = checkpoint/none >> ClusterName = linux >> CompleteWait = 0 sec >> ControlAddr = mercan5 >> ControlMachine = mercan5 >> CryptoType = crypto/munge >> DebugFlags = NO_CONF_HASH >> DefMemPerNode = UNLIMITED >> DisableRootJobs = NO >> EnforcePartLimits = YES >> Epilog = (null) >> EpilogMsgTime = 2000 usec >> EpilogSlurmctld = (null) >> FastSchedule = 1 >> FirstJobId = 100000 >> GetEnvTimeout = 2 sec >> GresTypes = gpu >> GroupUpdateForce = 0 >> GroupUpdateTime = 600 sec >> HASH_VAL = Match >> HealthCheckInterval = 0 sec >> HealthCheckProgram = (null) >> InactiveLimit = 0 sec >> JobAcctGatherFrequency = 30 sec >> JobAcctGatherType = jobacct_gather/linux >> JobCheckpointDir = /var/slurm/checkpoint >> JobCompHost = localhost >> JobCompLoc = /var/log/slurm/job_completions >> JobCompPort = 0 >> JobCompType = jobcomp/filetxt >> JobCompUser = root >> JobCredentialPrivateKey = (null) >> JobCredentialPublicCertificate = (null) >> JobFileAppend = 0 >> JobRequeue = 1 >> JobSubmitPlugins = (null) >> KillOnBadExit = 0 >> KillWait = 30 sec >> Licenses = (null) >> MailProg = /bin/mail >> MaxJobCount = 1000000 >> MaxJobId = 4294901760 >> MaxMemPerNode = UNLIMITED >> MaxStepCount = 40000 >> MaxTasksPerNode = 128 >> MessageTimeout = 10 sec >> MinJobAge = 300 sec >> MpiDefault = none >> MpiParams = (null) >> NEXT_JOB_ID = 138808 >> OverTimeLimit = 0 min >> PluginDir = /usr/lib64/slurm >> PlugStackConfig = /etc/slurm/plugstack.conf >> PreemptMode = GANG,SUSPEND >> PreemptType = preempt/partition_prio >> PriorityDecayHalfLife = 00:00:00 >> PriorityCalcPeriod = 00:05:00 >> PriorityFavorSmall = 1 >> PriorityFlags = 0 >> PriorityMaxAge = 14-00:00:00 >> PriorityUsageResetPeriod = NONE >> PriorityType = priority/multifactor >> PriorityWeightAge = 1000 >> PriorityWeightFairShare = 10000 >> PriorityWeightJobSize = 1000 >> PriorityWeightPartition = 1000 >> PriorityWeightQOS = 0 >> PrivateData = jobs,usage,users,accounts,reservations >> ProctrackType = proctrack/cgroup >> Prolog = (null) >> PrologSlurmctld = (null) >> PropagatePrioProcess = 0 >> PropagateResourceLimits = (null) >> PropagateResourceLimitsExcept = MEMLOCK >> RebootProgram = (null) >> ReconfigFlags = (null) >> ResumeProgram = (null) >> ResumeRate = 300 nodes/min >> ResumeTimeout = 60 sec >> ResvOverRun = 0 min >> ReturnToService = 2 >> SallocDefaultCommand = (null) >> SchedulerParameters = (null) >> SchedulerPort = 7321 >> SchedulerRootFilter = 1 >> SchedulerTimeSlice = 30 sec >> SchedulerType = sched/builtin >> SelectType = select/cons_res >> SelectTypeParameters = CR_CPU_MEMORY >> SlurmUser = root(0) >> SlurmctldDebug = debug3 >> SlurmctldLogFile = /var/log/slurm/slurmctld.log >> SlurmSchedLogFile = (null) >> SlurmctldPort = 6817 >> SlurmctldTimeout = 300 sec >> SlurmdDebug = debug3 >> SlurmdLogFile = /var/log/slurm/slurmd.log >> SlurmdPidFile = /var/run/slurmd.pid >> SlurmdPort = 6818 >> SlurmdSpoolDir = /tmp/slurmd >> SlurmdTimeout = 300 sec >> SlurmdUser = root(0) >> SlurmSchedLogLevel = 0 >> SlurmctldPidFile = /var/run/slurmctld.pid >> SLURM_CONF = /etc/slurm/slurm.conf >> SLURM_VERSION = 2.4.1 >> SrunEpilog = (null) >> SrunProlog = (null) >> StateSaveLocation = /home_palamut1/slurm/slurm.state >> SuspendExcNodes = (null) >> SuspendExcParts = (null) >> SuspendProgram = (null) >> SuspendRate = 60 nodes/min >> SuspendTime = NONE >> SuspendTimeout = 30 sec >> SwitchType = switch/none >> TaskEpilog = (null) >> TaskPlugin = task/cgroup >> TaskPluginParam = (null type) >> TaskProlog = (null) >> TmpFS = /tmp >> TopologyPlugin = topology/none >> TrackWCKey = 0 >> TreeWidth = 50 >> UsePam = 1 >> UnkillableStepProgram = (null) >> UnkillableStepTimeout = 60 sec >> VSizeFactor = 0 percent >> WaitTime = 0 sec >> >> Slurmctld(primary/backup) at mercan5/mercan6 are UP/UP >> >> >> > > >
