[slurm-dev] Numbering of physical and hyper cores
Dear all,

where can I tell Slurm which core numbers belong to the same physical core? The physical cores on our KNL are 0-63, followed by hyperthreads 64-255:

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,64,128,192

When I ask for 4 cores with "srun --pty -c 4 -p knl bash" I see:

taurusknl1 /home/mark> taskset -pc $$
pid 285662's current affinity list: 0,64,128,192

But these are not 4 cores, only the 4 hyperthreads of a single core! It looks like Slurm does not recognize the numbering scheme for the cores on the node. Where can I specify this?

"scontrol show node" says:

   CoreSpecCount=1 CPUSpecList=252-255

These, again, are 4 threads on 4 different cores! This is my node entry for this machine:

NodeName=taurusknl[1] Sockets=1 CoresPerSocket=64 ThreadsPerCore=4 State=UNKNOWN RealMemory=94000 Weight=64 CoreSpecCount=1

Thank you,
Ulf

--
Dr. Ulf Markwardt
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany
Phone: (+49) 351/463-33640
WWW: http://www.tu-dresden.de/zih
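For what it's worth, the numbering scheme Ulf describes can be sketched as follows (a hypothetical helper, assuming 64 physical cores with 4 hyperthreads each, numbered as on this KNL node):

```python
# Sketch of the KNL CPU numbering described above: logical CPU ids
# 0-63 are the first thread of each physical core, and the sibling
# threads of physical core c are c, c+64, c+128, c+192.
PHYS_CORES = 64
THREADS_PER_CORE = 4

def thread_siblings(core: int) -> list:
    """Return the logical CPU ids sharing physical core `core`."""
    return [core + t * PHYS_CORES for t in range(THREADS_PER_CORE)]

# Matches thread_siblings_list for cpu0 on this node:
print(thread_siblings(0))   # [0, 64, 128, 192]
```

So the affinity list 0,64,128,192 that taskset reports is exactly one physical core's sibling set, not four distinct cores.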
[slurm-dev] Accounting and limits
Using sacctmgr you can set limits like GrpCPUMins and other GrpTRESMins. It's pretty easy to see what the limit is, but I'm not sure how to see how close to the limit someone is. Is there a normal Slurm command that can get you the internal GrpTRESMins info slurmctld is using to enforce limits?

Looking at the code, the share request from sshare gets sent all the data, but there's no option to print it out. I made a simple patch below to add a GrpTRESRaw option to sshare to print this info. It has to convert the usage from double to int, which isn't ideal, but it was simple and is close enough for what I was looking for. Is there a better way to get this information?

- Gary Skouson

diff -Naru slurm-16.05.9/src/sshare/process.c slurm-16.05.9.change/src/sshare/process.c
--- slurm-16.05.9/src/sshare/process.c	2017-01-31 11:55:41.0 -0800
+++ slurm-16.05.9.change/src/sshare/process.c	2017-02-08 15:45:08.019494347 -0800
@@ -63,6 +63,7 @@
 	{10, "User", print_fields_str, PRINT_USER},
 	{30, "GrpTRESMins", _print_tres, PRINT_TRESMINS},
 	{30, "TRESRunMins", _print_tres, PRINT_RUNMINS},
+	{30, "GrpTRESRaw", _print_tres, PRINT_GRPTRESRAW},
 	{0, NULL, NULL, 0}
 };
@@ -226,6 +227,7 @@
 	char *tmp_char = NULL;
 	char *local_acct = NULL;
 	print_field_t *field = NULL;
+	uint64_t tres_raw[tres_cnt];
 	if ((options & PRINT_USERS_ONLY) && share->user == 0)
 		continue;
@@ -342,6 +344,14 @@
 				share->tres_grp_mins,
 				(curr_inx == field_count));
 			break;
+		case PRINT_GRPTRESRAW:
+			/* convert to ints and minutes */
+			for (i = 0; i < tres_cnt; i++)
+				tres_raw[i] = share->usage_tres_raw[i] / 60;
+			field->print_routine(field,
+					     tres_raw,
+					     (curr_inx == field_count));
+			break;
 		case PRINT_RUNMINS:
 			/* convert to minutes */
 			for (i = 0; i < tres_cnt; i++)
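The conversion the patch performs can be sketched in Python (the function name is hypothetical; the point is just the double-seconds to integer-minutes truncation the patch applies before printing):

```python
# Raw TRES usage is tracked by slurmctld as seconds in floating
# point; the patch truncates it to integer minutes before printing,
# which is "close enough" for comparing against a GrpTRESMins limit.
def usage_to_minutes(usage_tres_raw):
    """Mirror the patch's per-TRES double->uint64 minutes conversion."""
    return [int(u / 60) for u in usage_tres_raw]

print(usage_to_minutes([3600.0, 90.5]))  # [60, 1]
```

As Gary notes, the truncation loses sub-minute precision, but for checking how close an association is to its limit that is usually fine.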
[slurm-dev] Re: Allocating at logical core level and binding separate physical cores first
(clumsy fingers) if I understand your question correctly, but maybe:

srun --cpu_bind=threads

On Wed, Feb 8, 2017 at 4:02 PM, andrealphus wrote:
> srun --cpu_bind=cores
>
> On Wed, Feb 8, 2017 at 1:08 PM, Brendan Moloney wrote:
>> Hi,
>>
>> I want to allocate at the level of logical cores (each serial job gets
>> one thread on a hyperthreading system), which seems to be achievable only
>> by not setting threads_per_core on each node, and instead just setting
>> CPUs=.
>>
>> However with core binding, this will pack two tasks onto the same
>> physical core while all other cores are left idle.
>> [...]
>>
>> Thanks for your time,
>> Brendan
[slurm-dev] Re: Allocating at logical core level and binding separate physical cores first
srun --cpu_bind=cores

On Wed, Feb 8, 2017 at 1:08 PM, Brendan Moloney wrote:
> Hi,
>
> I want to allocate at the level of logical cores (each serial job gets one
> thread on a hyperthreading system), which seems to be achievable only by
> not setting threads_per_core on each node, and instead just setting
> CPUs=.
>
> However with core binding, this will pack two tasks onto the same physical
> core while all other cores are left idle.
> [...]
>
> Thanks for your time,
> Brendan
[slurm-dev] Re: Job priority/cluster utilization help
On 08/02/17 11:19, Vicker, Darby (JSC-EG311) wrote:
> Sorry for the long post but not sure how to get adequate help without
> providing a lot of detail. Any recommendations on configuring the
> scheduler to help these jobs run and increase the cluster utilization
> would be appreciated.

My one thought after a quick scan is that both the jobs you mention are listed with reason "Priority", and there's a higher-priority job 1772 in the list before them. You might want to look at your backfill settings to see whether the scheduler is looking far enough down the queue to see these.

An alternative idea would be to use partitions instead of features and then have people submit to all partitions (there is a plugin for that, though we use a submit filter instead to accomplish the same). That way Slurm should consider each job against each partition (set of architectures) individually.

Best of luck!
Chris

--
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/    http://twitter.com/vlsci
[slurm-dev] Allocating at logical core level and binding separate physical cores first
Hi,

I want to allocate at the level of logical cores (each serial job gets one thread on a hyperthreading system), which seems to be achievable only by not setting threads_per_core on each node, and instead just setting CPUs=.

However with core binding, this will pack two tasks onto the same physical core while all other cores are left idle. On a system with 20 cores and 40 threads I see this behavior:

$ srun bash -c "lstopo | head -n 6 ; sleep 10" &
NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
NUMANode L#1 (P#1 47GB)
HostBridge L#0
PCIBridge
PCI 15b3:1003

$ srun bash -c "lstopo | head -n 6 ; sleep 10" &
NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#20)
NUMANode L#1 (P#1 47GB)
HostBridge L#0
PCIBridge
PCI 15b3:1003

I expected to see the second job get logical core #1 (on the second physical core), but instead it gets logical core #20 (the second thread on the first physical core). I can't imagine that this is ever the desired behavior, but I guess I could be missing some use case.

I have spent quite a bit of time reading the documentation/mailing list and experimenting with different options, all to no avail. Is it possible to achieve my desired setup with Slurm?

I also experimented with setting threads_per_core=2 and then setting OverSubscribe=FORCE:2, but I am pretty unhappy with the results. I think it is confusing that you request one cpu and end up with two (with your --mem-per-cpu doubled), and best I can tell there is no way to only oversubscribe if the user requested 1 core instead of two.

Thanks for your time,
Brendan
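The placement Brendan expects (fill distinct physical cores before reusing any core's second hyperthread) amounts to enumerating logical CPUs breadth-first across cores. A small sketch, assuming the 20-core/40-thread numbering from the lstopo output above (thread siblings are P#n and P#n+20; the helper name is made up):

```python
# Breadth-first enumeration of logical CPU ids for a node with
# 20 physical cores and 2 hyperthreads each: every core's first
# thread (0-19) comes before any core's second thread (20-39).
CORES = 20
THREADS_PER_CORE = 2

def preferred_order():
    """Logical CPU ids, all first threads before any second thread."""
    return [t * CORES + c
            for t in range(THREADS_PER_CORE)
            for c in range(CORES)]

# With this ordering the second serial job would land on logical
# CPU 1 (second physical core), not on CPU 20 (first core's sibling).
print(preferred_order()[:4])  # [0, 1, 2, 3]
```

Slurm's observed behavior corresponds instead to a depth-first order (0, 20, 1, 21, ...), which is why the second job ended up sharing the first physical core.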
[slurm-dev] Slurm-16.05.9-1 can't start a batch script when allocated nodes are in power save mode (Fix included)
Hi,

When the following conditions are met:
- submitting a script with sbatch
- allocation done on nodes in power save mode
- backfill scheduler
- no PrologSlurmctld program

then the routine 'launch_job' (job_scheduler.c) is never called, causing the job to be completed by '_purge_missing_jobs' (job_mgr.c) with the following log messages:

[2017-02-08T16:00:36.272] Batch JobId=214 missing from node 0 (not found BatchStartTime after startup)
[2017-02-08T16:00:36.272] job_complete: JobID=214 State=0x1 NodeCnt=1 WTERMSIG 126
[2017-02-08T16:00:36.272] job_complete: JobID=214 State=0x1 NodeCnt=1 cancelled by node failure

Before being cancelled, the job status appears in squeue as:
- 'Configuring' during the boot process of nodes being resumed from power save
- 'Running' once the nodes are up (but no script will ever be started)

I have done some work to track down the bug. The routine 'launch_job' is called by several functions in slurmctld:

(1) _start_job (backfill.c) if the job's CONFIGURING flag is false
(2) _schedule (job_scheduler.c) if the job's CONFIGURING flag is false
(3) prolog_running_decr (job_scheduler.c) in case a PrologSlurmctld program is run
(4) job_time_limit (job_mgr.c) if the nodes are coming from REBOOT

Functions (1) or (2) may be called during job submission, but the job's CONFIGURING flag is true because the job is started on allocated nodes that are in power save mode => launch_job cannot be called. Later, functions (1) and (2) are called periodically, but as they deal only with PENDING jobs, our RUNNING job is skipped => launch_job cannot be called. Function (3) is called only when a PrologSlurmctld program is defined; I don't have one => launch_job cannot be called. (Note that when a PrologSlurmctld program is defined, there is no problem.)

Finally, the issue can be fixed in the 'job_time_limit' function (4), which is periodically called for RUNNING jobs.
I am not sure whether this breaks the logic for the NODE_REBOOT case, but it is working fine for me:

diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
index 1d961ab..d6463cc 100644
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -7583,9 +7583,10 @@ void job_time_limit(void)
 		if (job_ptr->bit_flags & NODE_REBOOT) {
 			job_ptr->bit_flags &= (~NODE_REBOOT);
 			job_validate_mem(job_ptr);
-			if (job_ptr->batch_flag)
-				launch_job(job_ptr);
-		}
+		}
+		if (job_ptr->batch_flag) {
+			launch_job(job_ptr);
+		}
 	}
 #endif
 	/* This needs to be near the top of the loop, checks every

What do you think?

Best regards,
Didier
[slurm-dev] Re: BadConstraints after maintenance (Slurm 15.08.8)
On Wed, 2017-02-08 at 06:53:48 -0800, Steffen Grunewald wrote:
> Hi,
>
> after an all_nodes reservation for maintenance, a couple of jobs didn't start.
> Instead, they complain about BadConstraints.
> Since they are clones of jobs that ran perfectly before, this is puzzling.
>
> # scontrol show job 12345
>    JobState=PENDING Reason=BadConstraints Dependency=(null)
>    NumNodes=2 NumCPUs=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=32,node=1
>    MinCPUsNode=32 MinMemoryNode=0 MinTmpDiskNode=0
> [...]
>
> Is there a way to adjust the values to make the jobs runnable again?

In the end, I tried the not-so-obvious and "changed" NumNodes and NumCPUs to the very values "scontrol show job 12345" reported. The jobs went into JobHeldAdmin state again, but could be "scontrol release"d this time. Phew. (Apparently "changing" these values made Slurm recalculate the requirements, which now fit.)

The question remains:
> What may have caused Slurm (which had not been stopped during the reservation)
> to mangle these values?

And why is there an obvious discrepancy between NumNodes and NumCPUs on one hand and TRES nodes and MinCPUsNode on the other?

Cheers,
S
[slurm-dev] BadConstraints after maintenance (Slurm 15.08.8)
Hi,

after an all_nodes reservation for maintenance, a couple of jobs didn't start. Instead, they complain about BadConstraints. Since they are clones of jobs that ran perfectly before, this is puzzling.

Looking deeper into the job details, I believe I have found what causes this, but the deeper reason is still unclear:

# scontrol show job 12345
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   NumNodes=2 NumCPUs=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1
   MinCPUsNode=32 MinMemoryNode=0 MinTmpDiskNode=0

The user had requested 2 (NumNodes) nodes with a total of 32 (NumCPUs) cores. This is OK, since the nodes have 16 cores each. The TRES part looks strange though: the total cpu count is still correct, but the number of nodes has been set to 1 only. As a consequence, a matching node must have 32 (MinCPUsNode) cores, which is impossible to fulfill.

Attempts to change the values failed:

# scontrol update job=12345 MinCPUsNode=16

returns without having changed anything, and TRES cannot be modified.

Is there a way to adjust the values to make the jobs runnable again? What may have caused Slurm (which had not been stopped during the reservation) to mangle these values?

Thanks,
S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~
[slurm-dev] Re: Unable to allocate Gres by type
On Mon, Feb 6, 2017 at 1:55 PM, Hans-Nikolai Viessmann wrote:
> Hi Michael,
>
> Yes, on all the compute nodes there is a gres.conf, and all the GPU nodes
> except gpu08 have the following defined:
>
> Name=gpu Count=1
> Name=mic Count=0
>
> The head node has this defined:
>
> Name=gpu Count=0
> Name=mic Count=0
>
> Is it possible that the Gres Type needs to be specified for all nodes (of a
> particular gres type, e.g. gpu) in order to use type-based allocation?
>
> So should I perhaps update the gres.conf file on the GPU nodes to something
> like this:
>
> Name=gpu Type=tesla Count=1
> Name=mic Count=0
>
> Would that make a difference?

Not sure, this is starting to get beyond my troubleshooting ability. Here's what I have defined:

slurm.conf:
nodename=host001 gres=gpu:k10:8

gres.conf:
name=gpu file=/dev/nvidia0 type=k10
name=gpu file=/dev/nvidia1 type=k10
name=gpu file=/dev/nvidia2 type=k10
name=gpu file=/dev/nvidia3 type=k10
name=gpu file=/dev/nvidia4 type=k10
name=gpu file=/dev/nvidia5 type=k10
name=gpu file=/dev/nvidia6 type=k10
name=gpu file=/dev/nvidia7 type=k10

If that isn't working for you, I would take the "type" definitions out of both the slurm.conf and the gres.conf and see if it then works. There was a bug a couple of revs ago with the gres types, which is resolved, but maybe it regressed.
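For completeness, a typed single-GPU configuration along the lines Hans-Nikolai proposes might look like the sketch below. This is an assumption, not a confirmed setup: the "tesla" type name and the /dev/nvidia0 device path are illustrative, and the key point is only that the type string in slurm.conf's Gres= must match the Type= in that node's gres.conf.

```
# slurm.conf (node definition; type appears in the GRES string)
NodeName=gpu08 Gres=gpu:tesla:1

# gres.conf on gpu08 (Type must match the slurm.conf entry)
Name=gpu Type=tesla File=/dev/nvidia0 Count=1
Name=mic Count=0
```

Users could then request the typed resource with something like "sbatch --gres=gpu:tesla:1 ...".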