[slurm-dev] Numbering of physical and hyper cores

2017-02-08 Thread Ulf Markwardt
Dear all,

Where can I tell Slurm which CPU numbers belong to the same physical core?

The physical cores on our KNL are 0-63, followed by hyperthreads 64-255.
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
  0,64,128,192

When I ask for 4 cores with "srun --pty -c 4 -p knl bash" I see:
   taurusknl1 /home/mark taskset -pc $$
   pid 285662's current affinity list: 0,64,128,192
but these are not 4 cores, only the 4 hyperthreads of a single physical core!

It looks like Slurm does not recognize the numbering scheme for the
cores on the node. Where can I specify this?


Thank you,
Ulf

"scontrol show node " says:
   CoreSpecCount=1 CPUSpecList=252-255
these, again, are 4 threads on 4 different cores, not the 4 threads of one core!

This is my node entry for this guy:
NodeName=taurusknl[1] Sockets=1 CoresPerSocket=64 ThreadsPerCore=4
State=UNKNOWN RealMemory=94000 Weight=64 CoreSpecCount=1
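
(As a side note, not from the original report: one way to cross-check this is to
compare the hardware layout slurmd itself detects with the binding srun reports,
for example:)

   slurmd -C
   srun -p knl -c 4 --cpu_bind=verbose,cores bash -c 'taskset -pc $$'

slurmd -C prints the Sockets/CoresPerSocket/ThreadsPerCore layout it detects, and
--cpu_bind=verbose reports the mask that is actually applied to the task.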


-- 
___
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640  WWW:  http://www.tu-dresden.de/zih





[slurm-dev] Accounting and limits

2017-02-08 Thread Skouson, Gary B

Using sacctmgr you can set limits like GrpCPUMins and the other GrpTRESMins
limits.  It's pretty easy to see what the limit is, but I'm not sure how to see
how close to the limit someone is.  Is there a standard Slurm command that can
show the internal GrpTRESMins usage that slurmctld is using to enforce the limits?
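
(For context only, an illustrative sketch of the limit side, with "myacct" as a
placeholder account name:)

   sacctmgr modify account name=myacct set GrpTRESMins=cpu=100000
   sacctmgr show association where account=myacct format=Account,User,GrpTRESMins

Neither command shows the accumulated usage that slurmctld compares against the
limit, which is what I'm after.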

Looking at the code, the share request from sshare gets sent all the data, but 
there's no option to print it out.

I made a simple patch below to add a GrpTRESRaw option to sshare to print this 
info.  It has to convert the usage from double to int, which isn't ideal, but 
it was simple and is close enough for what I was looking for.

Is there a better way to get this information?
 
-
Gary Skouson

diff -Naru slurm-16.05.9/src/sshare/process.c slurm-16.05.9.change/src/sshare/process.c
--- slurm-16.05.9/src/sshare/process.c  2017-01-31 11:55:41.0 -0800
+++ slurm-16.05.9.change/src/sshare/process.c   2017-02-08 15:45:08.019494347 -0800
@@ -63,6 +63,7 @@
 	{10, "User", print_fields_str, PRINT_USER},
 	{30, "GrpTRESMins", _print_tres, PRINT_TRESMINS},
 	{30, "TRESRunMins", _print_tres, PRINT_RUNMINS},
+	{30, "GrpTRESRaw", _print_tres, PRINT_GRPTRESRAW},
 	{0,  NULL, NULL, 0}
 };
 
@@ -226,6 +227,7 @@
 	char *tmp_char = NULL;
 	char *local_acct = NULL;
 	print_field_t *field = NULL;
+	uint64_t tres_raw[tres_cnt];
 
 	if ((options & PRINT_USERS_ONLY) && share->user == 0)
 		continue;
@@ -342,6 +344,14 @@
 				     share->tres_grp_mins,
 				     (curr_inx == field_count));
 			break;
+		case PRINT_GRPTRESRAW:
+			/* convert to ints and minutes */
+			for (i=0; i<tres_cnt; i++)
+				tres_raw[i] = (uint64_t)share->usage_tres_raw[i]/60;
+			field->print_routine(field,
+					     tres_raw,
+					     (curr_inx == field_count));
+			break;
 		case PRINT_RUNMINS:
 			/* convert to minutes */
 			for (i=0; i<tres_cnt; i++)
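
With the patch applied, the new column should be selectable through sshare's
normal format option, presumably something like:

   sshare --format=Account,User,GrpTRESMins,GrpTRESRaw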

[slurm-dev] Re: Allocating at logical core level and binding separate physical cores first

2017-02-08 Thread andrealphus
(clumsy fingers)

Not sure if I understand your question correctly, but maybe:

srun --cpu_bind=threads

On Wed, Feb 8, 2017 at 4:02 PM, andrealphus  wrote:

> srun --cpu_bind=cores
>
> On Wed, Feb 8, 2017 at 1:08 PM, Brendan Moloney wrote:
>
>> Hi,
>>
>> I want to allocate at the level of logical cores (each serial job gets
>> one thread on a hyperthreading system), which seems to be achievable only
>> by not setting threads_per_core on each node, and instead just setting
>> CPUs= to the total thread count.
>>
>> However with core binding, this will pack two tasks onto the same
>> physical core while all other cores are left idle. On a system with 20
>> cores and 40 threads I see this behavior:
>>
>> $ srun bash -c "lstopo | head -n 6 ; sleep 10" &
>>   NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB)
>> + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>   NUMANode L#1 (P#1 47GB)
>>   HostBridge L#0
>> PCIBridge
>>   PCI 15b3:1003
>> $ srun bash -c "lstopo | head -n 6 ; sleep 10" &
>>   NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB)
>> + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#20)
>>   NUMANode L#1 (P#1 47GB)
>>   HostBridge L#0
>> PCIBridge
>>   PCI 15b3:1003
>>
>> I expected to see the second job get logical core #1 (on the second
>> physical core) but instead it gets logical core #20 (the second thread on
>> the first physical core). I can't imagine that this is ever the desired
>> behavior, but I guess I could be missing some use case.
>>
>> I have spent quite a bit of time reading the documentation/mailing list
>> and experimenting with different options, all to no avail. Is it possible
>> to achieve my desired setup with Slurm?
>>
>> I also experimented with setting threads_per_core=2 and then setting
>> OverSubscribe=FORCE:2, but I am pretty unhappy with the results.  I think
>> it is confusing that you request one cpu and end up with two (with your
>> --mem-per-cpu doubled), and best I can tell there is no way to only
>> oversubscribe if the user requested 1 core instead of two.
>>
>> Thanks for your time,
>> Brendan
>>
>>
>>
>>
>


[slurm-dev] Re: Allocating at logical core level and binding separate physical cores first

2017-02-08 Thread andrealphus
srun --cpu_bind=cores

On Wed, Feb 8, 2017 at 1:08 PM, Brendan Moloney wrote:

> Hi,
>
> I want to allocate at the level of logical cores (each serial job gets one
> thread on a hyperthreading system), which seems to be achievable only by
> not setting threads_per_core on each node, and instead just setting
> CPUs= to the total thread count.
>
> However with core binding, this will pack two tasks onto the same physical
> core while all other cores are left idle. On a system with 20 cores and 40
> threads I see this behavior:
>
> $ srun bash -c "lstopo | head -n 6 ; sleep 10" &
>   NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB) +
> L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>   NUMANode L#1 (P#1 47GB)
>   HostBridge L#0
> PCIBridge
>   PCI 15b3:1003
> $ srun bash -c "lstopo | head -n 6 ; sleep 10" &
>   NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB) +
> L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#20)
>   NUMANode L#1 (P#1 47GB)
>   HostBridge L#0
> PCIBridge
>   PCI 15b3:1003
>
> I expected to see the second job get logical core #1 (on the second
> physical core) but instead it gets logical core #20 (the second thread on
> the first physical core). I can't imagine that this is ever the desired
> behavior, but I guess I could be missing some use case.
>
> I have spent quite a bit of time reading the documentation/mailing list
> and experimenting with different options, all to no avail. Is it possible
> to achieve my desired setup with Slurm?
>
> I also experimented with setting threads_per_core=2 and then setting
> OverSubscribe=FORCE:2, but I am pretty unhappy with the results.  I think
> it is confusing that you request one cpu and end up with two (with your
> --mem-per-cpu doubled), and best I can tell there is no way to only
> oversubscribe if the user requested 1 core instead of two.
>
> Thanks for your time,
> Brendan
>
>
>
>


[slurm-dev] Re: Job priority/cluster utilization help

2017-02-08 Thread Christopher Samuel

On 08/02/17 11:19, Vicker, Darby (JSC-EG311) wrote:

> Sorry for the long post but not sure how to get adequate help without
> providing a lot of detail.  Any recommendations on configuring the
> scheduler to help these jobs run and increase the cluster utilization
> would be appreciated.

My one thought after a quick scan is that both the jobs you mention are
listed as reason "Priority" and there's a higher priority job 1772 in
the list before them.  You might want to look at your backfill settings
to see whether it's looking far enough down the queue to see these.
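
(Purely as an illustration of which knobs are involved, not a recommendation of
specific values, the relevant settings live in SchedulerParameters in slurm.conf:)

   SchedulerParameters=bf_continue,bf_max_job_test=1000,bf_window=4320

bf_max_job_test sets how many queued jobs backfill will examine per cycle and
bf_window how many minutes into the future it plans.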

An alternative idea would be to use partitions instead of features, and then
have people submit to all partitions (there is a plugin for that, though we
use a submit filter to accomplish the same thing).

That way Slurm should consider each job against each partition (set of
architectures) individually.

Best of luck!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Allocating at logical core level and binding separate physical cores first

2017-02-08 Thread Brendan Moloney
Hi,

I want to allocate at the level of logical cores (each serial job gets one
thread on a hyperthreading system), which seems to be achievable only by
not setting threads_per_core on each node, and instead just setting
CPUs= to the total thread count.
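
(To make the contrast concrete, a sketch of the two node definitions I mean;
"node01" and the memory value are placeholders for the 20-core/40-thread box
described below:)

   # logical-CPU-level allocation: advertise all 40 hardware threads as CPUs
   NodeName=node01 CPUs=40 RealMemory=96000 State=UNKNOWN
   # versus the topology-aware form, which allocates whole cores:
   # NodeName=node01 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=96000 State=UNKNOWN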

However with core binding, this will pack two tasks onto the same physical
core while all other cores are left idle. On a system with 20 cores and 40
threads I see this behavior:

$ srun bash -c "lstopo | head -n 6 ; sleep 10" &
  NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB) +
L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  NUMANode L#1 (P#1 47GB)
  HostBridge L#0
PCIBridge
  PCI 15b3:1003
$ srun bash -c "lstopo | head -n 6 ; sleep 10" &
  NUMANode L#0 (P#0 47GB) + Package L#0 + L3 L#0 (25MB) + L2 L#0 (256KB) +
L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#20)
  NUMANode L#1 (P#1 47GB)
  HostBridge L#0
PCIBridge
  PCI 15b3:1003

I expected to see the second job get logical core #1 (on the second
physical core) but instead it gets logical core #20 (the second thread on
the first physical core). I can't imagine that this is ever the desired
behavior, but I guess I could be missing some use case.

I have spent quite a bit of time reading the documentation/mailing list and
experimenting with different options, all to no avail. Is it possible to
achieve my desired setup with Slurm?

I also experimented with setting threads_per_core=2 and then setting
OverSubscribe=FORCE:2, but I am pretty unhappy with the results.  I think
it is confusing that you request one cpu and end up with two (with your
--mem-per-cpu doubled), and best I can tell there is no way to only
oversubscribe if the user requested 1 core instead of two.

Thanks for your time,
Brendan


[slurm-dev] Slurm-16.05.9-1 can't start a batch script when allocated nodes are in power save mode (Fix included)

2017-02-08 Thread Didier GAZEN

Hi,

When the following conditions are met:

- submitting a script with sbatch
- allocation done on nodes in power save mode
- backfill scheduler
- no PrologSlurmctld program

then the routine 'launch_job' (job_scheduler.c) is never called, causing the job
to be completed by '_purge_missing_jobs' (job_mgr.c) with the following log
message:

[2017-02-08T16:00:36.272] Batch JobId=214 missing from node 0 (not found BatchStartTime after startup)
[2017-02-08T16:00:36.272] job_complete: JobID=214 State=0x1 NodeCnt=1 WTERMSIG 126
[2017-02-08T16:00:36.272] job_complete: JobID=214 State=0x1 NodeCnt=1 cancelled by node failure


Before being cancelled, the job status appears in squeue as:
- 'Configuring' during the boot process of the nodes being resumed from power save
- 'Running' once the nodes are up (but the script will never be started)

I have done some work to track down the bug:

The routine 'launch_job' is called by several functions in slurmctld:

(1) _start_job          (backfill.c)      if the job's CONFIGURING flag is false
(2) _schedule           (job_scheduler.c) if the job's CONFIGURING flag is false
(3) prolog_running_decr (job_scheduler.c) in case a PrologSlurmctld program is run
(4) job_time_limit      (job_mgr.c)       if the nodes are coming from REBOOT


It seems that functions (1) or (2) may be called during job submission, but the
job's CONFIGURING flag is true because the job is started on allocated nodes
that are in power save mode => launch_job cannot be called. Later, functions (1)
and (2) are called periodically, but as they deal only with PENDING jobs, our
RUNNING job is skipped => launch_job cannot be called.

Function (3) is only called when a PrologSlurmctld program is defined; I don't
have one => launch_job cannot be called. Note that when a PrologSlurmctld
program is defined, there is no problem.

Finally, the issue can be fixed in the 'job_time_limit' function (4), which is
periodically called for RUNNING jobs. I am just not sure that this does not
break the logic for the NODE_REBOOT case, but it is working fine:

diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
index 1d961ab..d6463cc 100644
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -7583,9 +7583,10 @@ void job_time_limit(void)
 		if (job_ptr->bit_flags & NODE_REBOOT) {
 			job_ptr->bit_flags &= (~NODE_REBOOT);
 			job_validate_mem(job_ptr);
-			if (job_ptr->batch_flag)
-				launch_job(job_ptr);
-		}
+		}
+		if (job_ptr->batch_flag) {
+			launch_job(job_ptr);
+		}
 	}
 #endif
 	/* This needs to be near the top of the loop, checks every

What do you think?

Best regards,

Didier


[slurm-dev] Re: BadConstraints after maintenance (Slurm 15.08.8)

2017-02-08 Thread Steffen Grunewald

On Wed, 2017-02-08 at 06:53:48 -0800, Steffen Grunewald wrote:
> 
> Hi,
> 
> after an all_nodes reservation for maintenance, a couple of jobs didn't start.
> Instead, they complain about BadConstraints.
> Since they are clones of jobs that ran perfectly before this is puzzling.
> 
> Looking deeper into the job details, I believe I have found what causes this
> - but the deeper reason is still unclear:
> 
> # scontrol show job 12345
>JobState=PENDING Reason=BadConstraints Dependency=(null)
>Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
>NumNodes=2 NumCPUs=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>TRES=cpu=32,node=1
>MinCPUsNode=32 MinMemoryNode=0 MinTmpDiskNode=0
>  
> The user had requested 2 (NumNodes) nodes, with a total of 32 (NumCPUs) cores.
> This is OK, since the nodes have 16 cores each.
> The TRES part looks strange though, as the total cpu count is still correct,
> but the number of nodes has been set to 1 only.
> As a consequence, a matching node must have 32 (MinCPUsNode) cores, which is
> impossible to fulfill.
> 
> Attempts to change the values failed as 
> # scontrol update job=12345 MinCPUsNode=16
> returns without having changed anything, and TRES cannot be modified.
> 
> Is there a way to adjust the values to make the jobs runnable again?

In the end, I tried to do the not-so-obvious and "changed" NumNodes and NumCPUs
to the values "scontrol show job 12345" reported. The jobs went into JobHeldAdmin
state again, but could be "scontrol release"d this time. Phew.
(Apparently "changing" these values made Slurm recalculate the requirements,
which now fit.)
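
(For the archive, a rough sketch of the commands this amounted to, using the job
id from the example above:)

   scontrol update job=12345 NumNodes=2 NumCPUs=32
   scontrol release 12345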

The question remains:

> What may have caused Slurm (which had not been stopped during the reservation)
> to mangle these values?
And why is there an obvious discrepancy between NumNodes and NumCPUs on one hand
and TRES nodes and MinCPUsNode on the other?

Cheers,
 S


[slurm-dev] BadConstraints after maintenance (Slurm 15.08.8)

2017-02-08 Thread Steffen Grunewald

Hi,

after an all_nodes reservation for maintenance, a couple of jobs didn't start.
Instead, they complain about BadConstraints.
Since they are clones of jobs that ran perfectly before this is puzzling.

Looking deeper into the job details, I believe I have found what causes this
- but the deeper reason is still unclear:

# scontrol show job 12345
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   NumNodes=2 NumCPUs=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1
   MinCPUsNode=32 MinMemoryNode=0 MinTmpDiskNode=0
 
The user had requested 2 (NumNodes) nodes, with a total of 32 (NumCPUs) cores.
This is OK, since the nodes have 16 cores each.
The TRES part looks strange though, as the total cpu count is still correct, but
the number of nodes has been set to 1 only.
As a consequence, a matching node must have 32 (MinCPUsNode) cores, which is
impossible to fulfill.

Attempts to change the values failed as 
# scontrol update job=12345 MinCPUsNode=16
returns without having changed anything, and TRES cannot be modified.

Is there a way to adjust the values to make the jobs runnable again?

What may have caused Slurm (which had not been stopped during the reservation)
to mangle these values?

Thanks
 S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~


[slurm-dev] Re: Unable to allocate Gres by type

2017-02-08 Thread Michael Di Domenico

On Mon, Feb 6, 2017 at 1:55 PM, Hans-Nikolai Viessmann  wrote:
> Hi Michael,
>
> Yes, on all the compute nodes there is a gres.conf, and all the GPU nodes
> except gpu08 have the following defined:
>
> Name=gpu Count=1
> Name=mic Count=0
>
> The head node has this defined:
>
> Name=gpu Count=0
> Name=mic Count=0
>
> Is it possible that Gres Type needs to be specified for all nodes (of a
> particular gres type, e.g. gpu) in order to use type-based allocation?
>
> So should I perhaps update the gres.conf file on the gpu nodes to something
> like this:
>
> Name=gpu Type=tesla Count=1
> Name=mic Count=0
>
> Would that make a difference?

Not sure, this is starting to get beyond my troubleshooting ability.
Here's what i have defined:

slurm.conf
nodename=host001 gres=gpu:k10:8

gres.conf
name=gpu file=/dev/nvidia0 type=k10
name=gpu file=/dev/nvidia1 type=k10
name=gpu file=/dev/nvidia2 type=k10
name=gpu file=/dev/nvidia3 type=k10
name=gpu file=/dev/nvidia4 type=k10
name=gpu file=/dev/nvidia5 type=k10
name=gpu file=/dev/nvidia6 type=k10
name=gpu file=/dev/nvidia7 type=k10

If that isn't working for you, I would take the "type" definitions out of
both the slurm.conf and the gres.conf and see if it works then.  There
was a bug a couple of revs ago with the gres types, which is resolved,
but maybe it regressed.
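
(Two quick sanity checks that may also help narrow it down; host001 is just the
node from my config above:)

   scontrol show node host001 | grep -i gres       # does slurmctld report gpu:k10:8?
   srun -w host001 --gres=gpu:k10:1 nvidia-smi -L  # try a type-qualified request directly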