Hello,

I now have more information about the problem. Linux on POWER7 processors (with SMT disabled) has non-consecutive CPU IDs:

cat /proc/cpuinfo  | grep processor
processor    : 0
processor    : 4
processor    : 8
processor    : 12
processor    : 16
processor    : 20
processor    : 24
processor    : 28
processor    : 32
processor    : 36
processor    : 40
processor    : 44
processor    : 48
processor    : 52
processor    : 56
processor    : 60

I think this is because these processors do not need a reboot to enable or disable SMT.
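
To check this on a node, a small Python sketch like the following could parse the "processor" lines and test whether the IDs are consecutive. It is only an illustration: the function names are made up here, and the sample text just reproduces the listing above instead of reading /proc/cpuinfo.

```python
import re

def parse_cpu_ids(cpuinfo_text):
    """Return the CPU IDs found on 'processor : N' lines."""
    pattern = re.compile(r"^processor\s*:\s*(\d+)", re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(cpuinfo_text)]

def is_consecutive(cpu_ids):
    """True only if the IDs are exactly 0, 1, ..., n-1."""
    return sorted(cpu_ids) == list(range(len(cpu_ids)))

# Sample input rebuilt from the listing above (IDs 0, 4, ..., 60).
sample = "\n".join(f"processor    : {i}" for i in range(0, 64, 4))

ids = parse_cpu_ids(sample)
print(ids)
print(is_consecutive(ids))
```

On a real node, the text could be read from /proc/cpuinfo instead of the sample string.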

When I start slurm (with debugLevel=5) on a node, it writes the following to the log:
[2012-12-13T08:24:17+00:00] debug:  cpuid is 16 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 20 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 24 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 28 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 32 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 36 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 40 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 44 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 48 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 52 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 56 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug:  cpuid is 60 (> 16), ignored
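
These twelve lines match what a simple bounds check against the CPU count (16) would produce for the IDs listed above. The sketch below is a guess modelled on the log text, not the slurm source; the function name is hypothetical, and the comparison is written as >= because ID 16 itself is reported as ignored.

```python
def ignored_cpu_ids(cpu_ids, cpu_count):
    """IDs that a 'cpuid >= cpu_count' check would discard.

    Assumption: this only imitates the behaviour suggested by the log
    messages; it is not taken from slurm.
    """
    return [cpu for cpu in cpu_ids if cpu >= cpu_count]

# CPU IDs reported by /proc/cpuinfo on this node.
online = list(range(0, 64, 4))

ignored = ignored_cpu_ids(online, len(online))
print(ignored)
print(len(ignored))
```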

The problem is that the mask generated by slurm contains processors that are not present in the system, so the call to sched_setaffinity() fails: it is invoked with the 0...01111 mask when it should be invoked with the 0...0001000100010001 mask (in binary).
I think this happens because the slurm source assumes that CPU IDs are consecutive.
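
To make the difference between the two masks concrete, here is a minimal Python sketch. It is not slurm code: the function names are invented for illustration, and it assumes that logical core i should map to the i-th online CPU ID in ascending order.

```python
def mask_from_cpu_ids(cpu_ids):
    """Set one bit per CPU ID."""
    mask = 0
    for cpu in cpu_ids:
        mask |= 1 << cpu
    return mask

def consecutive_mask(n):
    """Mask for n cores when IDs are assumed to be 0..n-1."""
    return (1 << n) - 1

def remapped_mask(core_indices, online_cpus):
    """Translate logical core indices to the real CPU IDs first."""
    ordered = sorted(online_cpus)
    return mask_from_cpu_ids(ordered[i] for i in core_indices)

online = list(range(0, 64, 4))  # IDs seen on this node

print(bin(consecutive_mask(4)))                 # 0b1111
print(bin(remapped_mask(range(4), online)))     # 0b1000100010001
```

With consecutive IDs the two functions give the same result, so the remapping only matters on nodes like this one.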



On 05/12/2012 14:10, Andrés Marín Díaz wrote:

Hello,

I am testing slurm 2.5-rc2 on a cluster with PS702 nodes (2 POWER7 processors with 8 cores each, SMT disabled) and I have a problem with the task/affinity plugin when it sets masks.

When a job requests 4 or fewer tasks, the masks are set correctly. However, when it requests 5 or more, slurm returns an error while setting the masks: cpu_bind=MASK - r09c3b3, task 5 5 [61232]: mask 0x1111111111111111 set FAILED

What could be causing this problem? Could it be a configuration error, or a slurm bug? The CPU IDs are not consecutive, because these nodes allow multithreading to be enabled and disabled without rebooting the node.

I have attached my configuration and logs.

Thanks a lot!

***********
SLURM.CONF:
***********
TaskPlugin=task/affinity
TaskPluginParam=Cores,Verbose

***********
JOB.OUTPUT:
***********
cpu_bind=MASK - r09c3b3, task  0  0 [61214]: mask 0xffff set
cpu_bind=MASK - r09c3b3, task  2  2 [61229]: mask 0x100 set
cpu_bind=MASK - r09c3b3, task  1  1 [61228]: mask 0x10 set
cpu_bind=MASK - r09c3b3, task  0  0 [61227]: mask 0x1 set
cpu_bind=MASK - r09c3b3, task  3  3 [61230]: mask 0x1000 set
cpu_bind=MASK - r09c3b3, task 4 4 [61231]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 5 5 [61232]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 6 6 [61233]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 7 7 [61234]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 9 9 [61236]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 8 8 [61235]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 11 11 [61238]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 13 13 [61240]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 10 10 [61237]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 15 15 [61242]: mask 0x1111111111111111 set FAILED
cpu_bind=MASK - r09c3b3, task 14 14 [61241]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 12 12 [61239]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
srun: error: r09c3b3: tasks 4-15: Exited with exit code 1
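
To read the masks in this output, a short Python helper (hypothetical, for illustration only) can turn each hexadecimal mask into the list of CPU IDs it selects:

```python
def cpus_in_mask(mask):
    """Return the CPU IDs whose bits are set in an affinity mask.

    Accepts an int or a hex string such as '0x1000'.
    """
    if isinstance(mask, str):
        mask = int(mask, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

# Masks copied from the job output above.
for m in ("0x1", "0x10", "0x100", "0x1000", "0xffff", "0x1111111111111111"):
    print(m, cpus_in_mask(m))
```

For example, 0x1000 selects CPU 12 only, 0xffff selects CPUs 0 to 15, and 0x1111111111111111 selects CPUs 0, 4, 8, ..., 60.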


**********
SLURM.LOG:
**********
[2012-12-05T12:07:35+00:00] task_slurmd_batch_request: 430973
[2012-12-05T12:07:35+00:00] task/affinity: job 430973 CPU input mask for node: 0xFFFF
[2012-12-05T12:07:35+00:00] task/affinity: job 430973 CPU final HW mask for node: 0xFFFF
[2012-12-05T12:07:35+00:00] Launching batch job 430973 for UID 50158
[2012-12-05T12:07:35+00:00] [430973] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] task affinity : enforcing 'verbose,cores' cpu bind method
[2012-12-05T12:07:35+00:00] lllp_distribution jobid [430973] binding: verbose,cores, dist 2
[2012-12-05T12:07:35+00:00] _task_layout_lllp_cyclic
[2012-12-05T12:07:35+00:00] _lllp_generate_cpu_bind jobid [430973]: verbose,mask_cpu, 0x0001,0x0010,0x0100,0x1000,0x0002,0x0004,0x0008,0x0020,0x0040,0x0080,0x0200,0x0400,0x0800,0x2000,0x4000,0x8000
[2012-12-05T12:07:35+00:00] launch task 430973.0 request from [email protected] (port 39175)
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:36+00:00] [430973.0] done with job
[2012-12-05T12:07:36+00:00] [430973] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2012-12-05T12:07:36+00:00] [430973] done with job






--
---------------------------------------------------------
Andrés Marín Díaz | e-mail: [email protected]
Centro de Supercomputación y Visualización de Madrid
www.cesvima.upm.es
www.twitter.com/cesvima | www.fb.com/cesvima
---------------------------------------------------------

