Hello,

I now have more information about the problem. Linux on POWER7 processors (with SMT disabled) has non-consecutive CPU IDs:

cat /proc/cpuinfo | grep processor
processor : 0
processor : 4
processor : 8
processor : 12
processor : 16
processor : 20
processor : 24
processor : 28
processor : 32
processor : 36
processor : 40
processor : 44
processor : 48
processor : 52
processor : 56
processor : 60

I think this is because these processors do not need a reboot to enable or disable SMT.
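
For anyone who wants to check their own nodes, here is a minimal Python sketch that extracts the CPU IDs from /proc/cpuinfo-style text and tests whether they are consecutive. The function names and the SAMPLE string are made up for this illustration; SAMPLE just reproduces the listing above instead of reading the real file.

```python
import re

def parse_cpu_ids(cpuinfo_text):
    """Return the sorted CPU IDs found on 'processor : N' lines."""
    found = re.findall(r"^processor\s*:\s*(\d+)\s*$", cpuinfo_text, flags=re.MULTILINE)
    return sorted(int(n) for n in found)

def is_consecutive(cpu_ids):
    """True when the IDs are exactly 0, 1, ..., len(cpu_ids) - 1."""
    return cpu_ids == list(range(len(cpu_ids)))

# Hypothetical sample: the same 16 IDs as in the listing above (0, 4, ..., 60).
SAMPLE = "\n".join(f"processor : {n}" for n in range(0, 64, 4))

cpu_ids = parse_cpu_ids(SAMPLE)
print(cpu_ids)
print("consecutive:", is_consecutive(cpu_ids))
```

On a real node the text would come from open("/proc/cpuinfo").read() rather than from SAMPLE.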

When I start slurm (with debugLevel=5) on a node, it writes the following to the log:

[2012-12-13T08:24:17+00:00] debug: cpuid is 16 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 20 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 24 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 28 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 32 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 36 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 40 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 44 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 48 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 52 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 56 (> 16), ignored
[2012-12-13T08:24:17+00:00] debug: cpuid is 60 (> 16), ignored

The problem is that the mask generated by slurm contains processors that are not present in the system, so the call to sched_setaffinity() fails: it is invoked with the mask 0...01111 when it should be invoked with the mask 0...0001000100010001 (in binary).
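
The difference between the two masks can be sketched in a few lines of Python. This is only an illustration of the remapping I mean, not slurm's actual code: remap_mask and ONLINE are hypothetical names, and ONLINE is assumed to hold the CPU IDs that /proc/cpuinfo reports on these nodes.

```python
def remap_mask(abstract_mask, online_cpu_ids):
    """Translate a mask over consecutive indices 0..n-1 into a mask over real CPU IDs.

    Bit i of abstract_mask selects the i-th online CPU, whose real ID is
    online_cpu_ids[i].
    """
    real_mask = 0
    for index, cpu_id in enumerate(online_cpu_ids):
        if (abstract_mask >> index) & 1:
            real_mask |= 1 << cpu_id
    return real_mask

# Assumption: the 16 online CPU IDs of a node with SMT disabled (0, 4, ..., 60).
ONLINE = list(range(0, 64, 4))

four_cores = remap_mask(0b1111, ONLINE)
all_cores = remap_mask(0xFFFF, ONLINE)
print(bin(four_cores))
print(hex(all_cores))
```

With consecutive CPU IDs the function returns its input unchanged, so a translation of this kind would only matter on nodes like these. Python's own os.sched_setaffinity() takes a set of CPU IDs rather than a bitmask, so there the list of real IDs could be passed directly.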

I think this is wrong because the slurm source assumes that CPU IDs are consecutive.

On 05/12/2012 14:10, Andrés Marín Díaz wrote:

Hello,

I am testing slurm 2.5-rc2 on a cluster with PS702 nodes (2 POWER7 processors with 8 cores each, SMT disabled) and I have a problem with the task/affinity plugin when it sets masks.

When a job requests 4 or fewer tasks, the masks are set correctly; however, when it requests 5 or more processors, slurm returns an error setting the masks:

cpu_bind=MASK - r09c3b3, task 5 5 [61232]: mask 0x1111111111111111 set FAILED

What could be causing this problem? Could it be a configuration error? A slurm bug? The CPU IDs are not consecutive, because these nodes allow multithreading to be enabled and disabled without rebooting the node.

I have attached my configuration and logs. Thanks a lot!

***********
SLURM.CONF:
***********
TaskPlugin=task/affinity
TaskPluginParam=Cores,Verbose

***********
JOB.OUTPUT:
***********
cpu_bind=MASK - r09c3b3, task 0 0 [61214]: mask 0xffff set
cpu_bind=MASK - r09c3b3, task 2 2 [61229]: mask 0x100 set
cpu_bind=MASK - r09c3b3, task 1 1 [61228]: mask 0x10 set
cpu_bind=MASK - r09c3b3, task 0 0 [61227]: mask 0x1 set
cpu_bind=MASK - r09c3b3, task 3 3 [61230]: mask 0x1000 set
cpu_bind=MASK - r09c3b3, task 4 4 [61231]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 5 5 [61232]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 6 6 [61233]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 7 7 [61234]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 9 9 [61236]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 8 8 [61235]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 11 11 [61238]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 13 13 [61240]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 10 10 [61237]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 15 15 [61242]: mask 0x1111111111111111 set FAILED
cpu_bind=MASK - r09c3b3, task 14 14 [61241]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
slurmd[r09c3b3]: Failed task affinity setup
cpu_bind=MASK - r09c3b3, task 12 12 [61239]: mask 0x1111111111111111 set FAILED
slurmd[r09c3b3]: Failed task affinity setup
srun: error: r09c3b3: tasks 4-15: Exited with exit code 1

**********
SLURM.LOG:
**********
[2012-12-05T12:07:35+00:00] task_slurmd_batch_request: 430973
[2012-12-05T12:07:35+00:00] task/affinity: job 430973 CPU input mask for node: 0xFFFF
[2012-12-05T12:07:35+00:00] task/affinity: job 430973 CPU final HW mask for node: 0xFFFF
[2012-12-05T12:07:35+00:00] Launching batch job 430973 for UID 50158
[2012-12-05T12:07:35+00:00] [430973] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] task affinity : enforcing 'verbose,cores' cpu bind method
[2012-12-05T12:07:35+00:00] lllp_distribution jobid [430973] binding: verbose,cores, dist 2
[2012-12-05T12:07:35+00:00] _task_layout_lllp_cyclic
[2012-12-05T12:07:35+00:00] _lllp_generate_cpu_bind jobid [430973]: verbose,mask_cpu, 0x0001,0x0010,0x0100,0x1000,0x0002,0x0004,0x0008,0x0020,0x0040,0x0080,0x0200,0x0400,0x0800,0x2000,0x4000,0x8000
[2012-12-05T12:07:35+00:00] launch task 430973.0 request from [email protected] (port 39175)
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Using sched_affinity for tasks
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:35+00:00] [430973.0] Failed task affinity setup
[2012-12-05T12:07:36+00:00] [430973.0] done with job
[2012-12-05T12:07:36+00:00] [430973] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2012-12-05T12:07:36+00:00] [430973] done with job
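
The hex masks in the output above are easier to compare once they are expanded into CPU ID lists. The following Python sketch does that, and also lists which CPUs named by a mask are absent from a given set of online IDs. The helper names and ONLINE are hypothetical; ONLINE is assumed to be the IDs 0, 4, ..., 60 that /proc/cpuinfo reports on these nodes. It only decodes masks and does not try to explain why a particular call failed.

```python
def mask_to_cpus(mask):
    """Return the CPU IDs whose bits are set in mask, lowest first."""
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus

def missing_cpus(mask, online_cpu_ids):
    """Return the CPU IDs named by mask that are not in online_cpu_ids."""
    online = set(online_cpu_ids)
    return [cpu for cpu in mask_to_cpus(mask) if cpu not in online]

# Assumption: online CPU IDs of a node with SMT disabled.
ONLINE = list(range(0, 64, 4))

# A few masks taken from the job output and slurmd log above.
for mask in (0xFFFF, 0x10, 0x0002, 0x1111111111111111):
    print(hex(mask), mask_to_cpus(mask), "not online:", missing_cpus(mask, ONLINE))
```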

--
---------------------------------------------------------
Andrés Marín Díaz | e-mail: [email protected]
Centro de Supercomputación y Visualización de Madrid
www.cesvima.upm.es
www.twitter.com/cesvima | www.fb.com/cesvima
---------------------------------------------------------
