If you get a chance, you might test this patch: https://github.com/open-mpi/ompi-release/pull/656
I think it will resolve the problem you mentioned, and it is small enough to go into 1.10.1.

Ralph

> On Oct 8, 2015, at 12:36 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Sorry, I think I confused one thing:
>
> On 10/08/2015 09:15 PM, marcin.krotkiewski wrote:
>>
>> For version 1.10.1rc1 and up the situation is a bit different: it seems that in many cases all cores are present in the cpuset, just that the binding does not take place in a lot of cases. Instead, processes are bound to all cores allocated by SLURM. In other scenarios, as discussed before, some cores are over/under-subscribed. Again, this is done quietly.
>
> The problem here was in fact a failure to run with an error message, not under/over-subscription. Sorry for this - I wanted to cover too much at the same time.
>
> Marcin
>
>>
>> In all cases what is needed is the --hetero-nodes switch. If I apply the patch that Gilles has posted, it seems to be enough for 1.10.1rc1 and up. The switch is not enough for earlier versions of OpenMPI, and one needs --map-by core in addition.
>>
>> Given all that, I think some sort of fix would be in order soon. I agree with Ralph that a simplified fix would be a good choice to address this issue quickly. As Ralph has already pointed out (or at least how I understood it :) this would essentially involve activating --hetero-nodes by default, and using --map-by core in cases where the architecture is not homogeneous. Uncovering the warning, so that the failure to bind is not silent, is the last piece of the puzzle. Maybe adding a sanity check to make sure all allocated resources are in use would be helpful - if not by default, then maybe with some flag.
>>
>> Does all this make sense?
>>
>> Again, thank you all for your help,
>>
>> Marcin
>>
>> On 10/07/2015 04:03 PM, Ralph Castain wrote:
>>> I’m a little nervous about this one, Gilles. It’s doing a lot more than just addressing the immediate issue, and I’m concerned about any potential side-effects that we don’t fully uncover prior to release.
>>>
>>> I’d suggest a two-pronged approach:
>>>
>>> 1. Use my alternative method for 1.10.1 to solve the immediate issue. It only affects this one, rather unusual, corner case that was reported here, so the impact can be easily contained and won’t affect anything else.
>>>
>>> 2. Push your proposed solution to master, where it can soak for a while and give us a chance to fully discover the secondary effects. Removing the unused and “not-allowed” cpus from the topology means a substantial scrub of the code base in a number of places, and your patch doesn’t really get them all. It’s going to take time to ensure everything is working correctly again.
>>>
>>> HTH
>>> Ralph
>>>
>>>> On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>
>>>> Jeff,
>>>>
>>>> there are quite a lot of changes, and I did not update master yet (this needs extra pairs of eyes to review...), so unless you want to make rc2 today and rc3 a week later, it is imho way safer to wait for v1.10.2.
>>>>
>>>> Ralph, any thoughts?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>> Is this something that needs to go into v1.10.1?
>>>>
>>>> If so, a PR needs to be filed ASAP.
We were supposed to make the next >>>> 1.10.1 RC yesterday, but slipped to today due to some last second patches. >>>> >>>> >>>> > On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet <gil...@rist.or.jp >>>> > <javascript:;>> wrote: >>>> > >>>> > Marcin, >>>> > >>>> > here is a patch for the master, hopefully it fixes all the issues we >>>> > discussed >>>> > i will make sure it applies fine vs latest 1.10 tarball from tomorrow >>>> > >>>> > Cheers, >>>> > >>>> > Gilles >>>> > >>>> > >>>> > On 10/6/2015 7:22 PM, marcin.krotkiewski wrote: >>>> >> Gilles, >>>> >> >>>> >> Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 >>>> >> - thank you. Eagerly waiting for the other patches, let me know and I >>>> >> will test them later this week. >>>> >> >>>> >> Marcin >>>> >> >>>> >> >>>> >> >>>> >> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote: >>>> >>> Marcin, >>>> >>> >>>> >>> my understanding is that in this case, patched v1.10.1rc1 is working >>>> >>> just fine. >>>> >>> am I right ? >>>> >>> >>>> >>> I prepared two patches >>>> >>> one to remove the warning when binding on one core if only one core is >>>> >>> available, >>>> >>> an other one to add a warning if the user asks a binding policy that >>>> >>> makes no sense with the required mapping policy >>>> >>> >>>> >>> I will finalize them tomorrow hopefully >>>> >>> >>>> >>> Cheers, >>>> >>> >>>> >>> Gilles >>>> >>> >>>> >>> On Tuesday, October 6, 2015, marcin.krotkiewski >>>> >>> <marcin.krotkiew...@gmail.com <javascript:;>> wrote: >>>> >>> Hi, Gilles >>>> >>>> you mentionned you had one failure with 1.10.1rc1 and -bind-to core >>>> >>>> could you please send the full details (script, allocation and output) >>>> >>>> in your slurm script, you can do >>>> >>>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep >>>> >>>> Cpus_allowed_list /proc/self/status >>>> >>>> before invoking mpirun >>>> >>>> >>>> >>> It was an interactive job allocated with >>>> >>> >>>> >>> salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0 >>>> >>> >>>> >>> The slurm environment is the following >>>> >>> >>>> >>> SLURM_JOBID=12714491 >>>> >>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5' >>>> >>> SLURM_JOB_ID=12714491 >>>> >>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]' >>>> >>> SLURM_JOB_NUM_NODES=7 >>>> >>> SLURM_JOB_PARTITION=normal >>>> >>> SLURM_MEM_PER_CPU=2048 >>>> >>> SLURM_NNODES=7 >>>> >>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]' >>>> >>> SLURM_NODE_ALIASES='(null)' >>>> >>> SLURM_NPROCS=32 >>>> >>> SLURM_NTASKS=32 >>>> >>> SLURM_SUBMIT_DIR=/cluster/home/marcink >>>> >>> SLURM_SUBMIT_HOST=login-0-1.local >>>> >>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5' >>>> >>> >>>> >>> The output of the command you asked for is >>>> >>> >>>> >>> 0: c1-2.local Cpus_allowed_list: 1-4,17-20 >>>> >>> 1: c1-4.local Cpus_allowed_list: 1,15,17,31 >>>> >>> 2: c1-8.local Cpus_allowed_list: 0,5,9,13-14,16,21,25,29-30 >>>> >>> 3: c1-13.local Cpus_allowed_list: 3-7,19-23 >>>> >>> 4: c1-16.local Cpus_allowed_list: 12-15,28-31 >>>> >>> 5: c1-23.local Cpus_allowed_list: 2-4,8,13-15,18-20,24,29-31 >>>> >>> 6: c1-26.local Cpus_allowed_list: 1,6,11,13,15,17,22,27,29,31 >>>> >>> >>>> >>> Running with command >>>> >>> >>>> >>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core >>>> >>> --report-bindings --map-by socket -np 32 ./affinity >>>> >>> >>>> >>> I have attached two output files: one for the original 1.10.1rc1, one >>>> >>> for the patched version. >>>> >>> >>>> >>> When I said 'failed in one case' I was not precise. 
I got an error on >>>> >>> node c1-8, which was the first one to have different number of MPI >>>> >>> processes on the two sockets. It would also fail on some later nodes, >>>> >>> just that because of the error we never got there. >>>> >>> >>>> >>> Let me know if you need more. >>>> >>> >>>> >>> Marcin >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>>> Cheers, >>>> >>>> >>>> >>>> Gilles >>>> >>>> >>>> >>>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote: >>>> >>>>> Hi, all, >>>> >>>>> >>>> >>>>> I played a bit more and it seems that the problem results from >>>> >>>>> >>>> >>>>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj() >>>> >>>>> >>>> >>>>> called in rmaps_base_binding.c / bind_downwards being wrong. I do >>>> >>>>> not know the reason, but I think I know when the problem happens (at >>>> >>>>> least on 1.10.1rc1). It seems that by default openmpi maps by >>>> >>>>> socket. The error happens when for a given compute node there is a >>>> >>>>> different number of cores used on each socket. Consider previously >>>> >>>>> studied case (the debug outputs I sent in last post). c1-8, which >>>> >>>>> was source of error, has 5 mpi processes assigned, and the cpuset is >>>> >>>>> the following: >>>> >>>>> >>>> >>>>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30 >>>> >>>>> >>>> >>>>> Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding >>>> >>>>> progresses correctly up to and including core 13 (see end of file >>>> >>>>> out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 >>>> >>>>> cores on socket 1. Error is thrown when core 14 should be bound - >>>> >>>>> extra core on socket 1 with no corresponding core on socket 0. At >>>> >>>>> that point the returned trg_obj points to the first core on the node >>>> >>>>> (os_index 0, socket 0). >>>> >>>>> >>>> >>>>> I have submitted a few other jobs and I always had an error in such >>>> >>>>> situation. 
Moreover, if I now use --map-by core instead of socket, >>>> >>>>> the error is gone, and I get my expected binding: >>>> >>>>> >>>> >>>>> rank 0 @ compute-1-2.local 1, 17, >>>> >>>>> rank 1 @ compute-1-2.local 2, 18, >>>> >>>>> rank 2 @ compute-1-2.local 3, 19, >>>> >>>>> rank 3 @ compute-1-2.local 4, 20, >>>> >>>>> rank 4 @ compute-1-4.local 1, 17, >>>> >>>>> rank 5 @ compute-1-4.local 15, 31, >>>> >>>>> rank 6 @ compute-1-8.local 0, 16, >>>> >>>>> rank 7 @ compute-1-8.local 5, 21, >>>> >>>>> rank 8 @ compute-1-8.local 9, 25, >>>> >>>>> rank 9 @ compute-1-8.local 13, 29, >>>> >>>>> rank 10 @ compute-1-8.local 14, 30, >>>> >>>>> rank 11 @ compute-1-13.local 3, 19, >>>> >>>>> rank 12 @ compute-1-13.local 4, 20, >>>> >>>>> rank 13 @ compute-1-13.local 5, 21, >>>> >>>>> rank 14 @ compute-1-13.local 6, 22, >>>> >>>>> rank 15 @ compute-1-13.local 7, 23, >>>> >>>>> rank 16 @ compute-1-16.local 12, 28, >>>> >>>>> rank 17 @ compute-1-16.local 13, 29, >>>> >>>>> rank 18 @ compute-1-16.local 14, 30, >>>> >>>>> rank 19 @ compute-1-16.local 15, 31, >>>> >>>>> rank 20 @ compute-1-23.local 2, 18, >>>> >>>>> rank 29 @ compute-1-26.local 11, 27, >>>> >>>>> rank 21 @ compute-1-23.local 3, 19, >>>> >>>>> rank 30 @ compute-1-26.local 13, 29, >>>> >>>>> rank 22 @ compute-1-23.local 4, 20, >>>> >>>>> rank 31 @ compute-1-26.local 15, 31, >>>> >>>>> rank 23 @ compute-1-23.local 8, 24, >>>> >>>>> rank 27 @ compute-1-26.local 1, 17, >>>> >>>>> rank 24 @ compute-1-23.local 13, 29, >>>> >>>>> rank 28 @ compute-1-26.local 6, 22, >>>> >>>>> rank 25 @ compute-1-23.local 14, 30, >>>> >>>>> rank 26 @ compute-1-23.local 15, 31, >>>> >>>>> >>>> >>>>> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and >>>> >>>>> 1.10.1rc1. However, there is still a difference in behavior between >>>> >>>>> 1.10.1rc1 and earlier versions. In the SLURM job described in last >>>> >>>>> post, 1.10.1rc1 fails to bind only in 1 case, while the earlier >>>> >>>>> versions fail in 21 out of 32 cases. You mentioned there was a bug >>>> >>>>> in hwloc. Not sure if it can explain the difference in behavior. >>>> >>>>> >>>> >>>>> Hope this helps to nail this down. >>>> >>>>> >>>> >>>>> Marcin >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote: >>>> >>>>>> Ralph, >>>> >>>>>> >>>> >>>>>> I suspect ompi tries to bind to threads outside the cpuset. >>>> >>>>>> this could be pretty similar to a previous issue when ompi tried to >>>> >>>>>> bind to cores outside the cpuset. >>>> >>>>>> /* when a core has more than one thread, would ompi assume all the >>>> >>>>>> threads are available if the core is available ? */ >>>> >>>>>> I will investigate this from tomorrow >>>> >>>>>> >>>> >>>>>> Cheers, >>>> >>>>>> >>>> >>>>>> Gilles >>>> >>>>>> >>>> >>>>>> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org >>>> >>>>>> <javascript:;>> wrote: >>>> >>>>>> Thanks - please go ahead and release that allocation as I’m not >>>> >>>>>> going to get to this immediately. I’ve got several hot irons in the >>>> >>>>>> fire right now, and I’m not sure when I’ll get a chance to track >>>> >>>>>> this down. >>>> >>>>>> >>>> >>>>>> Gilles or anyone else who might have time - feel free to take a >>>> >>>>>> gander and see if something pops out at you. 
>>>> >>>>>> >>>> >>>>>> Ralph >>>> >>>>>> >>>> >>>>>> >>>> >>>>>>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski < >>>> >>>>>>> <javascript:;>marcin.krotkiew...@gmail.com >>>> >>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and >>>> >>>>>>> executed >>>> >>>>>>> >>>> >>>>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes >>>> >>>>>>> --report-bindings --bind-to core -np 32 ./affinity >>>> >>>>>>> >>>> >>>>>>> In case of 1.10.rc1 I have also added :overload-allowed - output >>>> >>>>>>> in a separate file. This option did not make much difference for >>>> >>>>>>> 1.10.0, so I did not attach it here. >>>> >>>>>>> >>>> >>>>>>> First thing I noted for 1.10.0 are lines like >>>> >>>>>>> >>>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS >>>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP >>>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 >>>> >>>>>>> IS NOT BOUND >>>> >>>>>>> >>>> >>>>>>> with an empty BITMAP. >>>> >>>>>>> >>>> >>>>>>> The SLURM environment is >>>> >>>>>>> >>>> >>>>>>> set | grep SLURM >>>> >>>>>>> SLURM_JOBID=12714491 >>>> >>>>>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5' >>>> >>>>>>> SLURM_JOB_ID=12714491 >>>> >>>>>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]' >>>> >>>>>>> SLURM_JOB_NUM_NODES=7 >>>> >>>>>>> SLURM_JOB_PARTITION=normal >>>> >>>>>>> SLURM_MEM_PER_CPU=2048 >>>> >>>>>>> SLURM_NNODES=7 >>>> >>>>>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]' >>>> >>>>>>> SLURM_NODE_ALIASES='(null)' >>>> >>>>>>> SLURM_NPROCS=32 >>>> >>>>>>> SLURM_NTASKS=32 >>>> >>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink >>>> >>>>>>> SLURM_SUBMIT_HOST=login-0-1.local >>>> >>>>>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5' >>>> >>>>>>> >>>> >>>>>>> I have submitted an interactive job on screen for 120 hours now to >>>> >>>>>>> work with one example, and not change it for every post :) >>>> >>>>>>> >>>> >>>>>>> If you need anything else, let me know. I could introduce some >>>> >>>>>>> patch/printfs and recompile, if you need it. >>>> >>>>>>> >>>> >>>>>>> Marcin >>>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> On 10/03/2015 07:17 PM, Ralph Castain wrote: >>>> >>>>>>>> Rats - just realized I have no way to test this as none of the >>>> >>>>>>>> machines I can access are setup for cgroup-based multi-tenant. Is >>>> >>>>>>>> this a debug version of OMPI? If not, can you rebuild OMPI with >>>> >>>>>>>> —enable-debug? >>>> >>>>>>>> >>>> >>>>>>>> Then please run it with —mca rmaps_base_verbose 10 and pass along >>>> >>>>>>>> the output. >>>> >>>>>>>> >>>> >>>>>>>> Thanks >>>> >>>>>>>> Ralph >>>> >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain < >>>> >>>>>>>>> <mailto:r...@open-mpi.org>r...@open-mpi.org >>>> >>>>>>>>> <mailto:r...@open-mpi.org>> wrote: >>>> >>>>>>>>> >>>> >>>>>>>>> What version of slurm is this? I might try to debug it here. I’m >>>> >>>>>>>>> not sure where the problem lies just yet. >>>> >>>>>>>>> >>>> >>>>>>>>> >>>> >>>>>>>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski < >>>> >>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com >>>> >>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>>> >>>>>>>>>> >>>> >>>>>>>>>> Here is the output of lstopo. In short, (0,16) are core 0, >>>> >>>>>>>>>> (1,17) - core 1 etc. 
>>>> >>>>>>>>>> >>>> >>>>>>>>>> Machine (64GB) >>>> >>>>>>>>>> NUMANode L#0 (P#0 32GB) >>>> >>>>>>>>>> Socket L#0 + L3 L#0 (20MB) >>>> >>>>>>>>>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core >>>> >>>>>>>>>> L#0 >>>> >>>>>>>>>> PU L#0 (P#0) >>>> >>>>>>>>>> PU L#1 (P#16) >>>> >>>>>>>>>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core >>>> >>>>>>>>>> L#1 >>>> >>>>>>>>>> PU L#2 (P#1) >>>> >>>>>>>>>> PU L#3 (P#17) >>>> >>>>>>>>>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core >>>> >>>>>>>>>> L#2 >>>> >>>>>>>>>> PU L#4 (P#2) >>>> >>>>>>>>>> PU L#5 (P#18) >>>> >>>>>>>>>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core >>>> >>>>>>>>>> L#3 >>>> >>>>>>>>>> PU L#6 (P#3) >>>> >>>>>>>>>> PU L#7 (P#19) >>>> >>>>>>>>>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core >>>> >>>>>>>>>> L#4 >>>> >>>>>>>>>> PU L#8 (P#4) >>>> >>>>>>>>>> PU L#9 (P#20) >>>> >>>>>>>>>> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core >>>> >>>>>>>>>> L#5 >>>> >>>>>>>>>> PU L#10 (P#5) >>>> >>>>>>>>>> PU L#11 (P#21) >>>> >>>>>>>>>> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core >>>> >>>>>>>>>> L#6 >>>> >>>>>>>>>> PU L#12 (P#6) >>>> >>>>>>>>>> PU L#13 (P#22) >>>> >>>>>>>>>> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core >>>> >>>>>>>>>> L#7 >>>> >>>>>>>>>> PU L#14 (P#7) >>>> >>>>>>>>>> PU L#15 (P#23) >>>> >>>>>>>>>> HostBridge L#0 >>>> >>>>>>>>>> PCIBridge >>>> >>>>>>>>>> PCI 8086:1521 >>>> >>>>>>>>>> Net L#0 "eth0" >>>> >>>>>>>>>> PCI 8086:1521 >>>> >>>>>>>>>> Net L#1 "eth1" >>>> >>>>>>>>>> PCIBridge >>>> >>>>>>>>>> PCI 15b3:1003 >>>> >>>>>>>>>> Net L#2 "ib0" >>>> >>>>>>>>>> OpenFabrics L#3 "mlx4_0" >>>> >>>>>>>>>> PCIBridge >>>> >>>>>>>>>> PCI 102b:0532 >>>> >>>>>>>>>> PCI 8086:1d02 >>>> >>>>>>>>>> Block L#4 "sda" >>>> >>>>>>>>>> NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB) >>>> >>>>>>>>>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 >>>> >>>>>>>>>> PU L#16 (P#8) >>>> >>>>>>>>>> PU L#17 (P#24) >>>> >>>>>>>>>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 >>>> >>>>>>>>>> PU L#18 (P#9) >>>> >>>>>>>>>> PU L#19 (P#25) >>>> >>>>>>>>>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core >>>> >>>>>>>>>> L#10 >>>> >>>>>>>>>> PU L#20 (P#10) >>>> >>>>>>>>>> PU L#21 (P#26) >>>> >>>>>>>>>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core >>>> >>>>>>>>>> L#11 >>>> >>>>>>>>>> PU L#22 (P#11) >>>> >>>>>>>>>> PU L#23 (P#27) >>>> >>>>>>>>>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core >>>> >>>>>>>>>> L#12 >>>> >>>>>>>>>> PU L#24 (P#12) >>>> >>>>>>>>>> PU L#25 (P#28) >>>> >>>>>>>>>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core >>>> >>>>>>>>>> L#13 >>>> >>>>>>>>>> PU L#26 (P#13) >>>> >>>>>>>>>> PU L#27 (P#29) >>>> >>>>>>>>>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core >>>> >>>>>>>>>> L#14 >>>> >>>>>>>>>> PU L#28 (P#14) >>>> >>>>>>>>>> PU L#29 (P#30) >>>> >>>>>>>>>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core >>>> >>>>>>>>>> L#15 >>>> >>>>>>>>>> PU L#30 (P#15) >>>> >>>>>>>>>> PU L#31 (P#31) >>>> >>>>>>>>>> >>>> >>>>>>>>>> >>>> >>>>>>>>>> >>>> >>>>>>>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote: >>>> >>>>>>>>>>> Maybe I’m just misreading your HT map - that slurm nodelist >>>> >>>>>>>>>>> syntax is a new one to me, but they tend to change things >>>> >>>>>>>>>>> around. Could you run lstopo on one of those compute nodes and >>>> >>>>>>>>>>> send the output? 
>>>> >>>>>>>>>>> >>>> >>>>>>>>>>> I’m just suspicious because I’m not seeing a clear pairing of >>>> >>>>>>>>>>> HT numbers in your output, but HT numbering is BIOS-specific >>>> >>>>>>>>>>> and I may just not be understanding your particular pattern. >>>> >>>>>>>>>>> Our error message is clearly indicating that we are seeing >>>> >>>>>>>>>>> individual HTs (and not complete cores) assigned, and I don’t >>>> >>>>>>>>>>> know the source of that confusion. >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> >>>> >>>>>>>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski < >>>> >>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com >>>> >>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote: >>>> >>>>>>>>>>>>> If mpirun isn’t trying to do any binding, then you will of >>>> >>>>>>>>>>>>> course get the right mapping as we’ll just inherit whatever >>>> >>>>>>>>>>>>> we received. >>>> >>>>>>>>>>>> Yes. I meant that whatever you received (what SLURM gives) is >>>> >>>>>>>>>>>> a correct cpu map and assigns _whole_ CPUs, not a single HT >>>> >>>>>>>>>>>> to MPI processes. In the case mentioned earlier openmpi >>>> >>>>>>>>>>>> should start 6 tasks on c1-30. If HT would be treated as >>>> >>>>>>>>>>>> separate and independent cores, sched_getaffinity of an MPI >>>> >>>>>>>>>>>> process started on c1-30 would return a map with 6 entries >>>> >>>>>>>>>>>> only. In my case it returns a map >>>> >>>>>>>>>>>> with 12 entries - 2 for each >>>> >>>>>>>>>>>> core. So one process is in fact allocated both HTs, not only >>>> >>>>>>>>>>>> one. Is what I'm saying correct? >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>>> Looking at your output, it’s pretty clear that you are >>>> >>>>>>>>>>>>> getting independent HTs assigned and not full cores. >>>> >>>>>>>>>>>> How do you mean? Is the above understanding wrong? I would >>>> >>>>>>>>>>>> expect that on c1-30 with --bind-to core openmpi should bind >>>> >>>>>>>>>>>> to logical cores 0 and 16 (rank 0), 1 and 17 (rank 2) and so >>>> >>>>>>>>>>>> on. All those logical cores are available in >>>> >>>>>>>>>>>> sched_getaffinity map, and there is twice as many logical >>>> >>>>>>>>>>>> cores as there are MPI processes started on the node. >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>>> My guess is that something in slurm has changed such that it >>>> >>>>>>>>>>>>> detects that HT has been enabled, and then begins treating >>>> >>>>>>>>>>>>> the HTs as completely independent cpus. >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> Try changing “-bind-to core” to “-bind-to hwthread >>>> >>>>>>>>>>>>> -use-hwthread-cpus” and see if that works >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>> I have and the binding is wrong. For example, I got this >>>> >>>>>>>>>>>> output >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> rank 0 @ compute-1-30.local 0, >>>> >>>>>>>>>>>> rank 1 @ compute-1-30.local 16, >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Which means that two ranks have been bound to the same >>>> >>>>>>>>>>>> physical core (logical cores 0 and 16 are two HTs of the same >>>> >>>>>>>>>>>> core). If I use --bind-to core, I get the following correct >>>> >>>>>>>>>>>> binding >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> rank 0 @ compute-1-30.local 0, 16, >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> The problem is many other ranks get bad binding with 'rank >>>> >>>>>>>>>>>> XXX is not bound (or bound to all available processors)' >>>> >>>>>>>>>>>> warning. 
>>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> But I think I was not entirely correct saying that 1.10.1rc1 >>>> >>>>>>>>>>>> did not fix things. It still might have improved something, >>>> >>>>>>>>>>>> but not everything. Consider this job: >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6' >>>> >>>>>>>>>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]' >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> If I run 32 tasks as follows (with 1.10.1rc1) >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 >>>> >>>>>>>>>>>> ./affinity >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> I get the following error: >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>>> >>>>>>>>>>>> A request was made to bind to that would result in binding >>>> >>>>>>>>>>>> more >>>> >>>>>>>>>>>> processes than cpus on a resource: >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Bind to: CORE >>>> >>>>>>>>>>>> Node: c9-31 >>>> >>>>>>>>>>>> #processes: 2 >>>> >>>>>>>>>>>> #cpus: 1 >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> You can override this protection by adding the >>>> >>>>>>>>>>>> "overload-allowed" >>>> >>>>>>>>>>>> option to your binding directive. >>>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> If I now use --bind-to core:overload-allowed, then openmpi >>>> >>>>>>>>>>>> starts and _most_ of the threads are bound correctly (i.e., >>>> >>>>>>>>>>>> map contains two logical cores in ALL cases), except this >>>> >>>>>>>>>>>> case that required the overload flag: >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> rank 15 @ compute-9-31.local 1, 17, >>>> >>>>>>>>>>>> rank 16 @ compute-9-31.local 11, 27, >>>> >>>>>>>>>>>> rank 17 @ compute-9-31.local 2, 18, >>>> >>>>>>>>>>>> rank 18 @ compute-9-31.local 12, 28, >>>> >>>>>>>>>>>> rank 19 @ compute-9-31.local 1, 17, >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Note pair 1,17 is used twice. The original SLURM delivered >>>> >>>>>>>>>>>> map (no binding) on this node is >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>>> >>>>>>>>>>>> 28, 29, >>>> >>>>>>>>>>>> rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>>> >>>>>>>>>>>> 28, 29, >>>> >>>>>>>>>>>> rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>>> >>>>>>>>>>>> 28, 29, >>>> >>>>>>>>>>>> rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>>> >>>>>>>>>>>> 28, 29, >>>> >>>>>>>>>>>> rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>>> >>>>>>>>>>>> 28, 29, >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Why does openmpi use cores (1,17) twice instead of using core >>>> >>>>>>>>>>>> (13,29)? Clearly, the original SLURM-delivered map has 5 CPUs >>>> >>>>>>>>>>>> included, enough for 5 MPI processes. >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Cheers, >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Marcin >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski < >>>> >>>>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com >>>> >>>>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote: >>>> >>>>>>>>>>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm >>>> >>>>>>>>>>>>>>> may be treating HTs as “cores” - i.e., as independent >>>> >>>>>>>>>>>>>>> cpus. Any chance that is true? 
>>>> >>>>>>>>>>>>>> Not to the best of my knowledge, and at least not intentionally. SLURM starts as many processes as there are physical cores, not threads. To verify this, consider this test case:

> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27849.php
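The binding reports quoted throughout this thread come from a small test program, ./affinity, whose source is not included in the messages. For reference only - this is a sketch of what such a reporter could look like, not the original code - a minimal version can be written with MPI plus Linux's sched_getaffinity(), printing the same "rank N @ host cpu-list" style of line seen above:

/* affinity.c - minimal sketch of a binding reporter (an assumption, not
 * the original test program from this thread). Each rank prints its host
 * and the CPUs it is allowed to run on, e.g. "rank 0 @ c1-2.local 1, 17,".
 * Build with: mpicc -o affinity affinity.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, hostlen, cpu;
    char host[MPI_MAX_PROCESSOR_NAME];
    char cpus[4096] = "";
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &hostlen);

    /* Query the CPUs this process may run on (its binding / cpuset). */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
            if (CPU_ISSET(cpu, &mask)) {
                char tmp[16];
                snprintf(tmp, sizeof(tmp), "%d, ", cpu);
                strncat(cpus, tmp, sizeof(cpus) - strlen(cpus) - 1);
            }
        }
    }

    printf("rank %d @ %s %s\n", rank, host, cpus);

    MPI_Finalize();
    return 0;
}

Run it under the SLURM allocation exactly as in the thread, e.g. "mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity". A rank that prints every CPU of its node indicates that binding did not happen; a rank that prints a single hardware thread indicates it was bound to an HT rather than a full core.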
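Much of the discussion above is about whether a rank ends up owning whole cores (both hardware threads) or individual HTs. A companion sketch, assuming the hwloc 1.x C API is available (hwloc is the library Open MPI uses for topology), compares the current process binding against each core's cpuset and flags cores that are only partially covered:

/* core_check.c - sketch (assumes hwloc 1.x) that reports whether the
 * calling process is bound to whole cores or to individual hwthreads.
 * Build with: cc core_check.c -lhwloc
 */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t bound = hwloc_bitmap_alloc();
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Current CPU binding of this process. */
    hwloc_get_cpubind(topo, bound, HWLOC_CPUBIND_PROCESS);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        if (!hwloc_bitmap_intersects(bound, core->cpuset))
            continue;                  /* this core is not used at all */
        if (hwloc_bitmap_isincluded(core->cpuset, bound))
            printf("core %u: fully bound (all hwthreads)\n", core->os_index);
        else
            printf("core %u: partially bound (single hwthread?)\n", core->os_index);
    }

    hwloc_bitmap_free(bound);
    hwloc_topology_destroy(topo);
    return 0;
}

In the examples above, a rank bound to "0, 16," would report core 0 as fully bound, while a rank bound to only "0," (the -bind-to hwthread case) would report the same core as partially bound.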