I agree that makes sense. I’ve been somewhat limited in my ability to work on this lately, and I think Gilles has been in a similar situation. I’ll try to create a 1.10 patch later today. Depending on how minimal I can make it, we may still be able to put it into 1.10.1, though the window on that is already closing.
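(For anyone following along: the "./affinity" test program that Marcin runs throughout the quoted thread below is not included in the thread itself. A minimal sketch of an equivalent checker - a hypothetical reconstruction assuming Linux's sched_getaffinity() and any MPI implementation - that prints each rank's host and allowed CPUs in the same "rank N @ host cpulist" format might look like this:)

/*
 * affinity.c - minimal affinity checker, similar in spirit to the
 * "./affinity" program referenced in the thread below (the original
 * source is not part of the thread; this is a hypothetical sketch).
 * It prints "rank N @ hostname <cpu list>" from sched_getaffinity(),
 * so the mask each rank actually got can be compared against the
 * SLURM allocation (grep Cpus_allowed_list /proc/self/status) and
 * against mpirun's --report-bindings output.
 *
 * Build:  mpicc -o affinity affinity.c
 * Run:    mpirun --report-bindings -np 32 ./affinity
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Collect the allowed logical CPU ids into a printable list. */
    char buf[1024];
    int pos = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE && pos < (int)sizeof(buf) - 16; cpu++) {
        if (CPU_ISSET(cpu, &mask))
            pos += snprintf(buf + pos, sizeof(buf) - pos, "%d, ", cpu);
    }

    printf("rank %d @ %s %s\n", rank, host, buf);

    MPI_Finalize();
    return 0;
}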
> On Oct 8, 2015, at 12:15 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Dear Ralph, Gilles, and Jeff,
>
> Thanks a lot for your effort. Understanding this problem has been a very interesting exercise for me that let me understand Open MPI much better (I think :).
>
> I have given it all a little more thought, and done some more tests on our production system, and I think that this is not exactly a corner case. First of all, I suspect all of this holds for other job-scheduling systems besides SLURM (to be thought about). Moreover, on our system a rather common usage scenario involves a SLURM job allocation such as
>
> salloc --ntasks=32
>
> which results in very fragmented allocations - that is specific to the type of problems users run on this cluster, but it is a fact. Users then run the job using
>
> mpirun ./program
>
> For versions up to 1.10.0, with uneven resource allocation among compute nodes, the default binding options used in Open MPI in most cases result in some CPU cores not being present in the used cpuset at all, while others are over- or under-subscribed. This certainly is job-specific and depends on how fragmented the SLURM allocations are, but to give a scary number: in one case I started 512 tasks (1 per core), and Open MPI's binding created a cpuset that used only 271 cores, some of them over- or under-subscribed on top of that. Effectively, the user gets 50% of what he asked for. As already discussed, this happens quietly - the user has no idea.
>
> For version 1.10.1rc1 and up the situation is a bit different: it seems that in many cases all cores are present in the cpuset, but the binding does not take place in a lot of cases. Instead, processes are bound to all cores allocated by SLURM. In other scenarios, as discussed before, some cores are over- or under-subscribed. Again, this happens quietly.
>
> In all cases what is needed is the --hetero-nodes switch. If I apply the patch that Gilles has posted, it seems to be enough for 1.10.1rc1 and up. The switch is not enough for earlier versions of Open MPI, and one needs --map-by core in addition.
>
> Given all that, I think some sort of fix would be in order soon. I agree with Ralph that, to address this issue quickly, a simplified fix would be a good choice. As Ralph has already pointed out (or at least as I understood it :), this would essentially involve activating --hetero-nodes by default and using --map-by core in cases where the architecture is not homogeneous. Surfacing the warning so that the failure to bind is not silent is the last piece of the puzzle. Maybe adding a sanity check to make sure all allocated resources are in use would be helpful - if not by default, then perhaps behind some flag.
>
> Does all this make sense?
>
> Again, thank you all for your help,
>
> Marcin
>
> On 10/07/2015 04:03 PM, Ralph Castain wrote:
>> I’m a little nervous about this one, Gilles. It’s doing a lot more than just addressing the immediate issue, and I’m concerned about any potential side effects that we don’t fully uncover prior to release.
>>
>> I’d suggest a two-pronged approach:
>>
>> 1. Use my alternative method for 1.10.1 to solve the immediate issue. It only affects this one, rather unusual, corner case that was reported here, so the impact can be easily contained and won’t affect anything else.
>>
>> 2. Push your proposed solution to the master, where it can soak for a while and give us a chance to fully discover the secondary effects. Removing the unused and “not-allowed” cpus from the topology means a substantial scrub of the code base in a number of places, and your patch doesn’t really get them all. It’s going to take time to ensure everything is working correctly again.
>>
>> HTH
>> Ralph
>>
>>> On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Jeff,
>>>
>>> there are quite a lot of changes, and I did not update master yet (need extra pairs of eyes to review this...), so unless you want to make rc2 today and rc3 a week later, it is IMHO way safer to wait for v1.10.2.
>>>
>>> Ralph, any thoughts?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> Is this something that needs to go into v1.10.1?
>>>
>>> If so, a PR needs to be filed ASAP. We were supposed to make the next 1.10.1 RC yesterday, but slipped to today due to some last-second patches.
>>>
>>> > On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>> >
>>> > Marcin,
>>> >
>>> > here is a patch for the master; hopefully it fixes all the issues we discussed. I will make sure it applies fine vs the latest 1.10 tarball from tomorrow.
>>> >
>>> > Cheers,
>>> >
>>> > Gilles
>>> >
>>> > On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
>>> >> Gilles,
>>> >>
>>> >> Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 - thank you. Eagerly waiting for the other patches; let me know and I will test them later this week.
>>> >>
>>> >> Marcin
>>> >>
>>> >> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
>>> >>> Marcin,
>>> >>>
>>> >>> my understanding is that in this case, patched v1.10.1rc1 is working just fine. Am I right?
>>> >>> >>> >>> I prepared two patches >>> >>> one to remove the warning when binding on one core if only one core is >>> >>> available, >>> >>> an other one to add a warning if the user asks a binding policy that >>> >>> makes no sense with the required mapping policy >>> >>> >>> >>> I will finalize them tomorrow hopefully >>> >>> >>> >>> Cheers, >>> >>> >>> >>> Gilles >>> >>> >>> >>> On Tuesday, October 6, 2015, marcin.krotkiewski >>> >>> <marcin.krotkiew...@gmail.com <javascript:;>> wrote: >>> >>> Hi, Gilles >>> >>>> you mentionned you had one failure with 1.10.1rc1 and -bind-to core >>> >>>> could you please send the full details (script, allocation and output) >>> >>>> in your slurm script, you can do >>> >>>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep >>> >>>> Cpus_allowed_list /proc/self/status >>> >>>> before invoking mpirun >>> >>>> >>> >>> It was an interactive job allocated with >>> >>> >>> >>> salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0 >>> >>> >>> >>> The slurm environment is the following >>> >>> >>> >>> SLURM_JOBID=12714491 >>> >>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5' >>> >>> SLURM_JOB_ID=12714491 >>> >>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]' >>> >>> SLURM_JOB_NUM_NODES=7 >>> >>> SLURM_JOB_PARTITION=normal >>> >>> SLURM_MEM_PER_CPU=2048 >>> >>> SLURM_NNODES=7 >>> >>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]' >>> >>> SLURM_NODE_ALIASES='(null)' >>> >>> SLURM_NPROCS=32 >>> >>> SLURM_NTASKS=32 >>> >>> SLURM_SUBMIT_DIR=/cluster/home/marcink >>> >>> SLURM_SUBMIT_HOST=login-0-1.local >>> >>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5' >>> >>> >>> >>> The output of the command you asked for is >>> >>> >>> >>> 0: c1-2.local Cpus_allowed_list: 1-4,17-20 >>> >>> 1: c1-4.local Cpus_allowed_list: 1,15,17,31 >>> >>> 2: c1-8.local Cpus_allowed_list: 0,5,9,13-14,16,21,25,29-30 >>> >>> 3: c1-13.local Cpus_allowed_list: 3-7,19-23 >>> >>> 4: c1-16.local Cpus_allowed_list: 12-15,28-31 >>> >>> 5: c1-23.local Cpus_allowed_list: 2-4,8,13-15,18-20,24,29-31 >>> >>> 6: c1-26.local Cpus_allowed_list: 1,6,11,13,15,17,22,27,29,31 >>> >>> >>> >>> Running with command >>> >>> >>> >>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core >>> >>> --report-bindings --map-by socket -np 32 ./affinity >>> >>> >>> >>> I have attached two output files: one for the original 1.10.1rc1, one >>> >>> for the patched version. >>> >>> >>> >>> When I said 'failed in one case' I was not precise. I got an error on >>> >>> node c1-8, which was the first one to have different number of MPI >>> >>> processes on the two sockets. It would also fail on some later nodes, >>> >>> just that because of the error we never got there. >>> >>> >>> >>> Let me know if you need more. >>> >>> >>> >>> Marcin >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>>> Cheers, >>> >>>> >>> >>>> Gilles >>> >>>> >>> >>>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote: >>> >>>>> Hi, all, >>> >>>>> >>> >>>>> I played a bit more and it seems that the problem results from >>> >>>>> >>> >>>>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj() >>> >>>>> >>> >>>>> called in rmaps_base_binding.c / bind_downwards being wrong. I do not >>> >>>>> know the reason, but I think I know when the problem happens (at >>> >>>>> least on 1.10.1rc1). It seems that by default openmpi maps by socket. >>> >>>>> The error happens when for a given compute node there is a different >>> >>>>> number of cores used on each socket. 
Consider previously studied case >>> >>>>> (the debug outputs I sent in last post). c1-8, which was source of >>> >>>>> error, has 5 mpi processes assigned, and the cpuset is the following: >>> >>>>> >>> >>>>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30 >>> >>>>> >>> >>>>> Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding >>> >>>>> progresses correctly up to and including core 13 (see end of file >>> >>>>> out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 >>> >>>>> cores on socket 1. Error is thrown when core 14 should be bound - >>> >>>>> extra core on socket 1 with no corresponding core on socket 0. At >>> >>>>> that point the returned trg_obj points to the first core on the node >>> >>>>> (os_index 0, socket 0). >>> >>>>> >>> >>>>> I have submitted a few other jobs and I always had an error in such >>> >>>>> situation. Moreover, if I now use --map-by core instead of socket, >>> >>>>> the error is gone, and I get my expected binding: >>> >>>>> >>> >>>>> rank 0 @ compute-1-2.local 1, 17, >>> >>>>> rank 1 @ compute-1-2.local 2, 18, >>> >>>>> rank 2 @ compute-1-2.local 3, 19, >>> >>>>> rank 3 @ compute-1-2.local 4, 20, >>> >>>>> rank 4 @ compute-1-4.local 1, 17, >>> >>>>> rank 5 @ compute-1-4.local 15, 31, >>> >>>>> rank 6 @ compute-1-8.local 0, 16, >>> >>>>> rank 7 @ compute-1-8.local 5, 21, >>> >>>>> rank 8 @ compute-1-8.local 9, 25, >>> >>>>> rank 9 @ compute-1-8.local 13, 29, >>> >>>>> rank 10 @ compute-1-8.local 14, 30, >>> >>>>> rank 11 @ compute-1-13.local 3, 19, >>> >>>>> rank 12 @ compute-1-13.local 4, 20, >>> >>>>> rank 13 @ compute-1-13.local 5, 21, >>> >>>>> rank 14 @ compute-1-13.local 6, 22, >>> >>>>> rank 15 @ compute-1-13.local 7, 23, >>> >>>>> rank 16 @ compute-1-16.local 12, 28, >>> >>>>> rank 17 @ compute-1-16.local 13, 29, >>> >>>>> rank 18 @ compute-1-16.local 14, 30, >>> >>>>> rank 19 @ compute-1-16.local 15, 31, >>> >>>>> rank 20 @ compute-1-23.local 2, 18, >>> >>>>> rank 29 @ compute-1-26.local 11, 27, >>> >>>>> rank 21 @ compute-1-23.local 3, 19, >>> >>>>> rank 30 @ compute-1-26.local 13, 29, >>> >>>>> rank 22 @ compute-1-23.local 4, 20, >>> >>>>> rank 31 @ compute-1-26.local 15, 31, >>> >>>>> rank 23 @ compute-1-23.local 8, 24, >>> >>>>> rank 27 @ compute-1-26.local 1, 17, >>> >>>>> rank 24 @ compute-1-23.local 13, 29, >>> >>>>> rank 28 @ compute-1-26.local 6, 22, >>> >>>>> rank 25 @ compute-1-23.local 14, 30, >>> >>>>> rank 26 @ compute-1-23.local 15, 31, >>> >>>>> >>> >>>>> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and >>> >>>>> 1.10.1rc1. However, there is still a difference in behavior between >>> >>>>> 1.10.1rc1 and earlier versions. In the SLURM job described in last >>> >>>>> post, 1.10.1rc1 fails to bind only in 1 case, while the earlier >>> >>>>> versions fail in 21 out of 32 cases. You mentioned there was a bug in >>> >>>>> hwloc. Not sure if it can explain the difference in behavior. >>> >>>>> >>> >>>>> Hope this helps to nail this down. >>> >>>>> >>> >>>>> Marcin >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote: >>> >>>>>> Ralph, >>> >>>>>> >>> >>>>>> I suspect ompi tries to bind to threads outside the cpuset. >>> >>>>>> this could be pretty similar to a previous issue when ompi tried to >>> >>>>>> bind to cores outside the cpuset. >>> >>>>>> /* when a core has more than one thread, would ompi assume all the >>> >>>>>> threads are available if the core is available ? 
*/ >>> >>>>>> I will investigate this from tomorrow >>> >>>>>> >>> >>>>>> Cheers, >>> >>>>>> >>> >>>>>> Gilles >>> >>>>>> >>> >>>>>> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org >>> >>>>>> <javascript:;>> wrote: >>> >>>>>> Thanks - please go ahead and release that allocation as I’m not >>> >>>>>> going to get to this immediately. I’ve got several hot irons in the >>> >>>>>> fire right now, and I’m not sure when I’ll get a chance to track >>> >>>>>> this down. >>> >>>>>> >>> >>>>>> Gilles or anyone else who might have time - feel free to take a >>> >>>>>> gander and see if something pops out at you. >>> >>>>>> >>> >>>>>> Ralph >>> >>>>>> >>> >>>>>> >>> >>>>>>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski >>> >>>>>>> <marcin.krotkiew...@gmail.com <javascript:;>> wrote: >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and >>> >>>>>>> executed >>> >>>>>>> >>> >>>>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings >>> >>>>>>> --bind-to core -np 32 ./affinity >>> >>>>>>> >>> >>>>>>> In case of 1.10.rc1 I have also added :overload-allowed - output in >>> >>>>>>> a separate file. This option did not make much difference for >>> >>>>>>> 1.10.0, so I did not attach it here. >>> >>>>>>> >>> >>>>>>> First thing I noted for 1.10.0 are lines like >>> >>>>>>> >>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS >>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP >>> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 >>> >>>>>>> IS NOT BOUND >>> >>>>>>> >>> >>>>>>> with an empty BITMAP. >>> >>>>>>> >>> >>>>>>> The SLURM environment is >>> >>>>>>> >>> >>>>>>> set | grep SLURM >>> >>>>>>> SLURM_JOBID=12714491 >>> >>>>>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5' >>> >>>>>>> SLURM_JOB_ID=12714491 >>> >>>>>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]' >>> >>>>>>> SLURM_JOB_NUM_NODES=7 >>> >>>>>>> SLURM_JOB_PARTITION=normal >>> >>>>>>> SLURM_MEM_PER_CPU=2048 >>> >>>>>>> SLURM_NNODES=7 >>> >>>>>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]' >>> >>>>>>> SLURM_NODE_ALIASES='(null)' >>> >>>>>>> SLURM_NPROCS=32 >>> >>>>>>> SLURM_NTASKS=32 >>> >>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink >>> >>>>>>> SLURM_SUBMIT_HOST=login-0-1.local >>> >>>>>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5' >>> >>>>>>> >>> >>>>>>> I have submitted an interactive job on screen for 120 hours now to >>> >>>>>>> work with one example, and not change it for every post :) >>> >>>>>>> >>> >>>>>>> If you need anything else, let me know. I could introduce some >>> >>>>>>> patch/printfs and recompile, if you need it. >>> >>>>>>> >>> >>>>>>> Marcin >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> On 10/03/2015 07:17 PM, Ralph Castain wrote: >>> >>>>>>>> Rats - just realized I have no way to test this as none of the >>> >>>>>>>> machines I can access are setup for cgroup-based multi-tenant. Is >>> >>>>>>>> this a debug version of OMPI? If not, can you rebuild OMPI with >>> >>>>>>>> —enable-debug? >>> >>>>>>>> >>> >>>>>>>> Then please run it with —mca rmaps_base_verbose 10 and pass along >>> >>>>>>>> the output. >>> >>>>>>>> >>> >>>>>>>> Thanks >>> >>>>>>>> Ralph >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain < >>> >>>>>>>>> <javascript:;>r...@open-mpi.org <mailto:r...@open-mpi.org>> wrote: >>> >>>>>>>>> >>> >>>>>>>>> What version of slurm is this? I might try to debug it here. I’m >>> >>>>>>>>> not sure where the problem lies just yet. 
>>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski < >>> >>>>>>>>>> <javascript:;>marcin.krotkiew...@gmail.com >>> >>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>> >>>>>>>>>> >>> >>>>>>>>>> Here is the output of lstopo. In short, (0,16) are core 0, >>> >>>>>>>>>> (1,17) - core 1 etc. >>> >>>>>>>>>> >>> >>>>>>>>>> Machine (64GB) >>> >>>>>>>>>> NUMANode L#0 (P#0 32GB) >>> >>>>>>>>>> Socket L#0 + L3 L#0 (20MB) >>> >>>>>>>>>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 >>> >>>>>>>>>> PU L#0 (P#0) >>> >>>>>>>>>> PU L#1 (P#16) >>> >>>>>>>>>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 >>> >>>>>>>>>> PU L#2 (P#1) >>> >>>>>>>>>> PU L#3 (P#17) >>> >>>>>>>>>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 >>> >>>>>>>>>> PU L#4 (P#2) >>> >>>>>>>>>> PU L#5 (P#18) >>> >>>>>>>>>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 >>> >>>>>>>>>> PU L#6 (P#3) >>> >>>>>>>>>> PU L#7 (P#19) >>> >>>>>>>>>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 >>> >>>>>>>>>> PU L#8 (P#4) >>> >>>>>>>>>> PU L#9 (P#20) >>> >>>>>>>>>> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 >>> >>>>>>>>>> PU L#10 (P#5) >>> >>>>>>>>>> PU L#11 (P#21) >>> >>>>>>>>>> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 >>> >>>>>>>>>> PU L#12 (P#6) >>> >>>>>>>>>> PU L#13 (P#22) >>> >>>>>>>>>> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 >>> >>>>>>>>>> PU L#14 (P#7) >>> >>>>>>>>>> PU L#15 (P#23) >>> >>>>>>>>>> HostBridge L#0 >>> >>>>>>>>>> PCIBridge >>> >>>>>>>>>> PCI 8086:1521 >>> >>>>>>>>>> Net L#0 "eth0" >>> >>>>>>>>>> PCI 8086:1521 >>> >>>>>>>>>> Net L#1 "eth1" >>> >>>>>>>>>> PCIBridge >>> >>>>>>>>>> PCI 15b3:1003 >>> >>>>>>>>>> Net L#2 "ib0" >>> >>>>>>>>>> OpenFabrics L#3 "mlx4_0" >>> >>>>>>>>>> PCIBridge >>> >>>>>>>>>> PCI 102b:0532 >>> >>>>>>>>>> PCI 8086:1d02 >>> >>>>>>>>>> Block L#4 "sda" >>> >>>>>>>>>> NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB) >>> >>>>>>>>>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 >>> >>>>>>>>>> PU L#16 (P#8) >>> >>>>>>>>>> PU L#17 (P#24) >>> >>>>>>>>>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 >>> >>>>>>>>>> PU L#18 (P#9) >>> >>>>>>>>>> PU L#19 (P#25) >>> >>>>>>>>>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core >>> >>>>>>>>>> L#10 >>> >>>>>>>>>> PU L#20 (P#10) >>> >>>>>>>>>> PU L#21 (P#26) >>> >>>>>>>>>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core >>> >>>>>>>>>> L#11 >>> >>>>>>>>>> PU L#22 (P#11) >>> >>>>>>>>>> PU L#23 (P#27) >>> >>>>>>>>>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core >>> >>>>>>>>>> L#12 >>> >>>>>>>>>> PU L#24 (P#12) >>> >>>>>>>>>> PU L#25 (P#28) >>> >>>>>>>>>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core >>> >>>>>>>>>> L#13 >>> >>>>>>>>>> PU L#26 (P#13) >>> >>>>>>>>>> PU L#27 (P#29) >>> >>>>>>>>>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core >>> >>>>>>>>>> L#14 >>> >>>>>>>>>> PU L#28 (P#14) >>> >>>>>>>>>> PU L#29 (P#30) >>> >>>>>>>>>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core >>> >>>>>>>>>> L#15 >>> >>>>>>>>>> PU L#30 (P#15) >>> >>>>>>>>>> PU L#31 (P#31) >>> >>>>>>>>>> >>> >>>>>>>>>> >>> >>>>>>>>>> >>> >>>>>>>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote: >>> >>>>>>>>>>> Maybe I’m just misreading your HT map - that slurm nodelist >>> >>>>>>>>>>> syntax is a new one to me, but they tend to change things >>> >>>>>>>>>>> around. 
Could you run lstopo on one of those >>> >>>>>>>>>>> compute nodes and send the output? >>> >>>>>>>>>>> >>> >>>>>>>>>>> I’m just suspicious because I’m not seeing a clear pairing of >>> >>>>>>>>>>> HT numbers in your output, but HT numbering is BIOS-specific >>> >>>>>>>>>>> and I may just not be understanding your >>> >>>>>>>>>>> particular pattern. Our error message is clearly indicating >>> >>>>>>>>>>> that we are seeing individual HTs (and not complete cores) >>> >>>>>>>>>>> assigned, and I don’t know the source of that confusion. >>> >>>>>>>>>>> >>> >>>>>>>>>>> >>> >>>>>>>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski < >>> >>>>>>>>>>>> <javascript:;>marcin.krotkiew...@gmail.com >>> >>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote: >>> >>>>>>>>>>>>> If mpirun isn’t trying to do any binding, then you will of >>> >>>>>>>>>>>>> course get the right mapping as we’ll just inherit whatever >>> >>>>>>>>>>>>> we received. >>> >>>>>>>>>>>> Yes. I meant that whatever you received (what SLURM gives) is >>> >>>>>>>>>>>> a correct cpu map and assigns _whole_ CPUs, not a single HT to >>> >>>>>>>>>>>> MPI processes. In the case mentioned earlier openmpi should >>> >>>>>>>>>>>> start 6 tasks on c1-30. If HT would be treated as separate and >>> >>>>>>>>>>>> independent cores, sched_getaffinity of an MPI process started >>> >>>>>>>>>>>> on c1-30 would return a map with 6 entries only. In my case it >>> >>>>>>>>>>>> returns a map >>> >>>>>>>>>>>> with 12 entries - 2 for each core. So one process is >>> >>>>>>>>>>>> in fact allocated both HTs, not only one. Is what I'm saying >>> >>>>>>>>>>>> correct? >>> >>>>>>>>>>>> >>> >>>>>>>>>>>>> Looking at your output, it’s pretty clear that you are >>> >>>>>>>>>>>>> getting independent HTs assigned and not full cores. >>> >>>>>>>>>>>> How do you mean? Is the above understanding wrong? I would >>> >>>>>>>>>>>> expect that on c1-30 with --bind-to core openmpi should bind >>> >>>>>>>>>>>> to logical cores 0 and 16 (rank 0), 1 and 17 (rank 2) and so >>> >>>>>>>>>>>> on. All those logical cores are available in sched_getaffinity >>> >>>>>>>>>>>> map, and there is twice as many logical cores as there are MPI >>> >>>>>>>>>>>> processes started on the node. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>>> My guess is that something in slurm has changed such that it >>> >>>>>>>>>>>>> detects that HT has been enabled, and then begins treating >>> >>>>>>>>>>>>> the HTs as completely independent cpus. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> Try changing “-bind-to core” to “-bind-to hwthread >>> >>>>>>>>>>>>> -use-hwthread-cpus” and see if that works >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>> I have and the binding is wrong. For example, I got this output >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> rank 0 @ compute-1-30.local 0, >>> >>>>>>>>>>>> rank 1 @ compute-1-30.local 16, >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Which means that two ranks have been bound to the same >>> >>>>>>>>>>>> physical core (logical cores 0 and 16 are two HTs of the same >>> >>>>>>>>>>>> core). If I use --bind-to core, I get the following correct >>> >>>>>>>>>>>> binding >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> rank 0 @ compute-1-30.local 0, 16, >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> The problem is many other ranks get bad binding with 'rank XXX >>> >>>>>>>>>>>> is not bound (or bound to all available processors)' warning. 
>>> >>>>>>>>>>>> >>> >>>>>>>>>>>> But I think I was not entirely correct saying that 1.10.1rc1 >>> >>>>>>>>>>>> did not fix things. It still might have improved something, >>> >>>>>>>>>>>> but not everything. Consider this job: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6' >>> >>>>>>>>>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]' >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> If I run 32 tasks as follows (with 1.10.1rc1) >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 >>> >>>>>>>>>>>> ./affinity >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> I get the following error: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>> >>>>>>>>>>>> A request was made to bind to that would result in binding more >>> >>>>>>>>>>>> processes than cpus on a resource: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Bind to: CORE >>> >>>>>>>>>>>> Node: c9-31 >>> >>>>>>>>>>>> #processes: 2 >>> >>>>>>>>>>>> #cpus: 1 >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> You can override this protection by adding the >>> >>>>>>>>>>>> "overload-allowed" >>> >>>>>>>>>>>> option to your binding directive. >>> >>>>>>>>>>>> -------------------------------------------------------------------------- >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> If I now use --bind-to core:overload-allowed, then openmpi >>> >>>>>>>>>>>> starts and _most_ of the threads are bound correctly (i.e., >>> >>>>>>>>>>>> map contains two logical cores in ALL cases), except this case >>> >>>>>>>>>>>> that required the overload flag: >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> rank 15 @ compute-9-31.local 1, 17, >>> >>>>>>>>>>>> rank 16 @ compute-9-31.local 11, 27, >>> >>>>>>>>>>>> rank 17 @ compute-9-31.local 2, 18, >>> >>>>>>>>>>>> rank 18 @ compute-9-31.local 12, 28, >>> >>>>>>>>>>>> rank 19 @ compute-9-31.local 1, 17, >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Note pair 1,17 is used twice. The original SLURM delivered map >>> >>>>>>>>>>>> (no binding) on this node is >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>> >>>>>>>>>>>> 28, 29, >>> >>>>>>>>>>>> rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>> >>>>>>>>>>>> 28, 29, >>> >>>>>>>>>>>> rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>> >>>>>>>>>>>> 28, 29, >>> >>>>>>>>>>>> rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>> >>>>>>>>>>>> 28, 29, >>> >>>>>>>>>>>> rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, >>> >>>>>>>>>>>> 28, 29, >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Why does openmpi use cores (1,17) twice instead of using core >>> >>>>>>>>>>>> (13,29)? Clearly, the original SLURM-delivered map has 5 CPUs >>> >>>>>>>>>>>> included, enough for 5 MPI processes. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Cheers, >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Marcin >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski < >>> >>>>>>>>>>>>>> <javascript:;>marcin.krotkiew...@gmail.com >>> >>>>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote: >>> >>>>>>>>>>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm may >>> >>>>>>>>>>>>>>> be treating HTs as “cores” - i.e., as independent cpus. Any >>> >>>>>>>>>>>>>>> chance that is true? >>> >>>>>>>>>>>>>> Not to the best of my knowledge, and at least not >>> >>>>>>>>>>>>>> intentionally. 
SLURM starts as many processes as there are >>> >>>>>>>>>>>>>> physical cores, not threads. To verify this, consider this >>> >>>>>>>>>>>>>> test case:
>>> > <heterogeneous_topologies.patch>