Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread Ralph Castain
If you get a chance, you might test this patch: https://github.com/open-mpi/ompi-release/pull/656 I think it will resolve the problem you mentioned, and is small enough to go into 1.10.1. Ralph

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread marcin.krotkiewski
Sorry, I think I confused one thing: On 10/08/2015 09:15 PM, marcin.krotkiewski wrote: For version 1.10.1rc1 and up the situation is a bit different: it seems that in many cases all cores are present in the cpuset, just that the binding does not take place in a lot of cases. Instead,

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread Ralph Castain
I agree that makes sense. I’ve been somewhat limited in my ability to work on this lately, and I think Gilles has been in a similar situation. I’ll try to create a 1.10 patch later today. Depending how minimal I can make it, we may still be able to put it into 1.10.1, though the window on that

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread marcin.krotkiewski
Dear Ralph, Gilles, and Jeff Thanks a lot for your effort.. Understanding this problem has been a very interesting exercise for me that let me understand OpenMPI much better (I think:). I have given it all a little more thought, and done some more tests on our production system, and I think

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-08 Thread Marcin Krotkiewski
Hi, Gilles, I have briefly tested your patch with master. So far everything works. I must say what I really like about this version is that with --report-bindings it actually shows what the heterogeneous architecture looks like, i.e., varying number of cores/sockets per compute node. This

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-07 Thread Ralph Castain
I’m a little nervous about this one, Gilles. It’s doing a lot more than just addressing the immediate issue, and I’m concerned about any potential side-effects that we don’t fully uncover prior to release. I’d suggest a two-pronged approach: 1. use my alternative method for 1.10.1 to solve the

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-07 Thread Jeff Squyres (jsquyres)
Is this something that needs to go into v1.10.1? If so, a PR needs to be filed ASAP. We were supposed to make the next 1.10.1 RC yesterday, but slipped to today due to some last second patches.

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-07 Thread Gilles Gouaillardet
Marcin, here is a patch for the master, hopefully it fixes all the issues we discussed. i will make sure it applies fine against the latest 1.10 tarball tomorrow. Cheers, Gilles

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-06 Thread marcin.krotkiewski
Gilles, Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 - thank you. Eagerly waiting for the other patches, let me know and I will test them later this week. Marcin

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-06 Thread Gilles Gouaillardet
Marcin, my understanding is that in this case, patched v1.10.1rc1 is working just fine. am I right ? I prepared two patches one to remove the warning when binding on one core if only one core is available, an other one to add a warning if the user asks a binding policy that makes no sense with

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Jeff Squyres (jsquyres)
I filed an issue to track this problem here: https://github.com/open-mpi/ompi/issues/978

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Ralph Castain
Thanks Marcin. I think we have three things we need to address: 1. the warning needs to be emitted regardless of whether or not --report-bindings was given. Not sure how that warning got “covered” by the option, but it is clearly a bug 2. improve the warning to include binding info - relatively

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski
Hi, Gilles, you mentioned you had one failure with 1.10.1rc1 and -bind-to core. Could you please send the full details (script, allocation and output)? in your slurm script, you can do srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status before
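The Cpus_allowed_list values that srun command prints come straight from /proc/self/status. As a side note, the range syntax can be expanded into an explicit CPU set for easier comparison across nodes; a minimal sketch (the helper name and the sample string are illustrative, not taken from the actual allocation):

```python
def expand_cpu_list(cpus: str) -> set[int]:
    """Expand a Cpus_allowed_list string like '0-3,8,24-25' into a set of CPU ids."""
    result = set()
    for part in cpus.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            result.update(range(int(lo), int(hi) + 1))
        else:
            result.add(int(part))
    return result

# Made-up example: two cores with two hardware threads each
print(sorted(expand_cpu_list("8-9,24-25")))  # [8, 9, 24, 25]
```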

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Gilles Gouaillardet
Marcin, there is no need to pursue 1.10.0 since it is known to be broken for some scenario. it would really help me if you could provide the logs I requested, so I can reproduce the issue and make sure we both talk about the same scenario. imho, there is no legitimate reason to -map-by hwthread

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread marcin.krotkiewski
I have applied the patch to both 1.10.0 and 1.10.1rc1. For 1.10.0 it did not help - I am not sure how much (if at all) you want to pursue this. For 1.10.1rc1 I was so far unable to reproduce any binding problems with jobs of up to 128 tasks. Some cosmetic suggestions. The warning it all started with

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Ralph Castain
I think this is okay, in general. I would only make one change: I would only search for an alternative site if the binding policy wasn’t set by the user. If the user specifies a mapping/binding pattern, then we should error out as we cannot meet it. I did think of one alternative that might be

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Gilles Gouaillardet
Ralph and Marcin, here is a proof of concept for a fix (assert should be replaced with proper error handling) for v1.10 branch. if you have any chance to test it, please let me know the results Cheers, Gilles

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Gilles Gouaillardet
OK, i'll see what i can do :-)

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-05 Thread Ralph Castain
I would consider that a bug, myself - if there is some resource available, we should use it

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-04 Thread Gilles Gouaillardet
Marcin, i ran a simple test with v1.10.1rc1 under a cpuset with - one core (two threads 0,16) on socket 0 - two cores (two threads each 8,9,24,25) on socket 1 $ mpirun -np 3 -bind-to core ./hello_c -- A request was made to
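For reference, the cpuset Gilles describes contains three full cores, so a three-rank -bind-to core job should be satisfiable. A toy mapper illustrating the expected rank-to-core assignment (this models the intent of -bind-to core, not Open MPI's actual rmaps implementation):

```python
# Cores available in the cpuset, as (socket, [hardware threads]) pairs,
# matching the topology described in the message above.
cores = [
    (0, [0, 16]),   # one core on socket 0
    (1, [8, 24]),   # first core on socket 1
    (1, [9, 25]),   # second core on socket 1
]

def map_by_core(nprocs, cores):
    """Round-robin ranks over available cores; each rank is bound to
    every hardware thread of its core (what -bind-to core should do)."""
    if nprocs > len(cores):
        raise RuntimeError("oversubscription: more ranks than cores")
    return {rank: cores[rank % len(cores)][1] for rank in range(nprocs)}

print(map_by_core(3, cores))  # {0: [0, 16], 1: [8, 24], 2: [9, 25]}
```

With three ranks and three cores this mapping succeeds, which is why the failure reported here pointed at a bug rather than a genuinely unsatisfiable request.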

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-04 Thread marcin.krotkiewski
Hi, all, I played a bit more and it seems that the problem results from trg_obj = opal_hwloc_base_find_min_bound_target_under_obj() called in rmaps_base_binding.c / bind_downwards being wrong. I do not know the reason, but I think I know when the problem happens (at least on 1.10.1rc1). It

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-04 Thread Gilles Gouaillardet
Ralph, I suspect ompi tries to bind to threads outside the cpuset. this could be pretty similar to a previous issue when ompi tried to bind to cores outside the cpuset. /* when a core has more than one thread, would ompi assume all the threads are available if the core is available ? */ I will
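Gilles's suspicion can be phrased as a predicate: the safe rule checks each hardware thread against the cpuset, while the suspected buggy assumption treats every thread of a partially available core as usable. A minimal sketch (the cpuset and core numbering here are hypothetical, chosen so that one thread of core 0 lies outside the cpuset):

```python
cpuset = {0, 8, 9, 24, 25}                            # hardware threads granted by the scheduler
core_threads = {0: [0, 16], 8: [8, 24], 9: [9, 25]}   # core id -> its PUs

def usable_threads(core, strict=True):
    """Threads of `core` we may bind to. strict=True checks each thread
    against the cpuset; strict=False models the suspected buggy assumption:
    if any thread of the core is available, all of them are."""
    pus = core_threads[core]
    if strict:
        return [pu for pu in pus if pu in cpuset]
    return pus if any(pu in cpuset for pu in pus) else []

print(usable_threads(0, strict=True))   # [0]      (PU 16 is outside the cpuset)
print(usable_threads(0, strict=False))  # [0, 16]  (would bind outside the cpuset)
```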

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Thanks - please go ahead and release that allocation as I’m not going to get to this immediately. I’ve got several hot irons in the fire right now, and I’m not sure when I’ll get a chance to track this down. Gilles or anyone else who might have time - feel free to take a gander and see if

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and executed mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity In case of 1.10.1rc1 I have also added :overload-allowed - output in a separate file. This option did not make much

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Rats - just realized I have no way to test this as none of the machines I can access are setup for cgroup-based multi-tenant. Is this a debug version of OMPI? If not, can you rebuild OMPI with --enable-debug? Then please run it with --mca rmaps_base_verbose 10 and pass along the output. Thanks

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
What version of slurm is this? I might try to debug it here. I’m not sure where the problem lies just yet.

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 1 etc.

Machine (64GB)
  NUMANode L#0 (P#0 32GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1
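Given the pairing Marcin describes ((0,16) share core 0, (1,17) share core 1, and so on), the physical core of a hardware thread on these nodes is simply its id modulo 16; a sketch, assuming 16 cores per node as the pairing implies:

```python
CORES_PER_NODE = 16  # implied by the (0,16), (1,17), ... sibling pairs

def core_of(pu: int) -> int:
    """Physical core that hardware thread `pu` belongs to on this layout."""
    return pu % CORES_PER_NODE

# Siblings (0,16) share core 0, (1,17) share core 1, etc.
print(core_of(0), core_of(16))  # 0 0
print(core_of(1), core_of(17))  # 1 1
```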

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Maybe I’m just misreading your HT map - that slurm nodelist syntax is a new one to me, but they tend to change things around. Could you run lstopo on one of those compute nodes and send the output? I’m just suspicious because I’m not seeing a clear pairing of HT numbers in your output, but HT

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
On 10/03/2015 04:38 PM, Ralph Castain wrote: If mpirun isn’t trying to do any binding, then you will of course get the right mapping as we’ll just inherit whatever we received. Yes. I meant that whatever you received (what SLURM gives) is a correct cpu map and assigns _whole_ CPUs, not a

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
If mpirun isn’t trying to do any binding, then you will of course get the right mapping as we’ll just inherit whatever we received. Looking at your output, it’s pretty clear that you are getting independent HTs assigned and not full cores. My guess is that something in slurm has changed such

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
On 10/03/2015 01:06 PM, Ralph Castain wrote: Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as “cores” - i.e., as independent cpus. Any chance that is true? Not to the best of my knowledge, and at least not intentionally. SLURM starts as many processes as there

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Gilles Gouaillardet
Marcin, could you give a try at v1.10.1rc1 that was released today? it fixes a bug when hwloc was trying to bind outside the cpuset. Ralph and all, imho, there are several issues here - if slurm allocates threads instead of cores, then the --oversubscribe mpirun option could be mandatory - with

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread Ralph Castain
Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as “cores” - i.e., as independent cpus. Any chance that is true? I’m wondering because bind-to core will attempt to bind your proc to both HTs on the core. For some reason, we thought that 8,24 were HTs on the same

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-03 Thread marcin.krotkiewski
Hi, Ralph, I submit my slurm job as follows: salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0 Effectively, the allocated CPU cores are spread among many cluster nodes. SLURM uses cgroups to limit the CPU cores available for mpi processes running on a given cluster node. Compute nodes are

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-02 Thread Ralph Castain
Can you please send me the allocation request you made (so I can see what you specified on the cmd line), and the mpirun cmd line? Thanks Ralph

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-02 Thread Grigory Shamov
Hi All, I just got the same behaviour with old Torque (2.5, uses cpusets) we have and OpenMPI 1.10.0; when --bind-to core is set, occasionally (not always) it fails: Open MPI tried to bind a new process, but something went wrong. The process was killed without launching the target application.

[OMPI users] Process binding with SLURM and 'heterogeneous' nodes

2015-10-02 Thread Marcin Krotkiewski
Hi, I fail to make OpenMPI bind to cores correctly when running from within SLURM-allocated CPU resources spread over a range of compute nodes in an otherwise homogeneous cluster. I have found this thread http://www.open-mpi.org/community/lists/users/2014/06/24682.php and did try to use