Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet

2010-07-29 Thread Samuel Thibault
Brice Goglin, on Thu 29 Jul 2010 13:01:10 +0200, wrote:
> > In my opinion, the job hwloc does in forming "groups" is basically OK.
> > Also the group content makes sense.
> 
> We're lucky that it somehow matches the physical ordering,
> but it is really meaningless given the distance matrix.

Well, it does: even if it is arbitrary, it corresponds to the distances
and can be useful for binding applications. It could be an optional
module in hwloc.

Samuel


Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet

2010-07-29 Thread Brice Goglin
> In my opinion, the job hwloc does in forming "groups" is basically OK.
> Also the group content makes sense.

We're lucky that it somehow matches the physical ordering,
but it is really meaningless given the distance matrix.
That's why Group2 matches nothing in reality.
Group3 matches nothing as well from what I understand.

This meaningless part will disappear in hwloc 1.1. You will
only see 24 Group0 objects.

> The only "strange" thing is, that the grouping code becomes disturbed on
> this special machine, which only contains 3/4 of the NUMA nodes that are
> found in a fully equipped rack.

It's an artifact of the aforementioned meaningless grouping code.
If you have 2^N objects with such a distance matrix, the grouping code
will create a binary tree. If it's not 2^N, you'll get artifacts like
the one here, since the binary tree isn't complete.

> Physically, the 2nd enclosure is only
> half filled. I'm wondering what would happen in a fully equipped rack.
> 
> Will there be 4xGroup3, or 2xGroup4 with 2xGroup3 each? From my feeling
> the latter should happen.

Yes, the latter would happen. Such a distance matrix always groups pairs
of consecutive objects starting from #0. So you'll get two pairs of Group3
grouped in 2 Group4 objects.
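
As a rough illustration of that pairing (a Python sketch, not hwloc's
actual C code), assume each level simply groups consecutive objects two
by two starting from #0, and that a fully equipped rack would expose 32
Group0 objects, since the current 24 are 3/4 of it. The names pair_up
and level_sizes are made up for this example.

```python
def pair_up(items):
    """Group consecutive items two by two, starting from #0."""
    return [items[i:i + 2] for i in range(0, len(items), 2)]


def level_sizes(n_objects):
    """Object count at each level while pairs keep being merged."""
    level = list(range(n_objects))
    sizes = [n_objects]
    while len(level) > 1:
        level = pair_up(level)  # an odd leftover ends up alone
        sizes.append(len(level))
    return sizes


print(level_sizes(24))      # the 3/4-filled rack discussed here
print(level_sizes(32))      # assumed size of a fully equipped rack
print(pair_up([0, 1, 2]))   # three Group3-like objects: one pair, one leftover
```

Starting from 24 objects, three rounds of pairing leave three objects,
so the next round forms one complete pair plus a leftover; starting
from 32, it forms two complete pairs.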

Brice


Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet

2010-07-29 Thread Brice Goglin
On 28/07/2010 18:53, Brice Goglin wrote:
> Actually no, but it's very hard to see :)
>   lstopo - | egrep "(NUMA|Group)"
> shows that Group4#0 only contains Group3#0 and #1.
> Group3#2 is directly a child of the machine (the indentation is smaller).
>
>   

For the record, this is caused by the fact that Group objects are
ignored when they bring no new hierarchy (when they have a single child
or are the only child of another object). Group4#1 is created first and
removed later. I don't think there's any way to change this default
behavior with the current API. Maybe we should have something
intermediate such as "ignore what brings no new hierarchy only if you
can remove the entire level" so that we don't get only half of the
Group4 level.
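
To make the effect concrete, here is a small Python sketch (not hwloc
code) that models only the "single child" half of that rule on a toy
tree. The tuple layout (name, children) and the function name prune are
assumptions made for this example.

```python
def prune(node):
    """Drop Group nodes that have exactly one child, keeping that child."""
    name, children = node
    kept = []
    for child in children:
        child = prune(child)
        if child[0].startswith("Group") and len(child[1]) == 1:
            child = child[1][0]  # splice the only child in place of the group
        kept.append(child)
    return (name, kept)


# Toy version of the upper levels: Group4#1 holds a single Group3.
tree = ("Machine", [
    ("Group4#0", [("Group3#0", []), ("Group3#1", [])]),
    ("Group4#1", [("Group3#2", [])]),
])

print(prune(tree))
```

Group4#1 disappears and Group3#2 becomes a direct child of Machine,
while Group4#0 stays, which is the half-present Group4 level seen in
the lstopo output.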

Brice



Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet

2010-07-28 Thread Brice Goglin
On 28/07/2010 20:59, Bernd Kallies wrote:
> So it seems to me that you basically get a distance matrix of PU objects
>   

NUMA node objects actually. That's what Linux and Solaris report.

> from the system (the machine vendor), and probably you do agglomerative
> average linkage cluster analysis on it to determine the number and
> hierarchy of HWLOC_OBJ_GROUP objects (beyond what can be named by some
> hardware building block like core or cache etc). Is this right?
> I'm wondering if this is the right approach. Did you try other distance
> functions (e.g. single linkage)?
>   

In 1.0.x we look at "complete graphs with minimal distances" and then at
"transitive graphs with minimal distances". One problem with this old
code is:
it finds that Group0#0 and #1 have minimal distance between them (22)
but it ignores the fact that Group0#2 is also at the same distance from
#1. And so on.

This code actually gives completely invalid groups on some strange HP
machines. In trunk, the code was reworked/cleaned to only look for
transitive graphs. Given your distance matrix, everybody is transitively
connected to everybody through one or several minimal distance links, so
everybody is grouped together in the end.
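
The transitive idea can be sketched in a few lines of Python. This is
not the trunk code, only an assumed reading of it: link every pair
whose distance equals the smallest off-diagonal value, then take
connected components. The function name transitive_groups and the
4x4 "two_pairs" matrix are hypothetical; group2 is the Group2 matrix
posted in this thread.

```python
def transitive_groups(dist):
    """Connected components of the graph made of minimal-distance links."""
    n = len(dist)
    dmin = min(dist[i][j] for i in range(n) for j in range(n) if i != j)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] == dmin:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


group2 = [
    [20, 28, 36, 44, 52, 60],
    [28, 20, 28, 36, 44, 52],
    [36, 28, 20, 28, 36, 44],
    [44, 36, 28, 20, 28, 36],
    [52, 44, 36, 28, 20, 28],
    [60, 52, 44, 36, 28, 20],
]
# Hypothetical matrix with two tight pairs, for contrast.
two_pairs = [
    [10, 20, 40, 40],
    [20, 10, 40, 40],
    [40, 40, 10, 20],
    [40, 40, 20, 10],
]

print(transitive_groups(group2))
print(transitive_groups(two_pairs))
```

On the thread's matrix every object is chained to its neighbour by a
minimal link, so a single component comes out and no grouping is
useful; the contrast matrix splits into two real groups.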

> Besides that, and from the viewpoint of a tree representation of the
> result of clustering, I would expect that every pair of two objects of
> same type have common ancestors of the same type. For the given UV
> topology I would not expect that there are two Group3 that have a Group4
> ancestor, while the 3rd Group3 is direct child of Machine. I would
> expect EITHER that the 3rd Group3 is also child of a Group4 (maybe a
> second one), OR that there is no Group4.
>   

Right, I'll see if I can fix this without changing too many things in the
1.0 branch.

Brice



Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet

2010-07-28 Thread Bernd Kallies
On Wed, 2010-07-28 at 20:36 +0200, Brice Goglin wrote:
> On 28/07/2010 18:53, Brice Goglin wrote:
> > Distance matrix between Group0 objects:
> > 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66
> > 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64
> > 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
> > 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60
> > 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58
> > 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56
> > 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54
> > 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52
> > 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
> > 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48
> > 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46
> > 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44
> > 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42
> > 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40
> > 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38
> > 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36
> > 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34
> > 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32
> > 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30
> > 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28
> > 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26
> > 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24
> > 64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22
> > 66 64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13
> >
> > Between Group1:
> > 17 24 28 32 36 40 44 48 52 56 60 64
> > 24 17 24 28 32 36 40 44 48 52 56 60
> > 28 24 17 24 28 32 36 40 44 48 52 56
> > 32 28 24 17 24 28 32 36 40 44 48 52
> > 36 32 28 24 17 24 28 32 36 40 44 48
> > 40 36 32 28 24 17 24 28 32 36 40 44
> > 44 40 36 32 28 24 17 24 28 32 36 40
> > 48 44 40 36 32 28 24 17 24 28 32 36
> > 52 48 44 40 36 32 28 24 17 24 28 32
> > 56 52 48 44 40 36 32 28 24 17 24 28
> > 60 56 52 48 44 40 36 32 28 24 17 24
> > 64 60 56 52 48 44 40 36 32 28 24 17
> >
> > Group2:
> > 20 28 36 44 52 60
> > 28 20 28 36 44 52
> > 36 28 20 28 36 44
> > 44 36 28 20 28 36
> > 52 44 36 28 20 28
> > 60 52 44 36 28 20
> >
> > Group3:
> > 24 36 52
> > 36 24 36
> > 52 36 24
> >   
> 
> Actually, all these distance matrices (except the NUMA nodes' one, the
> one not included above) show a ring topology without the link between
> the first and the last object. So grouping makes no sense there. hwloc
> 1.0.x groups object #2N with object #2N+1 because its grouping algorithm
> isn't very clever. It could also link #2N-1 with #2N, it wouldn't be
> worse. The grouping algorithm is more clever in svn trunk. It detects
> this ring properly and does not group anything (except pairs of NUMA nodes).
> 
> It's actually surprising that this machine doesn't show a better
> distance matrix. I guess SGI still has a hypercube or whatever nice
> topology interconnecting IRUs and blades. Older Altix machines had very
> nice distance matrices where we would detect multiple levels of groups
> that really showed the physical hierarchy of blades/IRUs/... I wonder if
> your SGI BIOS is buggy :)

It would not be the first case of a buggy BIOS. I'll forward our
discussion to our SGI representatives and Alexis Cousein and Rüdiger
Wolff from SGI (M. Raymond may know him). Let's see what they say. We
are one of the early UltraViolet customers.

From my point of view, having the groupings beyond the blade level in
the hwloc topology is good for our purposes. We want to use the hwloc
topology to calculate pinning maps for MPI applications. Currently we
use the distance map obtained via hwloc to scatter tasks according to a
maximum-distance approach between HWLOC_OBJ_PU objects. I also gave our
current algorithm to the MVAPICH2 dev team, which wants to put it into
the next 1.5.x release.
With the example UV topology we discuss here, our pinning map starts
with PU objects os_index 0,256,128,320,... that means 1st task on 1st
CPU of 1st Group3, 2nd task on 1st CPU of 3rd Group3 (which is the
lonely one), 3rd task on 1st CPU of 2nd Group3. Having in mind that an
MPI application that got all CPUs of this topology may start only 3
tasks and each task allocates a lot of memory far beyond what a
single NUMA node has directly attached, then reducing the topology to
the NUMA-node or blade level would be a bad idea, because then our
pinning map would start with 0,16,32,48,... (when having only the Group0
level but not the higher groupings).
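
To show what a maximum-distance ordering looks like, here is a minimal
Python sketch. It is not our actual pinning code; it assumes a plain
greedy farthest-first rule (always place the next task on the object
whose closest already-used object is farthest away) and applies it to
the Group3 matrix quoted in this thread. The name scatter_order is made
up for this example.

```python
def scatter_order(dist, first=0):
    """Greedy farthest-first ordering of the objects of a distance matrix."""
    order = [first]
    remaining = set(range(len(dist))) - {first}
    while remaining:
        # Pick the object whose nearest already-placed object is farthest
        # away; ties go to the lowest index.
        nxt = max(sorted(remaining),
                  key=lambda c: min(dist[c][p] for p in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


group3 = [
    [24, 36, 52],
    [36, 24, 36],
    [52, 36, 24],
]

print(scatter_order(group3))
```

On the Group3 matrix this visits the 1st, then the 3rd, then the 2nd
Group3, the same order as the start of the pinning map described above.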

Comments appreciated !!!

> Michael Raymond, anything 

Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet

2010-07-28 Thread Brice Goglin
On 28/07/2010 18:53, Brice Goglin wrote:
> Distance matrix between Group0 objects:
> 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66
> 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64
> 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
> 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60
> 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58
> 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56
> 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54
> 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52
> 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
> 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46 48
> 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44 46
> 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42 44
> 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40 42
> 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38 40
> 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36 38
> 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34 36
> 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32 34
> 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30 32
> 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28 30
> 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26 28
> 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24 26
> 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22 24
> 64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13 22
> 66 64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30 28 26 24 22 13
>
> Between Group1:
> 17 24 28 32 36 40 44 48 52 56 60 64
> 24 17 24 28 32 36 40 44 48 52 56 60
> 28 24 17 24 28 32 36 40 44 48 52 56
> 32 28 24 17 24 28 32 36 40 44 48 52
> 36 32 28 24 17 24 28 32 36 40 44 48
> 40 36 32 28 24 17 24 28 32 36 40 44
> 44 40 36 32 28 24 17 24 28 32 36 40
> 48 44 40 36 32 28 24 17 24 28 32 36
> 52 48 44 40 36 32 28 24 17 24 28 32
> 56 52 48 44 40 36 32 28 24 17 24 28
> 60 56 52 48 44 40 36 32 28 24 17 24
> 64 60 56 52 48 44 40 36 32 28 24 17
>
> Group2:
> 20 28 36 44 52 60
> 28 20 28 36 44 52
> 36 28 20 28 36 44
> 44 36 28 20 28 36
> 52 44 36 28 20 28
> 60 52 44 36 28 20
>
> Group3:
> 24 36 52
> 36 24 36
> 52 36 24
>   

Actually, all these distance matrices (except the NUMA nodes' one, the
one not included above) show a ring topology without the link between
the first and the last object. So grouping makes no sense there. hwloc
1.0.x groups object #2N with object #2N+1 because its grouping algorithm
isn't very clever. It could also link #2N-1 with #2N, it wouldn't be
worse. The grouping algorithm is more clever in svn trunk. It detects
this ring properly and does not group anything (except pairs of NUMA nodes).
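
The "ring without the closing link" shape can be checked with a short
Python sketch. This is not the detection code from svn trunk, only an
assumed heuristic: count, for every object, how many others sit at the
minimal off-diagonal distance. An open chain gives 1 for the two end
objects and 2 for all the others. The helper names are hypothetical;
the matrices are rebuilt from the first rows posted above, since each
entry only depends on |i - j|.

```python
def from_first_row(row):
    """Rebuild a matrix whose entry (i, j) only depends on |i - j|."""
    n = len(row)
    return [[row[abs(i - j)] for j in range(n)] for i in range(n)]


def min_link_degrees(dist):
    """Number of minimal-distance neighbours of each object."""
    n = len(dist)
    dmin = min(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return [sum(1 for j in range(n) if j != i and dist[i][j] == dmin)
            for i in range(n)]


def looks_like_open_chain(dist):
    """Heuristic: two end objects with one link, all others with two."""
    deg = min_link_degrees(dist)
    return sorted(deg) == [1, 1] + [2] * (len(deg) - 2)


group1 = from_first_row([17, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64])
group3 = from_first_row([24, 36, 52])

print(min_link_degrees(group3))
print(looks_like_open_chain(group1))
```

The degree test alone does not prove the links form one connected
chain, so it is only a quick sanity check on matrices like these.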

It's actually surprising that this machine doesn't show a better
distance matrix. I guess SGI still has a hypercube or whatever nice
topology interconnecting IRUs and blades. Older Altix machines had very
nice distance matrices where we would detect multiple levels of groups
that really showed the physical hierarchy of blades/IRUs/... I wonder if
your SGI BIOS is buggy :)

Michael Raymond, anything to say about this?

Brice



Re: [hwloc-devel] Bug report: topology strange on SGI UltraViolet

2010-07-28 Thread Samuel Thibault
Bernd Kallies, on Wed 28 Jul 2010 18:09:28 +0200, wrote:
> > > topology is understandable. I'm wondering about "Group4", which
> > > contains the three "Group3" objects. lstopo should print "1534GB"
> > > instead of "1022GB". There is only one "Group4" object, and there are no
> > > other direct children of the root object.
> > 
> > Indeed, there's something wrong.
> > Can you send the output of tests/linux/gather_topology.sh so that I try
> > to debug this from here?
> 
> Is attached.

Actually the Group4 object doesn't contain the three Group3 objects:

€ grep 'Group[34]' gather-topology-uv.tar.gz.output  
  Group4 #0 (total=1071374336KB)
    Group3 #0 (total=534634496KB)
    Group3 #1 (total=536739840KB)
  Group3 #2 (total=536739840KB)

You can also see it using
lstopo --gridsize 2 --fontsize 5 
for instance.

So it seems all good to me.

> We have one UV rack, which is filled with 3/4 of the max. number of
> blades. According to the specs, two NUMA nodes form one "blade". This
> level corresponds to "Group0" in the hwloc topology. Two blades are
> cross-linked via the NUMAlink, forming "paired nodes" = "Group1". What
> "Group2" might correspond to - I don't know. "Group3" corresponds to one
> "chassis" or IRU. "Group4" may be an "enclosure", and "Machine" is the
> "rack".

Wow, it's impressive that hwloc actually finds out all this just from
the distance matrix :)

> In my opinion the hwloc topology for our machine should contain 2x
> Group4.

hwloc cannot find Group4: it finds out groups from the distance matrix.
Since there are no two Group3 objects to group, it doesn't know some
notion of Group4 exists there.

> However, when walking the topology tree via the API, then it seems to
> contain correct details.

Yep :)

Samuel