Re: [hwloc-devel] structure assumptions, duplication

2009-09-30 Thread Fawzi Mohamed


On 30-set-09, at 09:20, Brice Goglin wrote:


Fawzi Mohamed wrote:

1) a fully hierarchical representation of the machine/hardware where
each level is a partition, and each level fully covers the previous
one (from any node you go through all levels using father/childrens,
father/child are just one level away from each other.


Actually, we support heterogeneous topologies where one level may not
cover entirely the previous one, for instance if you have two  
different

processors with different levels of caches.


thanks that is good to know.
In that case the upper level (let's assume it is a NODE) is still all  
at the same level, but the depth of the children of this might be  
different from depth-1?
like this (I am skipping some levels, as just the structure is  
important):


NODE 1
  cache2
cache1
  p0
  cache2
cache1
  p1
NODE 2
cache1
  p2
cache1
  p3

or is it even possible to have a node with children of a single node  
of different depth?


this is supposing that you want to keep that a level has a single type  
(and given the api I suppose it is so).


Fawzi




Re: [hwloc-devel] structure assumptions, duplication

2009-09-29 Thread Fawzi Mohamed

Hi Samuel,

On 29-set-09, at 18:14, Samuel Thibault wrote:


Fawzi Mohamed, le Tue 29 Sep 2009 17:39:17 +0200, a écrit :

so that in the future one could avoid storing it at least in the
deepest levels where it is easy and relatively cheap to generate (and
where one would have the largest savings).


Even the deepest levels would have a L1 cache level on top of maybe  
just
at most 4 threads.  Here we only save the "children" pointers, which  
is

not so many, compared to the siblings & cousins pointers, I'm not sure
it is really worth the pain of defining a long series of functions.


ok those were two separate things, I was thinking

cpuset -> cpuset_ptr (or just a flag that says if the structure has  
it, and thus two structures, a long one with it and a short one  
without, differing only in the tail if you really want to be hacky).
Then cpuset is generated on the fly for the deepest level (like less  
than 4-8 proc ->  lots of memory savings on large machines).

(cost 1 function, and copying or building the cpuset)

sibling/cousin -> only cousins (you can make them loop first on  
siblings, then to the others if it really is a partition)

children -> only one representation (arity/childrens or first/last)
(cost many functions)

the main point is that these changes/optimizations can be done even  
later without breaking anything if you use functions.



I would say that for most operations (cpuset, next_sibling,...) using
functions that get a hwloc_obj_t (and if needed also a topology) and
return what requested is the way to go.


That means a long series of functions, I'm not sure it's really  
clearer

for the user. obj->father looks to me easier to read than
hwloc_obj_father(obj), particularly in complex expressions.


ok I can see that, so I guess you will have to evaluate if the  
abstraction cost is worth the potential savings, maybe for cpuset it  
is; for sibling,... you might be right that it isn't, for father it  
sure isn't.



I suppose that most of these operations are not performance critical.


I wouldn't suppose this actually. Detection time is probably not
performance critical, but it could be useful to make browsing the
topology very efficient.


ok, I was thinking that maybe you did/would like to provide in the
future something akin to what opensolaris does with locality groups
http://opensolaris.org/os/community/performance/mpo_overview.pdf


Yes, we intend to provide something similar.


In fact what I "need" (or at least I think I need ;) is just the next
neighbors, basically I go up the hierarchy, and look which new
neighbors I have, so some hierarchy like the lgroups is close to what
I need, and simpler to handle than the full graph.


That's what future heuristics would build for you, yes.


tha's great, I am really looking forward to it.

and sorry if I seem to be criticizing a lot, as I am mainly a user,  
not a developer of hwloc, but I hope it is constructive, and maybe  
helps making hwloc better...


ciao
Fawzi


Re: [hwloc-devel] structure assumptions, duplication

2009-09-29 Thread Samuel Thibault
Fawzi Mohamed, le Tue 29 Sep 2009 17:39:17 +0200, a écrit :
> so that in the future one could avoid storing it at least in the
> deepest levels where it is easy and relatively cheap to generate (and
> where one would have the largest savings).

Even the deepest levels would have a L1 cache level on top of maybe just
at most 4 threads.  Here we only save the "children" pointers, which is
not so many, compared to the siblings & cousins pointers, I'm not sure
it is really worth the pain of defining a long series of functions.

> I would say that for most operations (cpuset, next_sibling,...) using  
> functions that get a hwloc_obj_t (and if needed also a topology) and  
> return what requested is the way to go.

That means a long series of functions, I'm not sure it's really clearer
for the user. obj->father looks to me easier to read than
hwloc_obj_father(obj), particularly in complex expressions.

> I suppose that most of these operations are not performance critical.

I wouldn't suppose this actually. Detection time is probably not
performance critical, but it could be useful to make browsing the
topology very efficient.

> ok, I was thinking that maybe you did/would like to provide in the  
> future something akin to what opensolaris does with locality groups
> http://opensolaris.org/os/community/performance/mpo_overview.pdf

Yes, we intend to provide something similar.

> In fact what I "need" (or at least I think I need ;) is just the next  
> neighbors, basically I go up the hierarchy, and look which new  
> neighbors I have, so some hierarchy like the lgroups is close to what  
> I need, and simpler to handle than the full graph.

That's what future heuristics would build for you, yes.

Samuel


Re: [hwloc-devel] structure assumptions, duplication

2009-09-29 Thread Fawzi Mohamed

Thanks for the quick answers!

On 29-set-09, at 16:59, Samuel Thibault wrote:


Fawzi Mohamed, le Tue 29 Sep 2009 16:39:47 +0200, a écrit :

Maybe I worry too much, but with machines with 1'000 of processor
coming, and maybe wanting local restricted copies to know the  
topology

of the whole machine (to communicate with others) I worry also about
few pointers here an there.


Actually the size of all the pointers is not so much compared to the
size of the cpuset (by default 1024/8 = 128 bytes, worth 16 pointers).


yes and so the question if that should not be returned by a function  
that initializes its argument (or returned by value), so that in the  
future one could avoid storing it at least in the deepest levels where  
it is easy and relatively cheap to generate (and where one would have  
the largest savings).
I would say that for most operations (cpuset, next_sibling,...) using  
functions that get a hwloc_obj_t (and if needed also a topology) and  
return what requested is the way to go.
Basically OOP in C, so that the actual implementation is hidden, and  
if you change the implementation the user has just to recompile.
If the function is simple and gets inlined the speed hit should be  
basically zero, and anyway I suppose that most of these operations are  
not performance critical.

This way you will be more free in the future to make aggressive changes.


2) assumption on the structure

Also a ring like topology cannot be cleanly represented with a
partition if one wants to have objects for groups with uniform  
latency.


Our plan was to not only provide a hierarcical view but also the  
precise

graph.  For instance, that means that for an Altix machine with a
2D-mesh of 3*4 NUMA nodes, the hierarchy would be system containing
12 nodes, themselves containing a socket etc. And the fact that the
12 nodes are organized as a 2D-mesh would be expressed by a graph
structure, independently of the hierarchy.


ok I see

[... (thanks for the answers) ...]


The idea is that some programmers will only want to cope with a
hierarchical machine, while others will want to fine-tune according to
the precise topology (much harder, not polynomial at least). And  
thus we

should provide both. For the first kind of programmers, to tackle the
3*4 2D-mesh case above, we could provide a flag "make it  
hierarchical",
which would heuristically group nodes recursively according to  
locality.
For now we only do such grouping from the NUMA distances when there  
are

clear NUMA subgroups.


ok, I was thinking that maybe you did/would like to provide in the  
future something akin to what opensolaris does with locality groups

http://opensolaris.org/os/community/performance/mpo_overview.pdf

In fact what I "need" (or at least I think I need ;) is just the next  
neighbors, basically I go up the hierarchy, and look which new  
neighbors I have, so some hierarchy like the lgroups is close to what  
I need, and simpler to handle than the full graph.



So I wanted to know how you cope with those things, and also if
something will probably change in the future, as some assumptions  
will

inevitably creep in my code... and I would prefer to make the good
ones :)


We should probably (agree on and) state what we want to always provide
indeed.


Yes that would be great

ciao
Fawzi