Re: [hwloc-users] Netloc feature suggestion

2019-08-19 Thread Brice Goglin
Hello


Indeed we would like to expose this kind of info, but Netloc is
unfortunately short on manpower these days. The code in git master is
outdated. We have a big rework in a branch, but it still needs quite a
lot of polishing before being merged.


The API is still mostly Scotch-oriented (i.e. for process placement
using communication graphs) because that's pretty much the only clear
user request we have received in recent years (most people said "we want
netloc" but never gave any idea of what API they actually needed). Of
course, there will be a way to say "I want the entire machine" or "only
my allocated nodes".


The non-Scotch API for exposing topology details has been made private
until we better understand what users want, and your request would
definitely help there.


Brice




On 19/08/2019 at 09:31, Rigel Falcao do Couto Alves wrote:
>
> Thanks John and Jeff for the replies.
>
>
> Indeed, we are using Slurm on our cluster; so, for now, I can
> stick with reading the network topology's description file at
> runtime, as explained here:
>
>
> https://slurm.schedmd.com/topology.conf.html
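>
> For reference, a minimal topology.conf of the kind that page describes
> could look like the sketch below (the switch and node names are made up
> for illustration; a real file has to match the site's actual fabric):
>
>     SwitchName=leaf1 Nodes=node[01-04]
>     SwitchName=leaf2 Nodes=node[05-08]
>     SwitchName=spine Switches=leaf[1-2]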
>
>
> But given that the idea of the project is to produce a library that can
> be distributed to anyone in the world, it would still be worth having a
> way to gather such information on the go -- as I can already do
> with /hwloc/'s topology information. There is no problem with starting
> simple, i.e. supporting only single-path hierarchies in the beginning.
>
>
> The additional /switch/ information (coming from /netloc/) would then
> be added to the graphical output of our tools, allowing users to
> visually analyse how resource placement (both /intra/- and /inter/-node)
> affects their applications.
>
>
>
> 
> *From:* hwloc-users on behalf of John Hearns via hwloc-users
> *Sent:* Friday, 16 August 2019 07:16
> *To:* Hardware locality user list
> *Cc:* John Hearns
> *Subject:* Re: [hwloc-users] Netloc feature suggestion
>  
> Hi Rigel. This is very interesting.
> First, though, I should say: most batch systems have built-in node
> grouping utilities.
> PBSPro has bladesets - I think they are called placement groups now.
> I used these when running CFD codes in a Formula 1 team.
> The systems administrator has to set these up manually, using
> knowledge of the switch topology.
> In PBSPro, jobs would then 'prefer' to run within the smallest bladeset
> that could accommodate them.
> So you define bladesets for (say) 8/16/24/48-node jobs.
>
> https://pbspro.atlassian.net/wiki/spaces/PD/pages/455180289/Finer+grained+node+grouping
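>
> As a rough sketch of the job-side syntax (the grouping resource name
> "switch" is only an example and depends on how the administrator has
> tagged the nodes), a job can ask to be kept within one placement set
> with something like:
>
>     qsub -l select=16:ncpus=24 -l place=scatter:group=switch job.sh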
>
> Similarly for Slurm
> https://slurm.schedmd.com/topology.html
>
>
> On Wed, 14 Aug 2019 at 18:53, Rigel Falcao do Couto Alves
> <rigel.al...@tu-dresden.de> wrote:
>
> Hi,
>
>
> I am doing a PhD in performance analysis of highly parallel CFD
> codes and would like to suggest a feature for Netloc: from
> topic /Build Scotch sub-architectures/
> (at https://www.open-mpi.org/projects/hwloc/doc/v2.0.3/a00329.php),
> create a function-version of /netloc_get_resources/, which could
> retrieve at runtime the network details of the available cluster
> resources (i.e. the nodes allocated to the job). I am mostly
> interested in how many switches (the gray circles in the figure
> below) need to be traversed in order for any pair of
> allocated nodes to communicate with each other:
>
> [removed 200kB image]
>
>
> For example, suppose my job is running on 4 nodes of the
> cluster, illustrated by the numbers above. All I would love to get
> from Netloc - at runtime - is some sort of classification of the
> nodes, like:
>
>
> 1: aa
>
> 2: ab
>
> 3: ba
>
> 4: ca
>
>
> The difference between nodes 1 and 2 is in the last digit, which
> means their MPI communications only need to traverse 1 switch;
> however, between either of them and nodes 3 or 4, the difference
> starts at the second-to-last digit, which means their communications
> need to traverse two switches. More digits may be prepended to
> the string as necessary, i.e. if the central gray circle in the
> above figure is connected to another switch, which in turn leads to
> another part of the cluster's structure (with its own switches,
> nodes etc.). For me, it is currently irrelevant
> whether e.g. nodes 1 and 2 are physically - or logically -
> consecutive to each other: /a/, /b/, /c/ etc. would be just
> arbitrary identifiers.
>
>
> I would then use this data to plot the process placement, using
> open-source tools developed here at the University of Dresden
> (Germany); i.e. Scotch is not an option for me.

Re: [hwloc-users] Netloc feature suggestion

2019-08-16 Thread Jeff Squyres (jsquyres) via hwloc-users
Don't forget that network topologies can also be complex -- it's not always a 
simple, single-path hierarchy.  There can be multiple paths between any pair of 
hosts on the network.  Sometimes the hosts are aware of the multiple paths, 
sometimes they are not (e.g., sometimes the fabric routing changes during the 
course of a single MPI job, and the hosts/MPI applications are unaware).

Meaning: the information about which network paths are taken for a given 
host-A-to-host-B traversal may be both distributed and transient.


On Aug 14, 2019, at 11:05 AM, Rigel Falcao do Couto Alves 
<rigel.al...@tu-dresden.de> wrote:

Hi,

I am doing a PhD in performance analysis of highly parallel CFD codes and would 
like to suggest a feature for Netloc: from topic Build Scotch sub-architectures 
(at https://www.open-mpi.org/projects/hwloc/doc/v2.0.3/a00329.php), create a 
function-version of netloc_get_resources, which could retrieve at runtime the 
network details of the available cluster resources (i.e. the nodes allocated to 
the job). I am mostly interested in how many switches (the gray circles in 
the figure below) need to be traversed in order for any pair of allocated nodes 
to communicate with each other:



For example, suppose my job is running on 4 nodes of the cluster, 
illustrated by the numbers above. All I would love to get from Netloc - at 
runtime - is some sort of classification of the nodes, like:

1: aa
2: ab
3: ba
4: ca

The difference between nodes 1 and 2 is in the last digit, which means their 
MPI communications only need to traverse 1 switch; however, between either of 
them and nodes 3 or 4, the difference starts at the second-to-last digit, which 
means their communications need to traverse two switches. More digits may be 
prepended to the string as necessary, i.e. if the central gray circle in the 
above figure is connected to another switch, which in turn leads to another 
part of the cluster's structure (with its own switches, nodes etc.). For me, it 
is currently irrelevant whether e.g. nodes 1 and 2 are physically - or 
logically - consecutive to each other: a, b, c etc. would be just arbitrary 
identifiers.
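
To make the counting rule above concrete, here is a small, self-contained 
sketch in plain C (the labels are hard-coded purely for illustration) that 
derives that switch count for a pair of such labels from the length of their 
common prefix:

    #include <stdio.h>
    #include <string.h>

    /* Sketch of the metric described above: one character per switch
     * level, leftmost character closest to the top of the hierarchy.
     * The distance is the number of trailing positions after the longest
     * common prefix, so "aa" vs "ab" gives 1 and "aa" vs "ba" gives 2,
     * matching the counts given in the text. */
    static int switch_distance(const char *a, const char *b)
    {
        size_t len = strlen(a);       /* labels are assumed equal-length */
        size_t common = 0;
        while (common < len && a[common] == b[common])
            common++;
        return (int)(len - common);   /* 0 means the very same node */
    }

    int main(void)
    {
        const char *labels[] = { "aa", "ab", "ba", "ca" }; /* nodes 1..4 */
        for (int i = 0; i < 4; i++)
            for (int j = i + 1; j < 4; j++)
                printf("nodes %d and %d: %d switch level(s)\n",
                       i + 1, j + 1,
                       switch_distance(labels[i], labels[j]));
        return 0;
    }

Running it on the four labels above reports 1 switch level for the pair (1,2) 
and 2 for every pair that has to cross the central switch.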

I would then use this data to plot the process placement, using open-source 
tools developed here at the University of Dresden (Germany); i.e. Scotch is not 
an option for me. The results of my study will be open-source as well and I can 
gladly share them with you once the thesis is finished.

I hope I have clearly explained what I have in mind; please let me know if 
there are any questions. Finally, it is important that this feature is part of 
Netloc's API (as it is supposed to be integrated with the tools we develop 
here), works at runtime, and doesn't require root privileges (as those tools 
are used by our cluster's customers in their everyday job submissions).

Kind regards,


--
Dipl.-Ing. Rigel Alves
researcher

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
Zellescher Weg 12 A 218, 01069 Dresden | Germany

+49 (351) 463.42418
https://tu-dresden.de/zih/die-einrichtung/struktur/rigel-alves




--
Jeff Squyres
jsquy...@cisco.com



_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users