On 04/09/2011 23:30, Brice Goglin wrote:
> On 04/09/2011 22:35, Ake Sandgren wrote:
>> On Sun, 2011-09-04 at 22:13 +0200, Brice Goglin wrote:
>>> Hello,
>>>
>>> Could you log again on this node (with same cgroups enabled), run
>>> hwloc-gather-topology <name>
>>> and send the resulting <name>.output and <name>.tar.bz2?
>>>
>>> Send them to the hwloc-devel or open a ticket on
>>> https://svn.open-mpi.org/trac/hwloc (or send them to me in private if
>>> you don't want to subscribe).
>> Since it's a bit late here i'm lazy and sending to you directly.
>>
>> Output from both nodes involved in the batchjob
>> slurm -N 2 --ntasks-per-node=1 ... was what i was using.
>>
>> Hope it helps. If not let me know if there is anything else i can do.
>>
>> /Åke S.
> Thanks, I understand the problem but it's not easy to fix. To workaround
> the crash until I come with a real fix, you can comment-out
> hwloc_topology__set_distance_matrix()
> at the end of look_sysfsnode() in topology-linux.c

Dear Ake,

Could you try the attached patch? It's not optimized, but it's probably
going in the right direction. (And don't forget to undo the comment-out
mentioned above if you tried it.)

Thanks,
Brice

Index: src/topology.c
===================================================================
--- src/topology.c	(revision 3750)
+++ src/topology.c	(working copy)
@@ -1856,6 +1856,8 @@
   /*
    * Now that objects are numbered, take distance matrices from backends and put them in the main topology
    */
+  hwloc_restrict_distances(topology, HWLOC_RESTRICT_FLAG_ADAPT_DISTANCES);
+
   hwloc_convert_distances_indexes_into_objects(topology);
   hwloc_finalize_logical_distances(topology);
 # ifdef HWLOC_HAVE_XML