Hi!

I'm getting a segfault in hwloc_setup_distances_from_os_matrix, in the
call to hwloc_bitmap_or. Either objs or objs[i]->cpuset appears to have
been freed and to contain garbage: objs[i]->cpuset has infinite < 0.

I only get this when using Slurm with cgroups and asking for 2 nodes with
1 CPU each. The cpuset is then already set when mpiexec starts, and
something breaks down.

valgrind on mpiexec says:
==27540== Invalid read of size 8
==27540==    at 0x7178F79: opal_paffinity_hwloc_finalize_logical_distances (distances.c:412)
==27540==    by 0x7172C1E: hwloc_discover (topology.c:1805)
==27540==    by 0x71745F2: opal_paffinity_hwloc_topology_load (topology.c:2244)
==27540==    by 0x7164FB4: hwloc_open (paffinity_hwloc_component.c:93)
==27540==    by 0x4F98D2E: mca_base_components_open (mca_base_components_open.c:214)
==27540==    by 0x500084B: opal_paffinity_base_open (paffinity_base_open.c:120)
==27540==    by 0x4F525BB: opal_init (opal_init.c:307)
==27540==    by 0x4E50CA8: orte_init (orte_init.c:78)
==27540==    by 0x403C8F: orterun (orterun.c:615)
==27540==    by 0x4032C3: main (main.c:13)
==27540==  Address 0x6e38380 is 160 bytes inside a block of size 248 free'd
==27540==    at 0x4C270BD: free (vg_replace_malloc.c:366)
==27540==    by 0x716B6A1: unlink_and_free_object_and_children (topology.c:1131)
==27540==    by 0x716BB35: remove_empty (topology.c:1150)
==27540==    by 0x7170CBB: hwloc_discover (topology.c:1768)
==27540==    by 0x71745F2: opal_paffinity_hwloc_topology_load (topology.c:2244)
==27540==    by 0x7164FB4: hwloc_open (paffinity_hwloc_component.c:93)
==27540==    by 0x4F98D2E: mca_base_components_open (mca_base_components_open.c:214)
==27540==    by 0x500084B: opal_paffinity_base_open (paffinity_base_open.c:120)
==27540==    by 0x4F525BB: opal_init (opal_init.c:307)
==27540==    by 0x4E50CA8: orte_init (orte_init.c:78)
==27540==    by 0x403C8F: orterun (orterun.c:615)
==27540==    by 0x4032C3: main (main.c:13)
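
Reading the two stacks together: the block is freed by remove_empty
(called from hwloc_discover at topology.c:1768) and then read again later
in the same hwloc_discover call (topology.c:1805) by the distances code.
Below is a minimal sketch of that pattern and of one possible guard. Every
name in it (struct obj, struct dist_cache, remove_if_empty,
finalize_distances, demo_live_count) is hypothetical and is not the hwloc
code. I am assuming that the distances code keeps its own array of object
pointers and that the cgroup restriction leaves one object with an empty
cpuset; I have not verified either.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins, NOT the real hwloc types. */
struct obj {
    int os_index;
    int cpuset_weight;      /* 0 means no CPUs are left in this object */
};

/* Assumed: an array of object pointers kept next to a distance matrix. */
struct dist_cache {
    struct obj **objs;
    size_t nbobjs;
};

/* Free an empty object, but first clear every cached pointer to it so
 * that a later pass cannot dereference freed memory. */
static void remove_if_empty(struct obj **slot, struct dist_cache *cache)
{
    assert(slot != NULL && cache != NULL);
    struct obj *o = *slot;
    if (o == NULL || o->cpuset_weight != 0)
        return;
    for (size_t i = 0; i < cache->nbobjs; i++)
        if (cache->objs[i] == o)
            cache->objs[i] = NULL;
    free(o);
    *slot = NULL;
}

/* Count the cached objects that are still valid, skipping cleared ones. */
static size_t finalize_distances(const struct dist_cache *cache)
{
    size_t live = 0;
    for (size_t i = 0; i < cache->nbobjs; i++)
        if (cache->objs[i] != NULL)
            live++;
    return live;
}

/* Two objects, the second with an empty cpuset (an assumption). */
size_t demo_live_count(void)
{
    struct obj *nodes[2] = { NULL, NULL };
    struct obj *cached[2] = { NULL, NULL };
    struct dist_cache cache = { cached, 2 };

    for (int i = 0; i < 2; i++) {
        nodes[i] = calloc(1, sizeof *nodes[i]);
        if (nodes[i] == NULL) {
            free(nodes[0]);
            return 0;
        }
        nodes[i]->os_index = i;
        nodes[i]->cpuset_weight = (i == 0) ? 1 : 0;
        cached[i] = nodes[i];
    }

    for (int i = 0; i < 2; i++)
        remove_if_empty(&nodes[i], &cache);

    size_t live = finalize_distances(&cache);
    free(nodes[0]);         /* nodes[1] was already freed and set to NULL */
    return live;
}
```

Without the loop that clears the cached pointers in remove_if_empty,
finalize_distances would be looking at a stale pointer, which is the kind
of access valgrind reports above.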

I hope the above info is enough and that you can fix it :-)

/Åke S.
