Hello
When you assemble multiple nodes' topologies into a single one, the
resulting topology cannot be used for binding. Binding is only possible
when using objects/cpusets that correspond to the current node. The
assembled topology contains many objects that can't be used for binding:
objects that contain multiple nodes, and objects that don't come from
the node where the current process is running.
Open MPI does not support these cases, hence the crash. You mention that the
individual XMLs work fine, so why did you try the assembled one?
By the way, the ability to assemble multiple topologies (hwloc-assembler)
will be removed in hwloc 2.0.
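If the goal is just binding on each node (rather than viewing the whole
cluster at once), the known-working setup is to keep one XML per node. A
sketch, with illustrative paths that are not from your message:

```shell
# Per-node configuration sketch (paths are hypothetical): each node
# loads only its own patched topology, so every object it contains
# corresponds to the local node and remains usable for binding.
export HWLOC_XMLFILE=/etc/hwloc/$(hostname)_fixed.xml
```

The assembled combo.xml would still be useful for visualization (e.g. with
lstopo), just not as the topology Open MPI binds against.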
Brice
On 30/10/2015 02:13, Andrej Prsa wrote:
> Hi all,
>
> I have a 6-node cluster with the buggy L3 H8QG6 AMD boards. Brice
> Goglin recently provided a fix to Fabian Wein, and I applied the same
> fix (by diffing Fabian's original and Brice's fixed XML and then
> incorporating the equivalent changes to our XML). It did the trick
> perfectly, using openmpi-1.10.0 and hwloc 1.11.1. I then proceeded to
> produce a patched XML for each node in our cluster.
>
> The problem arises when I try to combine the XMLs. To test the assembly
> of just two XMLs, I ran:
>
> hwloc-assembler combo.xml \
> --name clusty clusty_fixed.xml \
> --name node1 node1_fixed.xml
>
> I then set HWLOC_XMLFILE to combo.xml and, when trying to mpirun a test
> program on either of the two nodes, I get a segfault:
>
> andrej@clusty:~/MPI$ mpirun -np 44 python testmpi.py
> [clusty:19136] *** Process received signal ***
> [clusty:19136] Signal: Segmentation fault (11)
> [clusty:19136] Signal code: Address not mapped (1)
> [clusty:19136] Failing at address: (nil)
> [clusty:19136] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fdf37f38340]
> [clusty:19136] [ 1] /usr/local/hwloc/lib/libhwloc.so.5(hwloc_bitmap_and+0x17)[0x7fdf37934e77]
> [clusty:19136] [ 2] /opt/openmpi-1.10.0/lib/libopen-pal.so.13(opal_hwloc_base_filter_cpus+0x37c)[0x7fdf381b239c]
> [clusty:19136] [ 3] /opt/openmpi-1.10.0/lib/libopen-pal.so.13(opal_hwloc_base_get_topology+0xcb)[0x7fdf381b412b]
> [clusty:19136] [ 4] /opt/openmpi-1.10.0/lib/openmpi/mca_ess_hnp.so(+0x47ea)[0x7fdf35c1c7ea]
> [clusty:19136] [ 5] /opt/openmpi-1.10.0/lib/libopen-rte.so.12(orte_init+0x168)[0x7fdf384062b8]
> [clusty:19136] [ 6] mpirun[0x404497]
> [clusty:19136] [ 7] mpirun[0x40363d]
> [clusty:19136] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fdf37b81ec5]
> [clusty:19136] [ 9] mpirun[0x403559]
> [clusty:19136] *** End of error message ***
> Segmentation fault (core dumped)
>
> Each individual XML file works (i.e. no hwloc complaints and mpirun
> works perfectly), but the assembled one does not. I'm attaching all
> three XMLs: clusty.xml, node1.xml and combo.xml. Any ideas?
>
> Thanks,
> Andrej
>
>
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post:
> http://www.open-mpi.org/community/lists/hwloc-users/2015/10/1214.php