Re: [hwloc-users] Assembling multiple node XMLs

2015-10-30 Thread Brice Goglin
Hello
Can you have a startup script set
HWLOC_XMLFILE=/common/path/$(hostname).xml in the system-wide
environment?
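
For example, something like this in a startup script (a minimal
sketch; the /etc/profile.d mechanism is an assumption about your
image, and /common/path is just the shared location from above):

  # /etc/profile.d/hwloc-xml.sh -- one copy on the shared NFS image;
  # $(hostname) selects the per-node fixed topology at login.
  xml="/common/path/$(hostname).xml"
  if [ -r "$xml" ]; then
      export HWLOC_XMLFILE="$xml"
  fi
  unset xml
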
Brice



On 30/10/2015 13:57, Andrej Prsa wrote:
> Hi Brice,
>
>> When you assemble multiple nodes' topologies into a single one, the
>> resulting topology cannot be used for binding. Binding is only
>> possible when using objects/cpusets that correspond to the current
>> node.
> Ah, that explains it.
>
>> Open-MPI does not support these cases, hence the crash. I see that
>> individual XMLs worked fine. So why did you try this?
> I was trying to figure out a way to fix the buggy BIOS topologies for
> all nodes from the server side. In our current setup all the nodes
> are diskless: they boot an identical image from NFS, so I don't know
> how to tell each one which XML file to use. I know that the XML file
> fixes the problem because I logged onto a node, exported
> HWLOC_XMLFILE, and ran mpirun from there. Any suggestions on how to
> do that system-wide?
>
> Thanks,
> Andrej



Re: [hwloc-users] Assembling multiple node XMLs

2015-10-30 Thread Andrej Prsa
Hi Brice,

> When you assemble multiple nodes' topologies into a single one, the
> resulting topology cannot be used for binding. Binding is only
> possible when using objects/cpusets that correspond to the current
> node.

Ah, that explains it.

> Open-MPI does not support these cases, hence the crash. I see that
> individual XMLs worked fine. So why did you try this?

I was trying to figure out a way to fix the buggy BIOS topologies for
all nodes from the server side. In our current setup all the nodes
are diskless: they boot an identical image from NFS, so I don't know
how to tell each one which XML file to use. I know that the XML file
fixes the problem because I logged onto a node, exported
HWLOC_XMLFILE, and ran mpirun from there. Any suggestions on how to
do that system-wide?
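
For reference, the manual test on a single node was roughly this (the
path is illustrative):

  export HWLOC_XMLFILE=/path/to/node1_fixed.xml
  mpirun -np 44 python testmpi.py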

Thanks,
Andrej


Re: [hwloc-users] Assembling multiple node XMLs

2015-10-30 Thread Brice Goglin
Hello

When you assemble multiple nodes' topologies into a single one, the
resulting topology cannot be used for binding. Binding is only possible
when using objects/cpusets that correspond to the current node. The
assembled topology contains many objects that can't be used for binding:
objects that contain multiple nodes, and objects that don't come from
the node where the current process is running.
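
As an illustration with the hwloc command-line tools (a sketch; the
program name is made up):

  # On the node itself, a local object's cpuset is valid for binding:
  hwloc-bind core:0 -- ./myprog
  # hwloc-calc prints the cpuset such a binding would use:
  hwloc-calc core:0

In the assembled topology that same core sits below one particular
host's Machine object, so its cpuset is only meaningful on that host.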

Open-MPI does not support these cases, hence the crash. I see that
individual XMLs worked fine. So why did you try this?

By the way, the ability to assemble multiple topologies will be removed
in hwloc 2.0.

Brice



On 30/10/2015 02:13, Andrej Prsa wrote:
> Hi all,
>
> I have a 6-node cluster built on the AMD H8QG6 boards with the buggy
> L3 topology. Brice Goglin recently provided a fix to Fabian Wein, and
> I applied the same fix (by diffing Fabian's original and Brice's
> fixed XML and incorporating the equivalent changes into our XML). It
> did the trick perfectly with openmpi-1.10.0 and hwloc 1.11.1. I then
> produced a patched XML for each node in our cluster.
>
> The problem arises when I try to combine the XMLs. To test the assembly
> of just two XMLs, I ran:
>
> hwloc-assembler combo.xml \
>   --name clusty clusty_fixed.xml \
>   --name node1 node1_fixed.xml
>
> I then set HWLOC_XMLFILE to combo.xml and, when trying to mpirun a test
> program on either of the two nodes, I get a segfault:
>
> andrej@clusty:~/MPI$ mpirun -np 44 python testmpi.py 
> [clusty:19136] *** Process received signal ***
> [clusty:19136] Signal: Segmentation fault (11)
> [clusty:19136] Signal code: Address not mapped (1)
> [clusty:19136] Failing at address: (nil)
> [clusty:19136] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fdf37f38340]
> [clusty:19136] [ 1] /usr/local/hwloc/lib/libhwloc.so.5(hwloc_bitmap_and+0x17)[0x7fdf37934e77]
> [clusty:19136] [ 2] /opt/openmpi-1.10.0/lib/libopen-pal.so.13(opal_hwloc_base_filter_cpus+0x37c)[0x7fdf381b239c]
> [clusty:19136] [ 3] /opt/openmpi-1.10.0/lib/libopen-pal.so.13(opal_hwloc_base_get_topology+0xcb)[0x7fdf381b412b]
> [clusty:19136] [ 4] /opt/openmpi-1.10.0/lib/openmpi/mca_ess_hnp.so(+0x47ea)[0x7fdf35c1c7ea]
> [clusty:19136] [ 5] /opt/openmpi-1.10.0/lib/libopen-rte.so.12(orte_init+0x168)[0x7fdf384062b8]
> [clusty:19136] [ 6] mpirun[0x404497]
> [clusty:19136] [ 7] mpirun[0x40363d]
> [clusty:19136] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fdf37b81ec5]
> [clusty:19136] [ 9] mpirun[0x403559]
> [clusty:19136] *** End of error message ***
> Segmentation fault (core dumped)
>
> Each individual XML file works (i.e. no hwloc complaints and mpirun
> works perfectly), but the assembled one does not. I'm attaching all
> three XMLs: clusty.xml, node1.xml and combo.xml. Any ideas?
>
> Thanks,
> Andrej