David,

This is because hcoll's symbols conflict with the ml coll module inside OMPI
(HCOLL is derived from the ml module). This is fixed in the hcoll library, and
the fix will be available in the next HPCX release.
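
If you want to double-check the conflict on your build, something along these
lines should show which ml/bcol symbols each library exposes (the paths are
only examples, so adjust them to your install; the exact clashing names may
differ):

  $ nm -D /usr/lib64/libhcoll.so | grep -Ei 'coll_ml|bcol'
  $ nm -D <ompi-prefix>/lib/openmpi/mca_coll_ml.so | grep -Ei 'coll_ml|bcol'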

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:

> Interesting... the seg faults went away:
>
> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439416182.732720] [zo-fe1:14690:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
> [1439416182.733640] [zo-fe1:14689:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
> 0: Running on host zo-fe1.lanl.gov
> 0: We have 2 processors
> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>
> This implies to me that some other library is being used instead of
> /usr/lib64/libhcoll.so, but I am not sure how that could be...
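>
> If it would help, I can try to see which copy of libhcoll the loader actually
> resolves without the preload; I'm thinking of something like the following
> (untested, and the hcoll component path may differ on this system):
>
>   ldd <ompi-prefix>/lib/openmpi/mca_coll_hcoll.so | grep hcoll
>   mpirun -n 2 -x LD_DEBUG=libs ./a.out 2>&1 | grep libhcoll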
>
> Thanks,
> David
>
> On 08/12/2015 03:30 PM, Deva wrote:
>
> Hi David,
>
> I tried the same tarball on OFED-1.5.4.1 and could not reproduce the issue.
> Can you do one more quick test, setting LD_PRELOAD to the hcoll lib?
>
> $ LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out
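>
> If the variable does not reach the MPI ranks that way in your environment,
> letting mpirun export it should be equivalent, e.g.:
>
> $ mpirun -x LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> -n 2 -mca coll ^ml ./a.out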
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>
>> The admin who rolled the hcoll rpm that we're using (and installed it in
>> system space) said that she got it from
>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>>
>> Thanks,
>> David
>>
>>
>> On 08/12/2015 10:51 AM, Deva wrote:
>>
>> Where did you grab this HCOLL lib from? MOFED or HPCX? What version?
>>
>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
>>
>>> Hey Devendar,
>>>
>>> It looks like I still get the error:
>>>
>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>> [1439397957.351764] [zo-fe1:14678:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>> [1439397957.352704] [zo-fe1:14677:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
>>> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>>> ==== backtrace ====
>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>> 4 0x00000000000326a0 killpg()  ??:0
>>> 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>> 10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
>>> 11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
>>> 12 0x0000000000073fc4 ompi_mpi_init()  ??:0
>>> 13 0x0000000000092ea0 PMPI_Init()  ??:0
>>> 14 0x00000000004009b6 main()  ??:0
>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>> 16 0x00000000004008c9 _start()  ??:0
>>> ===================
>>> ==== backtrace ====
>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>> 4 0x00000000000326a0 killpg()  ??:0
>>> 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>> 10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
>>> 11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
>>> 12 0x0000000000073fc4 ompi_mpi_init()  ??:0
>>> 13 0x0000000000092ea0 PMPI_Init()  ??:0
>>> 14 0x00000000004009b6 main()  ??:0
>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>> 16 0x00000000004008c9 _start()  ??:0
>>> ===================
>>> --------------------------------------------------------------------------
>>>
>>> mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited
>>> on signal 11 (Segmentation fault).
>>>
>>> --------------------------------------------------------------------------
>>>
>>> Thanks,
>>> David
>>>
>>> On 08/12/2015 10:42 AM, Deva wrote:
>>>
>>> Hi David,
>>>
>>> This issue is from the hcoll library. It could be because of a symbol
>>> conflict with the ml module. This was fixed recently in HCOLL. Can you try
>>> with "-mca coll ^ml" and see if this workaround works in your setup?
>>>
>>> -Devendar
>>>
>>> On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
>>>
>>>> Hello Gilles,
>>>>
>>>> Thank you very much for the patch! It is much more complete than mine.
>>>> Using that patch and re-running autogen.pl, I am able to build 1.8.8
>>>> with './configure --with-hcoll' without errors.
>>>>
>>>> I do have issues when it comes to running 1.8.8 with hcoll built in,
>>>> however. In my quick sanity test of running a basic parallel hello world C
>>>> program, I get the following:
>>>>
>>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
>>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>>> [1439390789.039197] [zo-fe1:31354:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>> [1439390789.040265] [zo-fe1:31353:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
>>>> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>>>> ==== backtrace ====
>>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>> 4 0x00000000000326a0 killpg()  ??:0
>>>> 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
>>>> 11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
>>>> 12 0x0000000000074ee4 ompi_mpi_init()  ??:0
>>>> 13 0x0000000000093dc0 PMPI_Init()  ??:0
>>>> 14 0x00000000004009b6 main()  ??:0
>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>> 16 0x00000000004008c9 _start()  ??:0
>>>> ===================
>>>> ==== backtrace ====
>>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>> 4 0x00000000000326a0 killpg()  ??:0
>>>> 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
>>>> 11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
>>>> 12 0x0000000000074ee4 ompi_mpi_init()  ??:0
>>>> 13 0x0000000000093dc0 PMPI_Init()  ??:0
>>>> 14 0x00000000004009b6 main()  ??:0
>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>> 16 0x00000000004008c9 _start()  ??:0
>>>> ===================
>>>> --------------------------------------------------------------------------
>>>>
>>>> mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited
>>>> on signal 11 (Segmentation fault).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> I do not get this segfault with only 1 process.
>>>>
>>>> I am using hcoll 3.2.748. Could this be an issue with hcoll itself or
>>>> something with my ompi build?
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
>>>>
>>>> Thanks David,
>>>>
>>>> I made a PR for the v1.8 branch at
>>>> https://github.com/open-mpi/ompi-release/pull/492
>>>>
>>>> The patch is attached (it required some back-porting).
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 8/12/2015 4:01 AM, David Shrader wrote:
>>>>
>>>> I have cloned Gilles' topic/hcoll_config branch and, after running
>>>> autogen.pl, have found that './configure --with-hcoll' does indeed
>>>> work now. I used Gilles' branch as I wasn't sure how best to get the pull
>>>> request changes into my own clone of master. It looks like the proper
>>>> checks are happening, too:
>>>>
>>>> --- MCA component coll:hcoll (m4 configuration macro)
>>>> checking for MCA component coll:hcoll compile mode... dso
>>>> checking --with-hcoll value... simple ok (unspecified)
>>>> checking hcoll/api/hcoll_api.h usability... yes
>>>> checking hcoll/api/hcoll_api.h presence... yes
>>>> checking for hcoll/api/hcoll_api.h... yes
>>>> looking for library without search path
>>>> checking for library containing hcoll_get_version... -lhcoll
>>>> checking if MCA component coll:hcoll can compile... yes
>>>>
>>>> I haven't checked whether or not Open MPI builds successfully as I
>>>> don't have much experience running off of the latest source. For now, I
>>>> think I will try to generate a patch to the 1.8.8 configure script and see
>>>> if that works as expected.
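>>>>
>>>> Roughly what I have in mind, assuming the patch applies cleanly to the
>>>> release tarball (the patch file name here is just a placeholder):
>>>>
>>>>   tar xf openmpi-1.8.8.tar.bz2 && cd openmpi-1.8.8
>>>>   patch -p1 < hcoll-configure.patch
>>>>   ./configure --with-hcoll
>>>>   make && make install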
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
>>>>
>>>> On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
>>>>
>>>> Please fix the hcoll test (and code) to be correct.
>>>>
>>>> Any configure test that adds /usr/lib and/or /usr/include to any compile 
>>>> flags is broken.
>>>>
>>>> +1
>>>>
>>>> Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some 
>>>> comments to it.
>>>>
>>>>
>>>>
>>>> --
>>>> David Shrader
>>>> HPC-3 High Performance Computer Systems
>>>> Los Alamos National Lab
>>>> Email: dshrader <at> lanl.gov
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> David Shrader
>>>> HPC-3 High Performance Computer Systems
>>>> Los Alamos National Lab
>>>> Email: dshrader <at> lanl.gov
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>> -Devendar
>>>
>>>
>>>
>>>
>>> --
>>> David Shrader
>>> HPC-3 High Performance Computer Systems
>>> Los Alamos National Lab
>>> Email: dshrader <at> lanl.gov
>>>
>>>
>>>
>>
>>
>>
>> --
>>
>>
>> -Devendar
>>
>>
>>
>>
>> --
>> David Shrader
>> HPC-3 High Performance Computer Systems
>> Los Alamos National Lab
>> Email: dshrader <at> lanl.gov
>>
>>
>>
>
>
>
> --
>
>
> -Devendar
>
>
> --
> David Shrader
> HPC-3 High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader <at> lanl.gov
>
>


-- 


-Devendar
