It's been a while on this, but we are still having trouble getting Open MPI
to work with InfiniBand on this cluster. We also tried the latest 1.8.4,
but the result is the same.

To recap, MPI initialization fails in the simple hello world C example
(hello_c) when InfiniBand is used. Everything works fine if we explicitly
turn off openib with --mca btl ^openib.

This is the error I got after debugging with gdb as you suggested:

hello_c: connect/btl_openib_connect_udcm.c:736: udcm_module_finalize:
Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *)
(&m->cm_recv_msg_queue))->obj_magic_id' failed.
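
For readers unfamiliar with this kind of check, here is a minimal sketch of
the pattern the assertion appears to enforce. It assumes, without
confirmation from this thread, that a debug build stamps each constructed
object with the magic value and clears it on destruction, so the check fails
for an object that was never constructed or was already destroyed. All
sketch_* names are hypothetical and are not Open MPI's API.

```c
#include <assert.h>   /* only needed if you add assert() calls around the helpers */
#include <stdbool.h>
#include <stdint.h>

/* Same constant expression that appears in the assertion message. */
#define SKETCH_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for an object header; NOT the real opal_object_t. */
typedef struct {
    uint64_t obj_magic_id;
} sketch_object_t;

/* Assumed behavior: constructing an object stamps it with the magic value. */
static void sketch_construct(sketch_object_t *obj)
{
    obj->obj_magic_id = SKETCH_MAGIC_ID;
}

/* True only while the object is constructed and not yet destructed. */
static bool sketch_is_live(const sketch_object_t *obj)
{
    return SKETCH_MAGIC_ID == obj->obj_magic_id;
}

/*
 * Assumed behavior: destructing clears the stamp. Returns false in the
 * case where a debug build would instead abort on a failed assertion.
 */
static bool sketch_destruct(sketch_object_t *obj)
{
    if (!sketch_is_live(obj)) {
        return false;
    }
    obj->obj_magic_id = 0;
    return true;
}
```

Under that assumption, the failure above would mean that
udcm_module_finalize reached cm_recv_msg_queue while it was not in the
constructed state. In the sketch, sketch_destruct() returns false in exactly
that case instead of aborting.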

Thank you,
Saliya

On Mon, Nov 10, 2014 at 10:01 AM, Saliya Ekanayake <esal...@gmail.com>
wrote:

> Thank you Jeff, I'll try this and let you know.
>
> Saliya
> On Nov 10, 2014 6:42 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> wrote:
>
>> I am sorry for the delay; I've been caught up in SC deadlines.  :-(
>>
>> I don't see anything blatantly wrong in this output.
>>
>> Two things:
>>
>> 1. Can you try a nightly v1.8.4 snapshot tarball?  This will check to see
>> if whatever the bug is has been fixed for the upcoming release:
>>
>>     http://www.open-mpi.org/nightly/v1.8/
>>
>> 2. Build Open MPI with the --enable-debug option (note that this adds a
>> slight-but-noticeable performance penalty).  When you run, it should dump a
>> core file.  Load that core file in a debugger and see where it is failing
>> (i.e., file and line in the OMPI source).
>>
>> We don't usually have to resort to asking users to perform #2, but
>> there's no additional information to give a clue as to what is happening.
>> :-(
>>
>>
>>
>> On Nov 9, 2014, at 11:43 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>
>> > Hi Jeff,
>> >
>> > You are probably busy, but just checking if you had a chance to look at
>> > this.
>> >
>> > Thanks,
>> > Saliya
>> >
>> > On Thu, Nov 6, 2014 at 9:19 AM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>> > Hi Jeff,
>> >
>> > I've attached a tar file with information.
>> >
>> > Thank you,
>> > Saliya
>> >
>> > On Tue, Nov 4, 2014 at 4:18 PM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>> > Looks like it's failing in the openib BTL setup.
>> >
>> > Can you send the info listed here?
>> >
>> >     http://www.open-mpi.org/community/help/
>> >
>> >
>> >
>> > On Nov 4, 2014, at 1:10 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I am using OpenMPI 1.8.1 on a Linux cluster that we recently set up.
>> > > It builds fine, but even the simplest hello.c program segfaults when
>> > > I try to run it. Any suggestions on how to correct this?
>> > >
>> > > The steps I followed and the error message are below.
>> > >
>> > > 1. Built OpenMPI 1.8.1 on the cluster. The ompi_info output is attached.
>> > > 2. cd to the examples directory and run mpicc hello_c.c
>> > > 3. mpirun -np 2 ./a.out
>> > > 4. The error text is attached.
>> > >
>> > > Please let me know if you need more info.
>> > >
>> > > Thank you,
>> > > Saliya
>> > >
>> > >
>> > > --
>> > > Saliya Ekanayake esal...@gmail.com
>> > > Cell 812-391-4914 Home 812-961-6383
>> > > http://saliya.org
>> > >
>> > > <ompi_info.txt><error.txt>
>> > > _______________________________________________
>> > > users mailing list
>> > > us...@open-mpi.org
>> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> > > Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/11/25668.php
>> >
>> >
>> > --
>> > Jeff Squyres
>> > jsquy...@cisco.com
>> > For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org
