Do you get any interfaces shown when you run "ibstat" on any of the nodes your job is spawned on?

--td

On 2/15/2012 1:27 AM, Tohiko Looka wrote:
Hmm... This is really strange.
I don't have that service, and there is no ib* entry in the output of 'ifconfig -a' and no 'InfiniBand' in the output of 'lspci', which makes me believe that I don't have such a network. I also checked an identical computer on the same network, with the same results.

What's strange is that these messages didn't use to show up, and they don't show up on that identical computer, only on mine, even though both computers have the same hardware and Open MPI version and are on the same network.

I guess I can safely ignore these warnings and run on Ethernet, but it would be nice to know what happened there, in case anybody has an idea.

Thank you,

On Wed, Feb 15, 2012 at 12:52 AM, Gustavo Correa <g...@ldeo.columbia.edu <mailto:g...@ldeo.columbia.edu>> wrote:

    Hi Tohiko

    OpenFabrics network a.k.a. Infiniband a.k.a. IB.
    To check if the compute nodes have IB interfaces, try:

    lspci [and search the output for InfiniBand]

    To see if the IB interface is configured try:

    ifconfig -a  [and search the output for ib0, ib1, or similar]

    To check if the OFED module is up try:

    'service openibd status'
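
A rough sketch of one more way to run the same kind of check from a script. It assumes a Linux node, where the kernel lists RDMA devices under /sys/class/infiniband. The function name has_ib_devices is made up for this example and is not part of OFED or Open MPI.

```shell
#!/bin/sh
# has_ib_devices [DIR]: succeed if DIR exists and holds at least one entry.
# DIR defaults to /sys/class/infiniband (assumption: Linux sysfs layout).
# The function name is hypothetical, chosen for this sketch only.
has_ib_devices() {
    dir="${1:-/sys/class/infiniband}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if has_ib_devices; then
    echo "InfiniBand/RDMA device(s) found"
else
    echo "no InfiniBand/RDMA devices found"
fi
```

If this reports no devices on every node, the openib warning is most likely harmless and the job is simply falling back to another transport.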


    As an alternative, you could also try to run your program over
    Ethernet, avoiding InfiniBand,
    in case you don't have IB or if it is somehow broken.
    It is slower than InfiniBand, though.

    Try something like this:

    mpiexec -mca btl tcp,sm,self -np 4 ./my_mpi_program
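
A variation on the command above, as a sketch only: instead of listing the transports to use, exclude just the openib one and set the choice through the environment, so it applies to every run started from the same shell. Open MPI reads MCA parameters from environment variables prefixed with OMPI_MCA_, and a leading ^ in the btl list means "all components except these". The program name ./my_mpi_program is the same placeholder as above.

```shell
#!/bin/sh
# Exclude only the openib BTL; Open MPI chooses among the remaining transports.
# The leading ^ means "everything except the listed components".
export OMPI_MCA_btl='^openib'

# Launch only if mpiexec is on PATH and the placeholder program exists here.
if command -v mpiexec >/dev/null 2>&1 && [ -x ./my_mpi_program ]; then
    mpiexec -np 4 ./my_mpi_program
fi
```

With the openib component excluded it is never opened, so the "unable to find any relevant network interfaces" help message should not be printed.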

    I hope this helps,
    Gus Correa

    On Feb 14, 2012, at 4:02 PM, Tohiko Looka wrote:

    > Sorry for the noob question, but how do I check my network type and if OFED service is running correctly or not? And how do I run it
    >
    > Thank you,
    >
    > On Tue, Feb 14, 2012 at 2:14 PM, Jeff Squyres <jsquy...@cisco.com <mailto:jsquy...@cisco.com>> wrote:
    > Do you have an OpenFabrics-based network?  (e.g., InfiniBand or iWarp)
    >
    > If so, this error message usually means that OFED is either installed incorrectly, or is not running properly (e.g., its services didn't get started properly upon boot).
    >
    > If you don't have an OpenFabrics-based network, then it usually means that you have OpenFabrics services running when you really shouldn't (because you don't have any OpenFabrics-based devices).
    >
    >
    > On Feb 14, 2012, at 4:48 AM, Tohiko Looka wrote:
    >
    > > Greetings,
    > >
    > > Until today I was running my openmpi applications with no errors/warnings
    > > Today I restarted my computer (possibly after an automatic openmpi update) and got these warnings when
    > > running my program
    > > [tohiko@kw12614 1d]$ mpirun -x LD_LIBRARY_PATH -hostfile hosts -np 10 hello
    > > librdmacm: couldn't read ABI version.
    > > librdmacm: assuming: 4
    > > CMA: unable to get RDMA device list
    > > --------------------------------------------------------------------------
    > > [[21652,1],0]: A high-performance Open MPI point-to-point messaging module
    > > was unable to find any relevant network interfaces:
    > >
    > > Module: OpenFabrics (openib)
    > >   Host: kw12614
    > >
    > > Another transport will be used instead, although this may result in
    > > lower performance.
    > > --------------------------------------------------------------------------
    > > [kw12614:03195] 10 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
    > > [kw12614:03195] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
    > >
    > >
    > > Is this normal? And how come it happened now?
    > > -- Tohiko
    > > _______________________________________________
    > > users mailing list
    > > us...@open-mpi.org <mailto:us...@open-mpi.org>
    > > http://www.open-mpi.org/mailman/listinfo.cgi/users
    >
    >
    > --
    > Jeff Squyres
    > jsquy...@cisco.com <mailto:jsquy...@cisco.com>
    > For corporate legal information go to:
    > http://www.cisco.com/web/about/doing_business/legal/cri/
    >
    >

--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>


