Hi Cristobal

Cristobal Navarro wrote:


On Wed, Jul 28, 2010 at 11:09 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:

    Hi Cristobal

    In case you are not using full path name for mpiexec/mpirun,
    what does "which mpirun" say?


--> $which mpirun
      /opt/openmpi-1.4.2


    Often this is a source of confusion: old versions may
    be first on the PATH.

    Gus


The OpenMPI version problem is now gone; I can confirm that the version is consistent now :), thanks.


This is good news.
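
If the mixup ever comes back, here is a rough sketch (my own helper, not part of OpenMPI) that lists every mpirun and mpiexec on your PATH in search order, assuming a POSIX shell. The first path printed for each name is the one that actually runs; any others are leftovers that can cause the confusion.

```shell
#!/bin/sh
# list_on_path is a made-up helper name, not an OpenMPI tool.
# Print every executable named $1 found on PATH, in search order.
list_on_path() {
    cmd=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -x "$dir/$cmd" ] && [ ! -d "$dir/$cmd" ]; then
            printf '%s\n' "$dir/$cmd"
        fi
    done
    IFS=$old_ifs
}

list_on_path mpirun
list_on_path mpiexec
```

Running ompi_info from the same bin directory should then report the matching version.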

However, I keep getting this kernel crash randomly when I execute with -np higher than 5.
These are Xeons with Hyperthreading on. Is that a problem?


The problem may be with Hyperthreading, maybe not.
Which Xeons?
If I remember right, the old hyperthreading on old Xeons was problematic.

OTOH, about 1-2 months ago I had trouble with OpenMPI on a relatively new Xeon Nehalem machine with (the new) Hyperthreading turned on,
and Fedora Core 13.
The machine would hang with the OpenMPI connectivity example.
I reported this to the list; you may find it in the archives.
Apparently other people got everything (OpenMPI with HT on Nehalem)
working in more stable distributions (CentOS, RHEL, etc).

That problem was likely in the FC13 kernel,
because even with HT turned off the machine still hung.
Nothing worked with shared memory turned on,
so I had to switch OpenMPI to use tcp instead,
which is kind of ridiculous on a standalone machine.


I'm trying to locate the kernel error in the logs, but after rebooting from a crash, the error is not in kern.log (nor in kern.log.1).
All I remember is that it starts with "Kernel BUG..."
and somewhere it mentions a certain CPU X, where that CPU can be any from 0 to 15 (I'm testing only on the main node). Does anyone know where the log of the kernel error could be?
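
As for where the message ends up, I am not sure for your distribution, and if the machine locks up hard the message may never reach the disk at all. Still, a minimal sketch, assuming Debian/Ubuntu-style file names under /var/log (kern.log, syslog, messages) and a helper name I made up, that lists the uncompressed log files mentioning a kernel BUG:

```shell
#!/bin/sh
# find_kernel_bug is a hypothetical helper; the file names are an
# assumption about the log layout, so adjust them for your system.
find_kernel_bug() {
    logdir=${1:-/var/log}
    for f in "$logdir"/kern.log* "$logdir"/syslog* "$logdir"/messages*; do
        # Skip unmatched globs and unreadable files.
        [ -f "$f" ] && [ -r "$f" ] || continue
        # Compressed rotations need zgrep; leave them out here.
        case $f in *.gz) continue ;; esac
        # Case-insensitive, since the kernel may print "kernel BUG".
        if grep -q -i 'kernel BUG' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

find_kernel_bug /var/log
```

You will probably need to run it as root to read everything under /var/log.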


Have you tried turning off hyperthreading?
In any case, depending on the application, having HT on may not help performance much.

A more radical alternative is to try
-mca btl tcp,self
in the mpirun command line.
That is what worked in the case I mentioned above.
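
Something along these lines, where the process count and the program name are placeholders of mine, and btl_tcp_cmd is just a made-up helper that assembles the command line:

```shell
#!/bin/sh
# Hypothetical values; adjust to your own job.
NP=8
APP=./my_mpi_program

# Build an mpirun command line that leaves out the shared-memory BTL:
# "tcp" carries traffic between processes, "self" lets a process
# send to itself.
btl_tcp_cmd() {
    printf 'mpirun -np %s -mca btl tcp,self %s\n' "$1" "$2"
}

cmd=$(btl_tcp_cmd "$NP" "$APP")
echo "$cmd"

# Run it only when mpirun and the program are really there.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    $cmd
fi
```

If the crash goes away with this, the shared memory path becomes the main suspect.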

My $0.02
Gus Correa


    Cristobal Navarro wrote:


        On Tue, Jul 27, 2010 at 7:29 PM, Gus Correa
        <g...@ldeo.columbia.edu> wrote:

           Hi Cristobal

           Does it run on the head node alone?
           (Fuego? Agua? Acatenango?)
           Try to put only the head node in the hostfile and execute
           with mpiexec.

        --> I will try with only the head node and post the results back.
           This may help sort out what is going on.
           Hopefully it will run on the head node.
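
A sketch of what that could look like, assuming the usual OpenMPI hostfile syntax of one host per line with an optional slots count. The helper name, the slot count and the file name are placeholders of mine:

```shell
#!/bin/sh
# head_hostfile_line is a made-up helper. It prints one hostfile
# line for the machine it runs on; "slots" is the number of
# processes OpenMPI may place there.
head_hostfile_line() {
    printf '%s slots=%s\n' "$(uname -n)" "$1"
}

# 4 is a placeholder, not a recommendation.
head_hostfile_line 4
```

Redirect the output to a file, e.g. head_hostfile_line 4 > headnode.hosts, and then run something like mpiexec -hostfile headnode.hosts -np 4 ./my_mpi_program.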

           Also, do you have InfiniBand connecting the nodes?
           The error messages refer to the openib btl (i.e. InfiniBand),
           and complain of


        No, we are just using a normal 100 Mbit/s network, since I am
        only testing for now.


           "perhaps a missing symbol, or compiled for a different
           version of Open MPI?".
           It sounds like a mix-up of versions/builds.


        --> I agree, somewhere there must be remains of the older
        version.

           Did you configure/build OpenMPI from source, or did you install
           it with apt-get?
           It may be easier/less confusing to install from source.
           If you did, what configure options did you use?


        --> I installed from source: ./configure
        --prefix=/opt/openmpi-1.4.2 --with-sge --without-xgrid
        --disable-static

           Also, as for the OpenMPI runtime environment,
           it is not enough to set it on
           the command line, because it will be effective only on the
           head node.
           You need to either add the OpenMPI bin and lib directories
           to the PATH and LD_LIBRARY_PATH
           in your .bashrc/.cshrc files (assuming these files and your home
           directory are *also* shared with the nodes via NFS),
           or use the --prefix option of mpiexec to point to the OpenMPI
           main directory.
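
For the .bashrc route, a minimal sketch, assuming the install prefix from the configure line above (/opt/openmpi-1.4.2). OMPI_HOME is just a variable name picked for the example:

```shell
# Assumption: OpenMPI was installed under this prefix.
OMPI_HOME=/opt/openmpi-1.4.2

# Put this install first so older copies elsewhere are not picked up.
export PATH="$OMPI_HOME/bin:$PATH"

# ${VAR:+...} avoids a trailing ":" (which would mean the current
# directory) when LD_LIBRARY_PATH was empty to begin with.
export LD_LIBRARY_PATH="$OMPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

The --prefix route would be something like mpiexec --prefix /opt/openmpi-1.4.2 -np 4 ./my_mpi_program, where the process count and the program name are again placeholders.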


        Yes, all nodes have their PATH and LD_LIBRARY_PATH set up
        properly inside the login scripts (.bashrc in my case).

           Needless to say, you need to check and ensure that the OpenMPI
           directory (and maybe your home directory and your work
           directory) is (are) really mounted on the nodes.


        --> Yes, I double-checked that they are.

           I hope this helps,


        --> thanks really!

           Gus Correa

           Update: I just reinstalled OpenMPI with the same parameters,
           and it seems that the problem is gone. I couldn't test it
           fully, but when I get back to the lab I'll confirm.

        best regards! Cristobal


        ------------------------------------------------------------------------

        _______________________________________________
        users mailing list
        us...@open-mpi.org
        http://www.open-mpi.org/mailman/listinfo.cgi/users


