[Pw_forum] problem with neb calculations / openmpi

Axel Kohlmeyer Wed, 14 Mar 2012 16:36:15 -0400

On Wed, Mar 14, 2012 at 4:18 PM, Torstein Fjermestad
<torstein.fjermestad at kjemi.uio.no> wrote:
> Dear all,
>
> I recently installed Quantum ESPRESSO v4.3.2 in my home directory on an
> external supercomputer cluster.
> The way I did this was to execute the following commands:
>
> ./configure
> make all
>
> after first having loaded the MPI environment and the Fortran and C
> compilers with the following commands:
> module load openmpi
> module load g95/093
> module load gcc
>
> ./configure was successful and make seemed to finish normally (at least
> I did not get any error message).
>
> So far I have only been using the pw.x and neb.x executables.
> In a file named "slurm-jobID.out" that is generated by the queuing
> system, I get the following message when running both pw.x and neb.x:
>
> mca: base: component_find: unable to open
> /site/VERSIONS/openmpi-1.3.3.gnu/lib/openmpi/mca_mtl_psm: perhaps a
> missing symbol, or compiled for a different version of Open MPI?
> (ignored)
>
> This message seems rather clear, but I am not sure how relevant it is
> because pw.x runs without problem on 64 processors (I have compared the
> output with that generated on another machine). neb.x on the other hand
> works when running on a single processor, but fails when running in
> parallel (yes, I have used the -inp option).
>
> The output of the neb calculation is only 13 lines and the last three
> lines are
>
>      Parallel version (MPI), running on    16 processors
>      path-images division:  nimage    =   10
>      R & G space division:  proc/pool =   16
>
>
> In the output files out.n_0 where n={1,9} the error message
>
>      Message from routine  read_line :
>      read error
>
> is repeated several thousand times.
>
>
> I have a feeling that there is something I have got wrong with the
> parallel environment. If I (accidentally) compiled QE for a different
> openmpi version than 1.3.3.gnu, it would be interesting to know which
> one. Does anyone have an idea on how I can check this?
>
> In case the cause of the problem is a different one, it would be nice
> if someone had any suggestions on how to solve it.
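
One possible way to approach the "which Open MPI" question is to look at which MPI shared libraries the executable resolves at run time, as reported by `ldd`. The sketch below only parses `ldd`-style text. The function name `mpi_libraries` and the library name `libmpi.so.0` in the sample are assumptions for illustration, and the sample path simply reuses the prefix from the error message above. It shows what the dynamic loader would pick up in the current environment, which is not necessarily what was used at compile time, and a statically linked binary would list no MPI library at all.

```python
import re

# Matches ldd lines such as
#   libmpi.so.0 => /some/prefix/lib/libmpi.so.0 (0x00002b...)
#   libmpi.so.0 => not found
# and keeps only entries whose library name contains "mpi".
_LDD_MPI_LINE = re.compile(
    r"^\s*(\S*mpi\S*)\s+=>\s+(.+?)(?:\s+\(0x[0-9a-fA-F]+\))?\s*$"
)


def mpi_libraries(ldd_output):
    """Return (library name, resolved path) pairs for MPI libraries.

    ldd_output is the text printed by `ldd <executable>`.
    A library that the loader cannot find is reported as "not found".
    """
    found = []
    for line in ldd_output.splitlines():
        match = _LDD_MPI_LINE.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


# Illustrative input only; not real output from the cluster in question.
sample = (
    "\tlibmpi.so.0 => /site/VERSIONS/openmpi-1.3.3.gnu/lib/libmpi.so.0 (0x00002b0000000000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00002b0000400000)\n"
)
for name, path in mpi_libraries(sample):
    print(name, "->", path)
```

The input text could be obtained with `subprocess.run(["ldd", "path/to/neb.x"], capture_output=True, text=True).stdout`, run after loading the same modules that the batch job loads, so that the result reflects the job's environment.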


this sounds a lot like one of the nodes that you are using
has a network problem and you are trying to read from
an NFS exported directory, but only get i/o errors. the OpenMPI
based error supports this. at least, i have only seen this
kind of error when one of the nodes in a parallel job had
to be rebooted hard because of an obscure and rarely
triggered bug in the ethernet driver.

you should check whether this happens always, or only when
one specific node is assigned to your job.
i would also talk to the sysadmin of the machine.
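
a minimal sketch of that check, assuming you have noted by hand which nodes each job ran on and whether it failed (the function name `suspect_nodes`, the hostnames, and the job records are all made up for illustration): it keeps the nodes that appear in every failed job and in no successful one.

```python
def suspect_nodes(jobs):
    """Return nodes present in every failed job and in no successful job.

    jobs is a list of (nodes, failed) pairs, where nodes is an iterable
    of hostnames and failed is True if that job showed the read errors.
    """
    failed = [set(nodes) for nodes, did_fail in jobs if did_fail]
    succeeded = [set(nodes) for nodes, did_fail in jobs if not did_fail]
    if not failed:
        return set()
    in_all_failures = set.intersection(*failed)
    seen_healthy = set().union(*succeeded)
    return in_all_failures - seen_healthy


# Hypothetical job records; replace with the node lists of real jobs.
jobs = [
    (["c1", "c2"], True),
    (["c2", "c3"], True),
    (["c1", "c3"], False),
]
print(sorted(suspect_nodes(jobs)))
```

an empty result would suggest the failure does not track a single node, which would point away from a per-node hardware or network fault.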

HTH,
    axel.


>
> Thank you very much in advance.
>
> Yours sincerely,
>
> Torstein Fjermestad
> University of Oslo,
> Norway



-- 
Dr. Axel Kohlmeyer
akohlmey at gmail.com  http://goo.gl/1wk0

College of Science and Technology
Temple University, Philadelphia PA, USA.
