The most common problem is use of the wrong version of BLACS -- the Intel
link advisor will tell you which one to use.

I have very, very rarely seen anything beyond a wrong version of BLACS.
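For reference, MKL ships a separate BLACS interface library for each MPI implementation, and an Open MPI build must link the Open MPI variant. A minimal sketch of the ScaLAPACK portion of a link line (the `$MKLROOT` layout and the threading libraries shown are assumptions for a Composer XE 2015 install; verify the exact line with the Intel link advisor):

```shell
# ScaLAPACK link fragment for an Open MPI build (LP64 interface).
# Linking -lmkl_blacs_intelmpi_lp64 here instead, while running under
# Open MPI, is the classic cause of segfaults inside blacs_pinfo.
SCALAPACK_LIBS="-L$MKLROOT/lib/intel64 \
  -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
  -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm"
```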

On Sun, Jan 10, 2016 at 6:27 PM, Gavin Abo <gs...@crimson.ua.edu> wrote:

> From the backtrace, it does look like it crashed in libmpi.so.1, which I
> believe is an Open MPI library.  I don't know if it will solve the problem
> or not, but I would try a different Open MPI version or recompile Open MPI
> (while tweaking the configuration options [
> https://software.intel.com/en-us/articles/performance-tools-for-software-developers-building-open-mpi-with-the-intel-compilers
> ]).
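Rebuilding Open MPI with the Intel compilers follows the recipe in the article linked above; a minimal sketch (the version and install prefix are illustrative, not prescriptive):

```shell
# Configure Open MPI to build with the Intel compilers.
# Prefix and version are examples only -- adjust to your site.
./configure --prefix=/opt/openmpi-intel/1.10.1 \
    CC=icc CXX=icpc F77=ifort FC=ifort
make -j 8 all
make install
```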
>
> composer_xe_2015.3.187 => ifort version 15.0.3 [
> https://software.intel.com/en-us/articles/intel-compiler-and-composer-update-version-numbers-to-compiler-version-number-mapping
> ]
>
> In the post at the following link on the Intel forum it looks like
> openmpi-1.10.1rc2 (or newer) was recommended for ifort 15.0 (or newer) to
> resolve a Fortran run-time library (RTL) issue:
>
>
> https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/564266
>
> On 1/10/2016 3:42 PM, Hu, Wenhao wrote:
>
>
> (I accidentally replied with the wrong title. To keep the thread
> consistent, I am sending this post again. Perhaps the mailing list
> manager can delete the wrong post for me.)
>
> Hi, Peter:
>
> Thank you very much for your reply. Following your suggestion, I made
> all libraries (MKL, FFTW, Open MPI, etc.) consistent with Intel
> Composer XE 2015, either by recompiling them with it or by using the
> matching builds, and recompiled WIEN2k. My Open MPI version is 1.6.5.
> However, I still get the same problem. Besides the message I posted
> earlier, I also have the following backtrace from the process:
>
> lapw1c_mpi:14596 terminated with signal 11 at PC=2ab4dac4df79
> SP=7fff78b8e310.  Backtrace:
>
> lapw1c_mpi:14597 terminated with signal 11 at PC=2b847d2a1f79
> SP=7fff8ef89690.  Backtrace:
>
> /opt/openmpi-intel-composer_xe_2015.3.187/1.6.5/lib/libmpi.so.1(MPI_Comm_size+0x59)[0x2ab4dac4df79]
>
> /opt/openmpi-intel-composer_xe_2015.3.187/1.6.5/lib/libmpi.so.1(MPI_Comm_size+0x59)[0x2b847d2a1f79]
> /Users/wenhhu/wien2k14/lapw1c_mpi(blacs_pinfo_+0x92)[0x49cf02]
> /Users/wenhhu/wien2k14/lapw1c_mpi(blacs_pinfo_+0x92)[0x49cf02]
>
> /opt/intel/composer_xe_2015.3.187/mkl/lib/intel64/libmkl_scalapack_lp64.so(sl_init_+0x21)[0x2b8478d2e171]
>
> /opt/intel/composer_xe_2015.3.187/mkl/lib/intel64/libmkl_scalapack_lp64.so(sl_init_+0x21)[0x2ab4d66da171]
>
> /Users/wenhhu/wien2k14/lapw1c_mpi(parallel_mp_init_parallel_+0x63)[0x463cd3]
>
> /Users/wenhhu/wien2k14/lapw1c_mpi(parallel_mp_init_parallel_+0x63)[0x463cd3]
> /Users/wenhhu/wien2k14/lapw1c_mpi(gtfnam_+0x22)[0x426372]
> /Users/wenhhu/wien2k14/lapw1c_mpi(MAIN__+0x6c)[0x4493dc]
> /Users/wenhhu/wien2k14/lapw1c_mpi(main+0x2e)[0x40d19e]
> /Users/wenhhu/wien2k14/lapw1c_mpi(gtfnam_+0x22)[0x426372]
> /Users/wenhhu/wien2k14/lapw1c_mpi(MAIN__+0x6c)[0x4493dc]
> /Users/wenhhu/wien2k14/lapw1c_mpi(main+0x2e)[0x40d19e]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x339101ed5d]
> /Users/wenhhu/wien2k14/lapw1c_mpi[0x40d0a9]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x339101ed5d]
> /Users/wenhhu/wien2k14/lapw1c_mpi[0x40d0a9]
>
> Do you think this is still a problem with my MKL, or are there other
> issues I am missing?
>
> Best,
> Wenhao
>
>
>
> a) Clearly, for a nanowire simulation the mpi-parallelization is best.
> Unfortunately, on some clusters mpi is not set up properly, or users do
> not use the proper mkl-libraries for the particular mpi. Please use the
> Intel link-library advisor, as was mentioned in previous posts. The
> mkl-scalapack will NOT work unless you use the proper version of the
> blacs_lp64 library.
> b) As a short term solution you should:
>
> i) Use a parallelization with OMP_NUM_THREADS=2. This speeds up the
> calculation by nearly a factor of 2 and uses 2 cores in a single lapw1
> without a memory increase.
>
> ii) Reduce the number of k-points. I'm pretty sure you can reduce it to
> 2-4 for scf and structure optimization. This will save memory due to
> fewer k-parallel jobs.
>
> iii) During structure optimization you will end up with very small Si-H
> and C-H distances. So I'd reduce the H sphere right now to about 0.6,
> but keep Si and C large (for C use around 1.2). With such a setup, a
> preliminary structure optimization can be done with RKMAX=2.0, which
> should later be checked with 2.5 and 3.0.
>
> iv) Use iterative diagonalization! After the first cycle, this will
> speed up the scf by a factor of 5!
>
> v) And of course, reconsider the size of your "vacuum", i.e. the
> separation of your wires. "Vacuum" is VERY expensive in terms of memory
> and one should not set it too large without testing. Optimize your wire
> with small a,b; then increase the vacuum later on (x supercell) and
> check whether forces appear again and whether distances, band
> structure, ... change.
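Suggestion (i) can be sketched at the job-script level: set OMP_NUM_THREADS and write a k-point-parallel .machines with one job line per k-point. The hostname "node1" and the count of 8 jobs are placeholders; the .machines syntax itself is documented in the WIEN2k user's guide.

```shell
# Sketch: 8 k-point-parallel lapw1 jobs, each allowed 2 OpenMP threads.
# "node1" is a placeholder hostname.
export OMP_NUM_THREADS=2

rm -f .machines
echo "granularity:1" >> .machines
for k in 1 2 3 4 5 6 7 8; do
    echo "1:node1" >> .machines      # one k-point job per line
done
echo "lapw0:node1" >> .machines
```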
>
> Am 09.01.2016 um 22:07 schrieb Hu, Wenhao:
>
> Hi, Marks and Peter:
>
> Thank you for your suggestions. I have several follow-up questions.
> I am using a mid-sized cluster at my university, which has 16 cores
> and 64 GB of memory on standard nodes. The calculation I am running is
> k-point parallelized, not MPI parallelized. From the :RKM flag I
> posted in my first email, I estimate that the matrix size I need for
> RKmax=5+ will be at least 40000. In my current calculation, lapw1
> occupies up to 3 GB on each slot (1 k-point/slot), so I estimate the
> memory per slot will be at least 12 GB. I have 8 k-points, so at least
> 96 GB of memory would be required (if my estimate is correct). Given
> the computational resources I have, this is far too memory-demanding:
> our cluster imposes a 4 GB memory limit per slot on standard nodes. I
> can request a high-memory node, but those are heavily contended among
> cluster users. Do you have any suggestions for accomplishing this
> calculation within the limits of my cluster?
>
> As for the details of my calculation, the material I'm looking at is a
> hydrogen-terminated silicon carbide with 56 atoms. A 1x1x14 k-mesh is
> used for k-point sampling. The radius of 1.2 actually comes from
> setrmt_lapw. Indeed, the hydrogen radius is too large, and I keep
> adjusting it as the optimization progresses. The main reason I have
> such a huge matrix is the size of my unit cell: I am using a large
> unit cell to isolate the coupling between neighboring nanowires.
>
> Besides the above questions, I also ran into problems with the MPI
> calculation. Following Marks' suggestion on parallel calculation, I
> want to test the efficiency of MPI, since I have only used k-point
> parallelized calculations before. The MPI installed on my cluster is
> Open MPI. In the output file, I get the following error:
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  LAPW0 END
>
> lapw1c_mpi:19058 terminated with signal 11 at PC=2b56d9118f79
> SP=7fffc23d6890.  Backtrace:
> ...
> mpirun has exited due to process rank 14 with PID 19061 on
> node neon-compute-2-25.local exiting improperly. There are two reasons
> this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> Uni_+6%.scf1up_1: No such file or directory.
> grep: *scf1up*: No such file or directory
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> The job script I’m using is:
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> #!/bin/bash
> #$ -S /bin/bash
> #
> #$ -N uni_6
> #$ -q MF
> #$ -m be
> #$ -M wenhao...@uiowa.edu
>
> #$ -pe smp 16
> #$ -cwd
> #$ -j y
>
> cp $PE_HOSTFILE hostfile
> echo "PE_HOSTFILE:"
> echo $PE_HOSTFILE
> rm .machines
> echo granularity:1 >>.machines
> while read hostname slot useless; do
>     i=0
>     l0=$hostname
>     while [ $i -lt $slot ]; do
>         echo 1:$hostname:2 >>.machines
>         let i=i+2
>     done
> done<hostfile
>
> echo lapw0:$l0:16 >>.machines
>
> runsp_lapw -p -min -ec 0.0001 -cc 0.001 -fc 0.5
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
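To make the loop's behavior concrete, the following self-contained sketch runs the same logic against a hypothetical one-line PE_HOSTFILE (host "node-a", 16 slots; the queue fields are made up). Since the counter advances by 2 per emitted line, 16 slots yield eight `1:host:2` entries, i.e. eight k-point-parallel jobs of 2 MPI cores each, plus a 16-core lapw0 line. Note the sketch uses `i=$((i+2))`, the POSIX form of the script's `let i=i+2`.

```shell
# Hypothetical PE_HOSTFILE: "hostname slots queue processor-range".
printf 'node-a 16 MF.q UNDEFINED\n' > hostfile

rm -f .machines
echo granularity:1 >> .machines
while read hostname slot useless; do
    i=0
    l0=$hostname
    # One .machines line per 2 slots: a 2-core MPI job per k-point.
    while [ "$i" -lt "$slot" ]; do
        echo "1:$hostname:2" >> .machines
        i=$((i+2))
    done
done < hostfile
echo "lapw0:$l0:16" >> .machines

cat .machines
```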
>
> Is there any mistake I made or something missing in my script?
>
> Thank your very much for your help.
>
> Wenhao
>
>
> I do not know many compounds for which an RMT=1.2 bohr for H makes
> any sense (maybe LiH). Use setrmt and follow its suggestion. Usually,
> H spheres for CH or OH bonds should be less than 0.6 bohr.
> Experimental H-positions are often very unreliable.
> How many k-points? Often 1 k-point is enough for 50+ atoms (at least
> at the beginning), in particular when you have an insulator.
> Otherwise, follow the suggestions of L. Marks about parallelization.
>
>
> Am 08.01.2016 um 07:28 schrieb Hu, Wenhao:
>
> Hi, all:
>
> I have some questions about the RKmax in calculations with 50+ atoms.
> In my WIEN2k, NMATMAX and NUME are set to 15000 and 1700. With the
> highest NE and NAT, RKmax can only be as large as 2.05, which is much
> lower than the value suggested on the WIEN2k FAQ page (the smallest
> atom in my case is an H atom with a radius of 1.2). Checking the :RKM
> flag in case.scf, I have the following information:
>
> :RKM  : MATRIX SIZE 11292  LOs: 979  RKM= 2.05  WEIGHT= 1.00  PGR:
>
> With such a matrix size, a single cycle can take as long as two and a
> half hours. Although I can increase NMATMAX and NUME to raise RKmax,
> the calculation will be much slower, which would make the structure
> optimization almost impossible. Before running a convergence test on
> RKmax, can anyone tell me whether such an RKmax is a reasonable value?
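For a fixed unit cell, the LAPW basis size grows roughly as the cube of RKmax (plane waves are included up to Kmax = RKmax/RMT_min, and their number scales with Kmax cubed). A rough cross-check of what a larger RKmax would cost, assuming pure cubic scaling from the reported 11292 basis functions at RKM = 2.05 (a scaling argument only; the true count depends on the cell volume):

```shell
# Cubic-scaling estimate of the matrix size at RKmax = 5.0, starting
# from 11292 basis functions at RKmax = 2.05. Assumes a fixed cell and
# fixed RMT_min; the result is an order-of-magnitude guide, not exact.
awk 'BEGIN { printf "%d\n", 11292 * (5.0 / 2.05)^3 }'
```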
>
> If any further information is needed, please let me know. Thanks in
> advance.
>
> Best,
> Wenhao
>
> _______________________________________________
> Wien mailing list
> Wien@zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>
> --
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
> Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
> WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php


-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
Corrosion in 4D: MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
