Dear WIEN2k community members: We have encountered a problem when running in parallel (k-point, MPI, or both) -- the calculations crash at LAPW2. Note that we have no problem running in serial. We have tried to diagnose the problem, recompiled the code with different options, and tested with different cases and parameters based on similar problems reported on the mailing list, but the problem persists. So we are writing here hoping someone can offer a suggestion. We have attached the related files below for your reference. Your replies are appreciated in advance!
This is a TiC example running in both k-point and MPI parallel mode on two nodes, r1i0n0 and r1i0n1 (8 cores/node):

1. stdout (abridged)

MPI: invalid option -machinefile
real 0m0.004s
user 0m0.000s
sys 0m0.000s
...
MPI: invalid option -machinefile
real 0m0.003s
user 0m0.000s
sys 0m0.004s
TiC.scf1up_1: No such file or directory.
LAPW2 - Error. Check file lapw2.error
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory

2. TiC.dayfile (abridged)

...
start (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
cycle 1 (Sun Oct 31 16:25:06 MDT 2010) (40/99 to go)
> lapw0 -p (16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 2010
-------- .machine0 : 16 processors
invalid "local" arg: -machinefile
0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
> lapw1 -up -p (16:25:12) starting parallel lapw1 at Sun Oct 31 16:25:12 MDT 2010
-> starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
running LAPW1 in parallel mode (using .machines)
2 number_of_parallel_jobs
r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)
r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
Summary of lapw1para:
r1i0n0 k=0 user=0 wallclock=0
r1i0n1 k=0 user=0 wallclock=0
...
0.116u 0.316s 0:10.48 4.0% 0+0k 0+0io 0pf+0w
> lapw2 -up -p (16:25:34) running LAPW2 in parallel mode
** LAPW2 crashed!
0.032u 0.104s 0:01.13 11.5% 0+0k 82304+0io 8pf+0w
error: command /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def failed

3. uplapw2.error

Error in LAPW2
'LAPW2' - can't open unit: 18
'LAPW2' - filename: TiC.vspup
'LAPW2' - status: old
'LAPW2' - form: formatted
** testerror: Error in Parallel LAPW2

4. .machines

#
1:r1i0n0:8
1:r1i0n1:8
lapw0:r1i0n0:8 r1i0n1:8
granularity:1
extrafine:1

5.
compilers, MPI and options

Intel Compilers and MKL 11.1.046
Intel MPI 3.2.0.011
current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

Best regards,
Wei Xie
Computational Materials Group
University of Wisconsin-Madison
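P.S. In case it helps anyone looking at this: the "MPI: invalid option -machinefile" lines suggest that the mpirun resolved at run time may not be Intel MPI's launcher (which does accept -machinefile), but some other MPI's mpirun first in PATH -- the "MPI:" message prefix and the r1i0n0/r1i0n1 node names are typical of SGI systems, whose MPT mpirun uses a different host syntax. A minimal, hypothetical check (not part of the WIEN2k scripts) of which launcher the batch environment actually picks up would be:

```shell
# Hypothetical diagnostic: report which mpirun binary PATH resolves to,
# and whether that launcher documents a -machinefile option at all.
MPIRUN_BIN=$(command -v mpirun || true)
if [ -n "$MPIRUN_BIN" ]; then
    echo "mpirun resolves to: $MPIRUN_BIN"
    # Intel MPI's mpirun lists -machinefile in its help output; a launcher
    # that rejects the flag will typically not mention it here.
    mpirun -h 2>&1 | grep -i -- -machinefile \
        || echo "no -machinefile option reported by this mpirun"
else
    echo "mpirun not found in PATH"
fi
```

If a different launcher is being invoked, the -machinefile argument in current:MPIRUN would be handed to a program that does not understand it, which would match the stdout and dayfile output above.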