Hi all, I am having trouble getting MPICH2 working with Quantum Espresso. It seems to work somewhat, but with errors.
There may be an issue with my environment setup. Some of it is guesswork, as I couldn't find much documentation for espresso and mpich2.

server4 is the system the user will submit from. server3 and server5 are extra nodes on the same subnet. mpich2 was built with gfortran support, and ssh keys are set up with no message of the day (MOTD) so the nodes can talk properly.

The espresso directory /usr/local/espresso-4.2.1 is located on server4 and NFS-shared out to server3 and server5 in read/write mode. (When I initially had NFS set to read-only, espresso was unable to write a temp file to the tests directory, so I made it a read-write NFS export.) This means all three systems have read-write access to the same espresso directory. Is this correct?

Extra info: inside the examples/environment_variables file I have the line "TMP_DIR=/home/mpiexec_espresso_tmp", however no files were written there (each system has its own instance of that dir). It seems to just use /usr/local/espresso-4.2.1/tmp, as I can see files there with today's date. This is fine with me if it prefers to use the NFS partition. I tried changing PARA_PREFIX and PARA_POSTFIX inside the variables file, but I only ran into worse issues.

Here is my test command:

    mpiexec -f ~/mpiMachinefile.txt -n 10 -wdir /usr/local/espresso-4.2.1/tests ./check-pw.x.j

Here are the contents of my ~/mpiMachinefile.txt file:

    server3:24
    server4:24
    server5:24

I have also tried with -n 30, and it seemed to put most of the processes on the first server in the list, as expected. However, I never saw 10 or more processes using 100% of a core in 'top'. When there were many processes at once, they were each using only a small percentage of resources, with one process using 100% of a core/CPU. From my understanding there should be 24 different processes per machine at 100%, depending on what it's doing.

In summary: the errors are troubling, and I don't think the systems are using their full potential for simulating.
Below is the first part of the command-line output from mpiexec with the -n 10 option. Any advice appreciated.

Thanks,
Nick - Linux Administrator

$ mpiexec -f ~/mpiMachinefile.txt -n 10 -wdir /usr/local/espresso-4.2.1/tests ./check-pw.x.j
Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking atom...passed
Checking atom-lsda...passed
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
passed
awk: cmd. line:6: fatal: cannot open file `atom.tmp' for reading (No such file or directory)
/bin/rm: cannot remove `atom.tmp': No such file or directory
Checking atom-lsda...Checking atom-lsda...STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
discrepancy in number of scf iterations detected
Reference: 7, You got: 11
Checking atom-pbe...discrepancy in number of scf iterations detected
Reference: 7, You got: 11
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...STOP 2
FAILED with error condition!
Input: atom-sigmapbe.in, Output: atom-sigmapbe.out, Reference: atom-sigmapbe.ref
Aborting
discrepancy in number of scf iterations detected
Reference: 7, You got: 25
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...discrepancy in total energy detected
Reference: -31.491047, You got: 0.000000
discrepancy in number of scf iterations detected
Reference: 16, You got:
discrepancy in pressure detected
Reference: -15.02, You got:
Checking berry...passed
Checking berry, step 2 ...discrepancy in number of scf iterations detected
Reference: 16, You got: 34
discrepancy in pressure detected
Reference: -15.02, You got: -15.11
Checking berry...STOP 2
FAILED with error condition!
Input: berry.in2, Output: berry.out2, Reference: berry.ref2
Aborting
passed
Checking berry, step 2 ...STOP 2
FAILED with error condition!
Input: berry.in2, Output: berry.out2, Reference: berry.ref2
Aborting
discrepancy in number of scf iterations detected
Reference: 16, You got: 32
discrepancy in pressure detected
Reference: -15.02, You got: -14.98
Checking berry...passed
Checking berry, step 2 ...
<output chopped/incomplete>
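
Editor's sketch (not part of the original post): the ten interleaved "Checking atom..." lines suggest that mpiexec started ten independent copies of the check-pw.x.j shell script, all writing to the same NFS directory, rather than one ten-rank MPI job. A possible alternative, offered only as an assumption, is to run the script once and let it wrap the parallel launcher around the executable through PARA_PREFIX. The snippet below sums the slots in a "host:slots" machinefile like the one quoted above and builds such a prefix. The temporary file and the variable names machinefile and total_slots are hypothetical, and whether check-pw.x.j reads PARA_PREFIX from examples/environment_variables or defines its own is not confirmed here.

```shell
#!/bin/sh
# Recreate the machinefile from the post in a temporary file (illustration only).
machinefile=$(mktemp)
printf '%s\n' 'server3:24' 'server4:24' 'server5:24' > "$machinefile"

# Sum the per-host slot counts (the number after the colon on each line).
total_slots=$(awk -F: '{ n += $2 } END { print n }' "$machinefile")
echo "total slots: $total_slots"

# Assumption: the launcher should wrap the MPI executable, not the test script.
# PARA_PREFIX is the variable name mentioned in the post; the value is a guess.
PARA_PREFIX="mpiexec -f $HOME/mpiMachinefile.txt -n $total_slots"
echo "PARA_PREFIX=\"$PARA_PREFIX\""

rm -f "$machinefile"
```

With a prefix like this in place, the test script would be started once from the tests directory (plain ./check-pw.x.j, no mpiexec in front), so only one process writes files such as atom.tmp.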
