Hi all,

I am having trouble getting MPICH2 working with Quantum ESPRESSO. It seems to
be partly working, but with errors.

There may be an issue with my environment setup. Some of it is guesswork, as I
couldn't find much documentation on using espresso with mpich2.

server4 is the system the user will submit from. server3 and server5 are extra
nodes on the same subnet. mpich2 was built with gfortran support, and ssh keys
are set up with no message of the day so the nodes can talk to each other
properly.

The espresso directory /usr/local/espresso-4.2.1 is located on server4 and is
NFS-shared out to server3 and server5 in read-write mode. (When I initially had
NFS set to read-only, espresso was unable to write a temp file to the tests
directory, so I made it a read-write NFS export.) This means all 3 systems have
read-write access to the same espresso directory. Is this correct?

Extra info: inside the examples/environment_variables file I have the line
"TMP_DIR=/home/mpiexec_espresso_tmp", but no files were written there (each
system has its own instance of that directory). It seems to just use
/usr/local/espresso-4.2.1/tmp, as I can see files there with today's date. This
is fine with me if it prefers to use the NFS partition. I tried changing
PARA_PREFIX and PARA_POSTFIX inside the variables file, but I only ran into
worse issues.
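
Since each node has its own copy of that TMP_DIR path, a quick per-node check
that the directory exists and is writable may help rule out permissions. This
is only a minimal sketch: check_tmp_dir is a hypothetical helper I made up for
illustration, not something shipped with espresso or mpich2, and it says
nothing about which directory the test scripts actually read their scratch
path from.

```shell
# Hypothetical helper (not part of espresso or mpich2): report whether a
# scratch directory exists and is writable on the node where this runs.
check_tmp_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Try to create and remove a probe file named after this shell's PID.
    probe="$dir/.write_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir on $(uname -n)"
    else
        echo "not writable: $dir"
        return 1
    fi
}

# Check the path from my environment_variables file unless TMP_DIR is set.
check_tmp_dir "${TMP_DIR:-/home/mpiexec_espresso_tmp}" || true
```

Running the same function on server3, server4 and server5 (for example over
ssh) would show whether the directory is usable on every node.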

Here is my test command:

mpiexec -f ~/mpiMachinefile.txt -n 10 -wdir /usr/local/espresso-4.2.1/tests ./check-pw.x.j

Here are the contents of my ~/mpiMachinefile.txt file:
server3:24
server4:24
server5:24
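
To make the host:slots entries above concrete, here is a minimal sketch of how
I would expect ranks to be placed for a given -n value. It assumes mpiexec
fills hosts in the listed order up to each host's slot count; that fill order
is my assumption, and predict_placement is a hypothetical helper written for
illustration, not part of MPICH2. It also ignores what happens when -n exceeds
the 72 slots listed.

```shell
# Same three entries as ~/mpiMachinefile.txt, inlined so the sketch is
# self-contained.
machinefile_text='server3:24
server4:24
server5:24'

# Hypothetical helper: print "host count" for each host that receives ranks,
# assuming hosts are filled in listed order up to their slot count.
predict_placement() {
    # $1 = number of ranks requested (the -n value)
    printf '%s\n' "$machinefile_text" | awk -F: -v n="$1" '
        n > 0 {
            take = (n < $2) ? n : $2   # ranks this host receives
            print $1, take
            n -= take                  # ranks still to be placed
        }'
}

predict_placement 10
predict_placement 30
```

Under that assumption, -n 10 would put all 10 ranks on server3, and -n 30
would put 24 on server3 and 6 on server4.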

I have also tried -n 30, and it seemed to put most of the processes on the
first server in the list, as expected. However, I never saw 10 or more
processes each using 100% of a core in 'top'. When there were many processes at
once, they used only a small percentage of resources, with one process using
100% of a core/cpu. From my understanding, there should be 24 different
processes per machine at 100%, depending on what they are doing.

In summary: the errors are troubling, and I don't think the systems are using
their full potential for simulating.

Below is the first part of the command-line output from mpiexec using the
-n 10 option.
Any advice is appreciated. Thanks,
Nick - Linux Administrator

$ mpiexec -f ~/mpiMachinefile.txt -n 10 -wdir /usr/local/espresso-4.2.1/tests ./check-pw.x.j
Checking atom...Checking atom...Checking atom...Checking atom...Checking 
atom...Checking atom...Checking atom...Checking atom...Checking atom...Checking 
atom...passed
Checking atom-lsda...passed
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
Checking atom-lsda...passed
passed
awk: cmd. line:6: fatal: cannot open file `atom.tmp' for reading (No such file 
or directory)
/bin/rm: cannot remove `atom.tmp': No such file or directory
Checking atom-lsda...Checking atom-lsda...STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
STOP 2
FAILED with error condition!
Input: atom-lsda.in, Output: atom-lsda.out, Reference: atom-lsda.ref
Aborting
discrepancy in number of scf iterations detected
Reference: 7, You got: 11
Checking atom-pbe...discrepancy in number of scf iterations detected
Reference: 7, You got: 11
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...STOP 2
FAILED with error condition!
Input: atom-sigmapbe.in, Output: atom-sigmapbe.out, Reference: atom-sigmapbe.ref
Aborting
discrepancy in number of scf iterations detected
Reference: 7, You got: 25
Checking atom-pbe...discrepancy in pressure detected
Reference: -14.44, You got: -14.43
Checking atom-sigmapbe...discrepancy in total energy detected
Reference:   -31.491047, You got:     0.000000
discrepancy in number of scf iterations detected
Reference: 16, You got: 
discrepancy in pressure detected
Reference: -15.02, You got: 
Checking berry...passed
Checking berry, step 2 ...discrepancy in number of scf iterations detected
Reference: 16, You got: 34
discrepancy in pressure detected
Reference: -15.02, You got: -15.11
Checking berry...STOP 2
FAILED with error condition!
Input: berry.in2, Output: berry.out2, Reference: berry.ref2
Aborting
passed
Checking berry, step 2 ...STOP 2
FAILED with error condition!
Input: berry.in2, Output: berry.out2, Reference: berry.ref2
Aborting
discrepancy in number of scf iterations detected
Reference: 16, You got: 32
discrepancy in pressure detected
Reference: -15.02, You got: -14.98
Checking berry...passed
Checking berry, step 2 ...

<output chopped/incomplete> end.
