On Mon, 2 Mar 2009, Marci wrote:

MV> Hi Axel,
MV>
MV> > marton,
MV> >
MV> > are you trying to run the postprocessing on your local
MV> > machine or on the IBM machine?
MV>
MV> on the IBM machine. I had bad experiences with postprocessing on a
MV> different machine because of using the iotk package, converting binary
MV> files to text files and back is quite time consuming... (and I hate
MV> ssh-ing gigabytes of files)
just checking. actually, there are ways to make fortran read
IEEE-754 compliant binary floating point numbers on different
endian hardware, but i never checked whether iotk can handle
this as well.

[...]

MV> Unfortunately, the espresso I'm using on BASSI was not compiled by
MV> myself, and now I'm scared of compiling mine because I'm not sure that
MV> it will be able to read the binary that was made with an espresso
MV> probably compiled with different compilers and/or compiler options.

there is a big difference between linux and non-linux machines.
on linux there is a zoo of compilers and math libraries and
there are all kinds of subtle compatibility issues. on AIX or
other "commercial" platforms, this is generally less of an issue,
only that it is not as easy to replace one compiler by another,
in case the system provided compiler is broken.

MV> Yeah, I know... I should have compiled my own version of quantum
MV> espresso before making serious calculations to avoid these
MV> situations.
MV>
MV> So... I made some changes in diropn.f90 in espresso4.0/PW and compiled
MV> my own version of espresso (with this I get the same error) to print
MV> the values below in the case of the big run, honestly I do not really
MV> know much about this cluster, but I'm sure I'm using compiler xl
MV> fortran version 11.1.0.3 and library essl 4.2.0.3.

that is fine.

MV>
MV> recl: 415578000
MV> DIRECT_IO_FACTOR: 8
MV> unf_recl: -970343296

bingo! this is your problem. 8x415578000 = 3324624000 is larger
than 2^31-1 = 2147483647, so unf_recl declared as integer*4
will overflow.
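The arithmetic can be checked with a short sketch. It is written in Python rather than Fortran, and it assumes that the integer*4 overflow wraps around in two's complement, which is consistent with the printed value but is not guaranteed by the Fortran standard. The helper name wrap_int32 is made up for this illustration and is not part of espresso or iotk.

```python
def wrap_int32(value):
    """Return what a 32-bit signed integer would hold after two's-complement
    wraparound (an assumption about how the compiler handles the overflow)."""
    return (value + 2**31) % 2**32 - 2**31

DIRECT_IO_FACTOR = 8
INT32_MAX = 2**31 - 1  # 2147483647, largest integer*4 value

# recl values as reported in this thread
for label, recl in (("BASSI (big run)", 415578000),
                    ("home cluster", 97079200)):
    exact = DIRECT_IO_FACTOR * recl    # the value an integer*8 could hold
    stored = wrap_int32(exact)         # the value an integer*4 ends up with
    print(f"{label}: exact={exact} stored={stored} "
          f"overflow={exact > INT32_MAX}")
```

For the big run the exact product is 3324624000, which wraps to the -970343296 reported above, while the smaller system stays at 776633600 and fits.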
MV> On my home cluster, I used a parallelized espresso-4.0.3 on system
MV> "Intel Xeon E5410 @ 2.33Ghz, 16 GB RAM" with ifort 10.1.015, intel mkl
MV> libraries 10.0.1.014 and openmpi-1.2.6 and with a smaller but similar
MV> system (same pseudos, same cutoff, only gamma point), as I said there
MV> is no "wrong record length" error and I got the following values:
MV>
MV> recl: 97079200
MV> DIRECT_IO_FACTOR: 8
MV> unf_recl: 776633600
MV>
MV> If I'm right... 415578000*8 = 3324624000 which is bigger than the
MV> largest value of a signed 32 bit integer, maybe that causes the
MV> problem?

exactly. the interesting question is now, how to work around
this problem. you could try and declare unf_recl as integer*8
and try to recompile. perhaps, even just removing the test
for negative unf_recl might work, but i doubt it.

good luck,
    axel.

MV> Thanks for your help,
MV> Marton
MV>

--
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
