[Wien] Problems in parallel jobs
This is not an error, just something untidy in the code (which should be cleaned up). The compilation script does a "grep -e Error" on a file compile.msg which sometimes (depending upon what C flags are used) gives a warning for the two lines you mentioned. The grep picks up the material within the "/*" and "*/", which is a C inline comment, and prints it out in a misleading fashion. N.B., if you need to fix this just remove "Error" from SRC_vecpratt/W2kutils.c so the relevant line just reads signal ( SIGBUS, w2ksignal_bus ); /* Bus */ On Tue, Jul 27, 2010 at 5:41 AM, bothina hamad wrote: > Dear Laurence, > Thank you for assisting us in our problems. We are still facing > problems; I know it is not related to the code, but we still need help > from experts in parallel compilation. > > The code works just fine on a cluster where we have compiled it using the > extra '-assu buff' flag to alleviate NFS problems. > However, on another cluster that has a GPFS parallel file system and does not > use NFS the only problem that we encounter is the following compile issue: > > > > Compile time errors (if any) were: > SRC_vecpratt/compile.msg: signal ( SIGBUS, w2ksignal_bus ); /* Bus > Error */ > SRC_vecpratt/compile.msg: signal ( SIGBUS, w2ksignal_bus ); /* Bus > Error */ > > > Check file compile.msg in the corresponding SRC_* directory for the > compilation log and more info on any compilation problem. > > > I'm not sure what the vecpratt part of the code actually does. > > When I look in the compile.msg file there are just warnings, not errors, and > looking in the SRC_vecpratt directory I see that the executables are actually > built. > > Where is vecpratt used by the program? 
> > > Thanks in advance > Bothina > > > --- On Wed, 7/21/10, Laurence Marks wrote: > >> From: Laurence Marks >> Subject: Re: [Wien] Problems in parallel jobs >> To: "A Mailing list for WIEN2k users" >> Date: Wednesday, July 21, 2010, 3:03 PM >> Also: >> >> 5) Use ldd of $WIENROOT/lapw1_mpi on different nodes to >> check that >> these are correct, and also "which mpirun". >> >> On Wed, Jul 21, 2010 at 6:47 AM, Laurence Marks >> >> wrote: >> > Hard to know for certain, but this looks like an OS >> problem rather than >> > a Wien2k issue. Things to check: >> > >> > 1) Use ompi_info and check, carefully, that the >> compilation >> > options/libraries used for openmpi are the same as >> what you are using >> > to compile Wien2k. >> > >> > 2) For 10.1, ulimit -s is not needed for mpi (and in >> any case does >> > nothing with openmpi) as this is done in software in >> Wien2kutils.c. >> > Make sure that you are exporting environment >> variables in your >> > mpirun call, for instance use in parallel_options >> > setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH >> -np _NP_ >> > -machinefile _HOSTS_ _EXEC_" >> > >> > 3) Check the size of the job you are running, e.g. via >> top, by looking >> > in case.output1_X, by using "lapw1 -p -nmat_only", >> using ganglia or >> > nmon, cat /proc/meminfo (or anything else you have >> available). >> > Particularly with openmpi but with some other flavors >> as well, if you >> > are asking for too much memory and/or have too many >> processes running, >> > problems occur. A race condition can also occur in >> openmpi which makes >> > this problem worse (maybe patched in latest version, I >> am not sure). >> > >> > 4) Check, carefully (twice) for format errors in the >> input files. It >> > turns out that ifort has its own signal traps so a >> child can exit >> > without correctly calling mpi_abort. 
A race condition >> can occur with >> > openmpi when the parent is trying to find a child, the >> child does not >> exist, the parent waits, then keeps going. >> > >> > 5) Check the OS logs in /var/log (beyond my >> competence). You may have >> > too high an nfs load, bad infiniband/myrinet (recent >> OFED?) etc. Use >> > -assu buff in compilation options to reduce nfs load. >> > >> > On Wed, Jul 21, 2010 at 3:53 AM, bothina hamad >> wrote: >> >> Dear Wien users, >> >> >> >> When running optimisation jobs under torque >> queuing system for anything but >> >> very small
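The grep behaviour described above can be reproduced in isolation. This is a minimal sketch: the demo file `/tmp/w2k_demo.c` is illustrative, not part of the WIEN2k sources, and in the real W2kutils.c the comment may wrap across two lines, so editing by hand is the safest fix.

```shell
# Recreate the line that triggers the spurious report. grep matches
# the word "Error" inside a C comment, not a real compiler diagnostic.
cat > /tmp/w2k_demo.c <<'EOF'
    signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
EOF

# This is (roughly) what the compile script does, and why the line shows up:
grep -e Error /tmp/w2k_demo.c

# The suggested fix: drop the word "Error" from the comment so grep no
# longer matches. (The sed invocation is illustrative; editing the file
# by hand works just as well.)
sed -i 's|/\* Bus Error \*/|/* Bus */|' /tmp/w2k_demo.c
grep -e Error /tmp/w2k_demo.c || echo "no spurious match after the fix"
```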
[Wien] Problems in parallel jobs
Dear Laurence, Thank you for assisting us in our problems. We are still facing problems; I know it is not related to the code, but we still need help from experts in parallel compilation. The code works just fine on a cluster where we have compiled it using the extra '-assu buff' flag to alleviate NFS problems. However, on another cluster that has a GPFS parallel file system and does not use NFS the only problem that we encounter is the following compile issue: Compile time errors (if any) were: SRC_vecpratt/compile.msg: signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */ SRC_vecpratt/compile.msg: signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */ Check file compile.msg in the corresponding SRC_* directory for the compilation log and more info on any compilation problem. I'm not sure what the vecpratt part of the code actually does. When I look in the compile.msg file there are just warnings, not errors, and looking in the SRC_vecpratt directory I see that the executables are actually built. Where is vecpratt used by the program? Thanks in advance Bothina --- On Wed, 7/21/10, Laurence Marks wrote: > From: Laurence Marks > Subject: Re: [Wien] Problems in parallel jobs > To: "A Mailing list for WIEN2k users" > Date: Wednesday, July 21, 2010, 3:03 PM > Also: > > 5) Use ldd of $WIENROOT/lapw1_mpi on different nodes to > check that > these are correct, and also "which mpirun". > > On Wed, Jul 21, 2010 at 6:47 AM, Laurence Marks > > wrote: > > Hard to know for certain, but this looks like an OS > problem rather than > > a Wien2k issue. Things to check: > > > > 1) Use ompi_info and check, carefully, that the > compilation > > options/libraries used for openmpi are the same as > what you are using > > to compile Wien2k. > > > > 2) For 10.1, ulimit -s is not needed for mpi (and in > any case does > > nothing with openmpi) as this is done in software in > Wien2kutils.c. 
> > Make sure that you are exporting environment > variables in your > > mpirun call, for instance use in parallel_options > > setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH > -np _NP_ > > -machinefile _HOSTS_ _EXEC_" > > > > 3) Check the size of the job you are running, e.g. via > top, by looking > > in case.output1_X, by using "lapw1 -p -nmat_only", > using ganglia or > > nmon, cat /proc/meminfo (or anything else you have > available). > > Particularly with openmpi but with some other flavors > as well, if you > > are asking for too much memory and/or have too many > processes running, > > problems occur. A race condition can also occur in > openmpi which makes > > this problem worse (maybe patched in latest version, I > am not sure). > > > > 4) Check, carefully (twice) for format errors in the > input files. It > > turns out that ifort has its own signal traps so a > child can exit > > without correctly calling mpi_abort. A race condition > can occur with > > openmpi when the parent is trying to find a child, the > child does not > > exist, the parent waits, then keeps going. > > > > 5) Check the OS logs in /var/log (beyond my > competence). You may have > > too high an nfs load, bad infiniband/myrinet (recent > OFED?) etc. Use > > -assu buff in compilation options to reduce nfs load. > > > > On Wed, Jul 21, 2010 at 3:53 AM, bothina hamad > wrote: > >> Dear Wien users, > >> > >> When running optimisation jobs under torque > queuing system for anything but > >> very small systems: > >> > >> Job runs for many cycles using lapw0, lapw1, lapw2 > (parallel) successfully but eventually the 'mom-superior' > node (that launches mpirun) becomes non-communicating with > the other nodes involved with the job. > >> > >> At the console of this node there is correct load > (4 for quad processor) and memory free... but can no longer > access any nfs mounts, can no longer ping other nodes in > cluster... 
am eventually forced to reboot node and kill job > from cluster queuing system (job enters 'E' state and stays > there... need to stop pbs_server and manually remove > jobfiles from /var/spool/torque/server_priv/jobs... then > restart pbs_server) > >> > >> A similar problem is encountered on larger cluster > (same install procedure) but with the added problem that the > .dayfile reports that for lapw2 only the 'mom-superior' node > is reporting doing work (even though logging into other job > nodes top reports correct load and 100% CPU use). > >>
[Wien] Problems in parallel jobs
Also: 5) Use ldd of $WIENROOT/lapw1_mpi on different nodes to check that these are correct, and also "which mpirun". On Wed, Jul 21, 2010 at 6:47 AM, Laurence Marks wrote: > Hard to know for certain, but this looks like an OS problem rather than > a Wien2k issue. Things to check: > > 1) Use ompi_info and check, carefully, that the compilation > options/libraries used for openmpi are the same as what you are using > to compile Wien2k. > > 2) For 10.1, ulimit -s is not needed for mpi (and in any case does > nothing with openmpi) as this is done in software in Wien2kutils.c. > Make sure that you are exporting environment variables in your > mpirun call, for instance use in parallel_options > setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ > -machinefile _HOSTS_ _EXEC_" > > 3) Check the size of the job you are running, e.g. via top, by looking > in case.output1_X, by using "lapw1 -p -nmat_only", using ganglia or > nmon, cat /proc/meminfo (or anything else you have available). > Particularly with openmpi but with some other flavors as well, if you > are asking for too much memory and/or have too many processes running, > problems occur. A race condition can also occur in openmpi which makes > this problem worse (maybe patched in latest version, I am not sure). > > 4) Check, carefully (twice) for format errors in the input files. It > turns out that ifort has its own signal traps so a child can exit > without correctly calling mpi_abort. A race condition can occur with > openmpi when the parent is trying to find a child, the child does not > exist, the parent waits, then keeps going. > > 5) Check the OS logs in /var/log (beyond my competence). You may have > too high an nfs load, bad infiniband/myrinet (recent OFED?) etc. Use > -assu buff in compilation options to reduce nfs load. 
> > On Wed, Jul 21, 2010 at 3:53 AM, bothina hamad > wrote: >> Dear Wien users, >> >> When running optimisation jobs under torque queuing system for anything but >> very small systems: >> >> Job runs for many cycles using lapw0, lapw1, lapw2 (parallel) successfully >> but eventually the 'mom-superior' node (that launches mpirun) becomes >> non-communicating with the other nodes involved with the job. >> >> At the console of this node there is correct load (4 for quad processor) and >> memory free... but can no longer access any nfs mounts, can no longer ping >> other nodes in cluster... am eventually forced to reboot node and kill job >> from cluster queuing system (job enters 'E' state and stays there... need to >> stop pbs_server and manually remove jobfiles from >> /var/spool/torque/server_priv/jobs... then restart pbs_server) >> >> A similar problem is encountered on larger cluster (same install procedure) >> but with the added problem that the .dayfile reports that for lapw2 only the >> 'mom-superior' node is reporting doing work (even though logging into other >> job nodes top reports correct load and 100% CPU use). >> >> DOS calculation seems to work properly on both clusters... >> >> We have used a modified x_lapw that you provided earlier. >> We have been inserting 'ulimit -s unlimited' into job-scripts >> >> We are using... >> Centos5.3 x86_64 >> Intel compiler suite with mkl v11.1/072 >> openmpi-1.4.2, compiled with intel compilers >> fftw-2.1.5, compiled with intel compilers and openmpi above >> Wien2k v10.1 >> >> Optimisation jobs for small systems complete OK on both clusters. >> >> The working directories for this job are large (>2GB). >> >> Please let us know what >> files we could send you from these that may be helpful for diagnosis... 
>> >> Best regards >> Bothina >> >> >> >> ___ >> Wien mailing list >> Wien at zeus.theochem.tuwien.ac.at >> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien >> > > > > -- > Laurence Marks > Department of Materials Science and Engineering > MSE Rm 2036 Cook Hall > 2220 N Campus Drive > Northwestern University > Evanston, IL 60208, USA > Tel: (847) 491-3996 Fax: (847) 491-7820 > email: L-marks at northwestern dot edu > Web: www.numis.northwestern.edu > Chair, Commission on Electron Crystallography of IUCR > www.numis.northwestern.edu/ > Electron crystallography is the branch of science that uses electron > scattering and imaging to study the structure of matter. > -- Laurence Marks Department of Materials Science and Engineering MSE Rm 2036 Cook Hall 2220 N Campus Drive Northwestern University Evanston, IL 60208, USA Tel: (847) 491-3996 Fax: (847) 491-7820 email: L-marks at northwestern dot edu Web: www.numis.northwestern.edu Chair, Commission on Electron Crystallography of IUCR www.numis.northwestern.edu/ Electron crystallography is the branch of science that uses electron scattering and imaging to study the structure of matter.
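Point 2) in the checklist above can be written out as a complete parallel_options fragment. This is a sketch in csh syntax (which WIEN2k's scripts use); the `-x` flags ask OpenMPI's mpirun to export those variables to every remote rank, and `_NP_`, `_HOSTS_` and `_EXEC_` are placeholders that the WIEN2k parallel scripts substitute at run time, exactly as given in the message.

```shell
# parallel_options fragment (csh syntax, sourced by WIEN2k's parallel
# scripts). -x exports LD_LIBRARY_PATH and PATH to the remote ranks so
# they find the same MPI and MKL libraries as the launching node.
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
```

Without the `-x` exports, remote ranks started over a bare rsh/ssh session can pick up a different LD_LIBRARY_PATH and silently link the wrong MPI, which matches the symptoms discussed in this thread.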
[Wien] Problems in parallel jobs
Hard to know for certain, but this looks like an OS problem rather than a Wien2k issue. Things to check: 1) Use ompi_info and check, carefully, that the compilation options/libraries used for openmpi are the same as what you are using to compile Wien2k. 2) For 10.1, ulimit -s is not needed for mpi (and in any case does nothing with openmpi) as this is done in software in Wien2kutils.c. Make sure that you are exporting environment variables in your mpirun call, for instance use in parallel_options setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_" 3) Check the size of the job you are running, e.g. via top, by looking in case.output1_X, by using "lapw1 -p -nmat_only", using ganglia or nmon, cat /proc/meminfo (or anything else you have available). Particularly with openmpi but with some other flavors as well, if you are asking for too much memory and/or have too many processes running, problems occur. A race condition can also occur in openmpi which makes this problem worse (maybe patched in latest version, I am not sure). 4) Check, carefully (twice) for format errors in the input files. It turns out that ifort has its own signal traps so a child can exit without correctly calling mpi_abort. A race condition can occur with openmpi when the parent is trying to find a child, the child does not exist, the parent waits, then keeps going. 5) Check the OS logs in /var/log (beyond my competence). You may have too high an nfs load, bad infiniband/myrinet (recent OFED?) etc. Use -assu buff in compilation options to reduce nfs load. On Wed, Jul 21, 2010 at 3:53 AM, bothina hamad wrote: > Dear Wien users, > > When running optimisation jobs under torque queuing system for anything but > very small systems: > > Job runs for many cycles using lapw0, lapw1, lapw2 (parallel) successfully > but eventually the 'mom-superior' node (that launches mpirun) becomes > non-communicating with the other nodes involved with the job. 
> > At the console of this node there is correct load (4 for quad processor) and > memory free... but can no longer access any nfs mounts, can no longer ping > other nodes in cluster... am eventually forced to reboot node and kill job > from cluster queuing system (job enters 'E' state and stays there... need to > stop pbs_server and manually remove jobfiles from > /var/spool/torque/server_priv/jobs... then restart pbs_server) > > A similar problem is encountered on larger cluster (same install procedure) > but with the added problem that the .dayfile reports that for lapw2 only the > 'mom-superior' node is reporting doing work (even though logging into other > job nodes top reports correct load and 100% CPU use). > > DOS calculation seems to work properly on both clusters... > > We have used a modified x_lapw that you provided earlier. > We have been inserting 'ulimit -s unlimited' into job-scripts > > We are using... > Centos5.3 x86_64 > Intel compiler suite with mkl v11.1/072 > openmpi-1.4.2, compiled with intel compilers > fftw-2.1.5, compiled with intel compilers and openmpi above > Wien2k v10.1 > > Optimisation jobs for small systems complete OK on both clusters. > > The working directories for this job are large (>2GB). > > Please let us know what > files we could send you from these that may be helpful for diagnosis... > > Best regards > Bothina > > > > ___ > Wien mailing list > Wien at zeus.theochem.tuwien.ac.at > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien > -- Laurence Marks Department of Materials Science and Engineering MSE Rm 2036 Cook Hall 2220 N Campus Drive Northwestern University Evanston, IL 60208, USA Tel: (847) 491-3996 Fax: (847) 491-7820 email: L-marks at northwestern dot edu Web: www.numis.northwestern.edu Chair, Commission on Electron Crystallography of IUCR www.numis.northwestern.edu/ Electron crystallography is the branch of science that uses electron scattering and imaging to study the structure of matter.
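Point 3) of the checklist above (check job size, free memory, and process count on each node) can be sketched as a quick per-node shell check. This is a sketch: the /proc path is standard Linux, but the process-name pattern and what counts as "too many" ranks per node are up to the reader.

```shell
# Per-node memory sanity check before launching a large parallel run.

# Free memory on this node, in kB, read from /proc/meminfo:
awk '/^MemFree:/ {print "MemFree kB:", $2}' /proc/meminfo

# Count WIEN2k-related processes already running here; too many
# concurrent lapw1/lapw2 ranks on one node is a common cause of the
# memory problems described in this thread. The '[l]apw' trick stops
# grep from matching its own command line.
echo "lapw processes: $(ps -e | grep -c '[l]apw')"
```

Comparing the MemFree figure against the matrix size reported by "lapw1 -p -nmat_only" (as suggested above) tells you whether the node can hold all its ranks at once.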
[Wien] Problems in parallel jobs
Dear Wien users, When running optimisation jobs under torque queuing system for anything but very small systems: Job runs for many cycles using lapw0, lapw1, lapw2 (parallel) successfully but eventually the 'mom-superior' node (that launches mpirun) becomes non-communicating with the other nodes involved with the job. At the console of this node there is correct load (4 for quad processor) and memory free... but can no longer access any nfs mounts, can no longer ping other nodes in cluster... am eventually forced to reboot node and kill job from cluster queuing system (job enters 'E' state and stays there... need to stop pbs_server and manually remove jobfiles from /var/spool/torque/server_priv/jobs... then restart pbs_server) A similar problem is encountered on larger cluster (same install procedure) but with the added problem that the .dayfile reports that for lapw2 only the 'mom-superior' node is reporting doing work (even though logging into other job nodes top reports correct load and 100% CPU use). DOS calculation seems to work properly on both clusters... We have used a modified x_lapw that you provided earlier. We have been inserting 'ulimit -s unlimited' into job-scripts We are using... Centos5.3 x86_64 Intel compiler suite with mkl v11.1/072 openmpi-1.4.2, compiled with intel compilers fftw-2.1.5, compiled with intel compilers and openmpi above Wien2k v10.1 Optimisation jobs for small systems complete OK on both clusters. The working directories for this job are large (>2GB). Please let us know what files we could send you from these that may be helpful for diagnosis... Best regards Bothina
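The manual recovery for the stuck 'E'-state job described above (stop pbs_server, remove the stale job files from the server spool, restart) can be wrapped as a small sketch. Everything here is illustrative, not an official Torque procedure: the spool path should be checked against your own installation, the `service` invocations assume a SysV-style init as on CentOS 5, and it must be run as root on the Torque server node.

```shell
# Sketch of the stuck-job cleanup from the message above. The job id
# argument (e.g. "1234.headnode") and the spool path are illustrative;
# verify them on your own server before deleting anything.
cleanup_stuck_job() {
    jobid="$1"
    qstat "$jobid" || true                               # confirm the job is stuck in 'E'
    service pbs_server stop                              # stop the Torque server daemon
    rm -f /var/spool/torque/server_priv/jobs/"$jobid".*  # remove the stale job files
    service pbs_server start                             # restart the server
}
```

A typical call after rebooting the hung node would be `cleanup_stuck_job 1234.headnode` (the job id is hypothetical); defining the function first keeps the destructive steps from running by accident while you check the paths.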