Dear Axel, I have some questions on this topic, too. Suppose we use a GTX card with poor double precision (DP) floating point performance for QE-GPU. Does nVIDIA's GPU technology (CUDA and so on) change the DP code of QE to SP automatically, or does the program end with an error? If it changes DP to SP automatically, users might get wrong results without noticing.
How about running a fully DP code on an SP-only GPU card? Does the code run on it at all? Regards, David Foster, Ph.D. Student of Chemistry

--------------------------------------------
On Sat, 6/21/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:

Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
To: "PWSCF Forum" <pw_forum at pwscf.org>
Date: Saturday, June 21, 2014, 2:20 AM

On Sat, Jun 21, 2014 at 4:20 AM, Reza Behjatmanesh-Ardakani <reza_b_m_a at yahoo.com> wrote:
> Dear Axel
> This was just a proposal. If I am right, the Terachem code can use gaming cards for GPU calculations (I saw this in some of its authors' papers).

yes, but terachem was written from the ground up with new algorithms to avoid loss of precision. in quantum mechanics this is important, since a lot of calculations depend on comparing large numbers of equal sign and magnitude and looking at the difference. about the only part of a plane wave DFT calculation that is "conservative" in terms of precision without a massive redesign are the FFTs. the loss of precision is fairly small when replacing double precision FFTs with single precision ones. for the many 3d-FFTs required, this is particularly beneficial when trying to scale out via MPI, as it cuts the number of bytes that need to be sent and copied around in half and also reduces the strain on memory bandwidth.

> As you know, the main problems with GTX cards come down to two important things: one, single precision, and the other, lack of ECC.

ECC is a lesser issue. and the problem is not single precision as such, but poor performance with double precision due to having only a fraction of the double precision units. another issue is the lack of RAM. also, you have to distinguish between different GTX cards: a few of the most high-end consumer cards *do* have the full set of double precision units and a large amount of RAM. ECC is mostly relevant for people running a large number of GPUs in a supercomputer environment.
> > It is not necessary to write a stand-alone code. We can test QE-GPU with both TESLA and/or GTX cards and QE (CPU only), and compare the outputs.

but it is pointless to run on hardware that is not competitive. you'll already have a hard time getting a 2x speedup from using a top level tesla card vs. an all-CPU run on a decent machine. what would be the point of having the GPU _decelerate_ your calculation? in general, a lot of the GPU stuff is hype and misinformation. the following is a bit old, but still worth a read: http://www.hpcwire.com/2011/12/13/ten_ways_to_fool_the_masses_when_giving_performance_results_on_gpus/

as a consequence of a very smart and successful PR strategy, there is now the impression that *any* kind of GPU will result in a *massive* speedup. even people with a laptop GPU with 2 SMs and no memory bandwidth are now expecting 100x speedups and more. however, except for a few corner cases and applications that map very well onto GPUs (not very complex) and badly onto a CPU, you will often get more like a 2x-5x speedup in a "best effort" comparison of a well equipped host with a high-end GPU. in part, this situation has become worse with some choices made by nvidia hardware and software engineers: while 5 years back the difference between a consumer and a computing GPU was small, the consumer models have been systematically "downgraded" (by removing previously supported management features in the driver and basing consumer cards on a simplified design that mostly makes them mid-level GPUs).

> I tested it for only one case (a rutile 3*3*2 supercell), and saw that the GTX output is similar to the CPU one.
>
> However, it needs to be tested for different cases and different clusters to be sure that the lack of ECC and double precision has no effect on the results.

sorry, this statement doesn't make any sense.
it looks to me like you need to spend some time learning what the technical implications of ECC and single-vs-double precision are (and the fact that it is the software that chooses which precision is used, not the hardware). the same goes for whether a card has ECC or not: broken memory is broken memory, and if it works, it works. so there is not much to test. if you want to find out whether your GPU has broken or borderline memory, run the GPU memtest; it is much more effective at finding issues than any other application. where ECC helps is for very long running calculations, or calculations across a very large number of GPUs, where a single bitflip can render the entire effort useless and result in a crash. in a dense cluster environment or in badly cooled desktops, this is a high risk. in a well set up machine, it is less of a risk, but you have to keep in mind that running without ECC makes you "blind" to those errors. i run a cluster with a pile of Tesla GPUs and we have disabled ECC, since the machines run very reliably thanks to some hacking around restrictions that nvidia engineers placed in their drivers: https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness

we also run consumer level GPUs, particularly in the login nodes, since they work fine for development and don't cost as outrageously much as the tesla models. however, for development, absolute performance is a lesser concern.

> As Filippo said formerly for GTX cards, the output may not be reproducible. However, I think that due to the nature of the SCF algorithm, the code can be used at least

when you have memory corruption due to bad/overheated memory, no SCF algorithm will save you. if you go back 10 years, when CPUs didn't have all those power management and automatic self-protection features and memory modules in desktops were often of very low quality, people experienced a lot of problems.
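A plain-Python sketch (not GPU code; the "energy" value is made up) of the failure mode ECC exists to catch: a single undetected bitflip in a double, and the way a resulting NaN then infects every reduction it touches:

```python
# Sketch: flip one bit of an IEEE-754 binary64 value, as an undetected
# memory error would, and watch the value change by hundreds of orders
# of magnitude.  A NaN produced the same way then poisons whole sums.
import math
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 binary64 representation flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped

energy = -155.776                  # some accumulated value (made-up number)
corrupted = flip_bit(energy, 62)   # flip the top exponent bit
print(corrupted)                   # off by roughly 300 orders of magnitude

# once corruption produces a NaN, it spreads through every later operation
total = sum([1.0, 2.0, float("nan"), 4.0])
print(math.isnan(total))           # True: the whole sum is invalid
```

Flipping bit 62 (the most significant exponent bit) turns a value near 2**7 into one near 2**-1017, which is why undetected bitflips in long runs tend to show up as wildly wrong numbers or NaNs rather than subtle drift.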
"signal 11" and "segmentation fault" were a common topic on many mailing lists for scientific (or other) software that puts a high load on the CPU. but the indication of broken memory was usually a crash due to a segfault, or bad data corruption leading to a massive change in the numbers and often NaNs. once you have a single NaN in your data, it will spread like a highly infectious virus and render the calculation invalid. a well set up consumer level GPU will run as reliably as a tesla or better, only you cannot tell, since the nvidia tools will not show you. the main issue is performance and available memory.

> for VC-RELAX, RELAX, and SCF types of calculations with GTX cards. Of course, it should be tested. Thank you for your interest.

you are not making much sense here either. but if it makes you feel better to do those tests, don't let me discourage you. sometimes people learn best this way.

axel.

> With the Best Regards
> Reza Behjatmanesh-Ardakani
> Associate Professor of Physical Chemistry
> Address: Department of Chemistry, School of Science, Payame Noor University (PNU), Ardakan, Yazd, Iran.
> E-mails: 1- reza_b_m_a at yahoo.com (preferred), 2- behjatmanesh at pnu.ac.ir, 3- reza.b.m.a at gmail.com.
>
> --------------------------------------------
> On Fri, 6/20/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
>
> Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
> To: "PWSCF Forum" <pw_forum at pwscf.org>
> Date: Friday, June 20, 2014, 2:19 PM
>
> On Fri, Jun 20, 2014 at 4:22 AM, Reza Behjatmanesh-Ardakani <reza_b_m_a at yahoo.com> wrote:
> > Dear Filippo
> >
> > Due to the nature of QE, which is iterative, I think the lack of ECC and even of double precision floating point in gaming cards (GTX) compared to tesla cards
> >
> > is not a serious problem for QE-GPU. Some authors have
checked this for the AMBER molecular dynamics simulation code; see the following site.

classical MD is a very different animal than what you do with QE. errors in some properties due to single precision are huge with all-single-precision calculations. computing a force from a distance will not be much affected, but summing up the forces can already be a problem. "good" classical MD codes usually employ a mixed precision approach, where only the accuracy-insensitive parts are done in single precision. for very large systems, even double precision can show significant floating point truncation errors. usually you are dependent on error cancellation, too, i.e. when you study a simple homogeneous system (as is quite common in those tests).

> http://www.hpcwire.com/2014/03/13/ecc-performance-price-worth-gpus
>
> and see the following paper:
>
> www.rosswalker.co.uk/papers/2014_03_ECC_AMBER_Paper_10.1002_cpe.3232.pdf
>
> I encourage the users of QE-GPU to test it for QE, and to report the differences on the site.

it is a waste of time and effort. people have done DFT and HF in (partial) single precision before, and you will only succeed if you write a new code from scratch and have an extremely skilled programmer. have a look at the terachem software out of the group of todd martinez, for example.

axel.

> PS: to be able to compare the results for GTX and TESLA, the QE-GPU code needs to run on GTX :-)
>
> With the Best Regards
> Reza Behjatmanesh-Ardakani
> Associate Professor of Physical Chemistry
> Payame Noor University (PNU), Ardakan, Yazd, Iran.
> E-mails: 1- reza_b_m_a at yahoo.com (preferred), 2- behjatmanesh at pnu.ac.ir,
> 3- reza.b.m.a at gmail.com.

--
Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.

_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum
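As a footnote to the mixed precision point Axel makes above (individual force terms tolerate single precision, but long sums do not), here is a minimal NumPy sketch; the numbers are illustrative, not from AMBER or QE:

```python
# Sketch of why a long sequential sum needs a higher-precision
# accumulator: summing 20 million 1.0's in a float32 accumulator stalls
# at 2**24 = 16777216, because beyond that point 1.0 is smaller than the
# spacing between representable float32 values, so adding it does nothing.
import numpy as np

n = 20_000_000
terms = np.ones(n, dtype=np.float32)         # SP "force contributions"

# sequential accumulation in float32 (np.cumsum is a sequential scan)
total_sp = float(np.cumsum(terms, dtype=np.float32)[-1])

# mixed precision: SP terms, DP accumulator
total_dp = float(terms.sum(dtype=np.float64))

print(total_sp)  # 16777216.0  (= 2**24, wrong by ~16%)
print(total_dp)  # 20000000.0
```

This is the same reason a per-pair force can safely be computed in single precision while the total force on an atom, summed over many neighbors, wants a double precision accumulator.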