On Sun, Jun 22, 2014 at 3:12 AM, Reza Behjatmanesh-Ardakani <reza_b_m_a at yahoo.com> wrote:
> Dear Axel
> Thank you. It was very helpful for me.
> As you said, some new GTX cards have good DP floating-point performance,
> such as the GTX Titan Black or GTX Titan Z, for both of which DP is 1/3
> of SP.
> They are much cheaper than Tesla cards.
> I am not sure whether the Titan Black or Titan Z has ECC.
no, neither card has ECC.

> Quadro K6000 has it.

well, the quadro is practically a tesla with all graphics features
enabled. ...at a price.

> Thanks again.
>
> With the Best Regards
>
> Reza Behjatmanesh-Ardakani
> Associate Professor of Physical Chemistry
> Address:
> Department of Chemistry,
> School of Science,
> Payame Noor University (PNU),
> Ardakan,
> Yazd,
> Iran.
> E-mails:
> 1- reza_b_m_a at yahoo.com (preferred),
> 2- behjatmanesh at pnu.ac.ir,
> 3- reza.b.m.a at gmail.com.
>
> --------------------------------------------
> On Sat, 6/21/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
>
> Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on
> CPU-GPU (pw-gpu.x)
> To: "PWSCF Forum" <pw_forum at pwscf.org>
> Date: Saturday, June 21, 2014, 1:50 PM
>
> On Sat, Jun 21, 2014 at 4:20 AM, Reza Behjatmanesh-Ardakani
> <reza_b_m_a at yahoo.com> wrote:
> > Dear Axel
> > This was just a proposal. If I am right, the Terachem code can use
> > gaming cards for GPU calculations (I saw some of its authors'
> > papers).
>
> yes, but terachem was written from the ground up with new algorithms
> to avoid loss of precision. in quantum mechanics this is important,
> since a lot of calculations depend on comparing large numbers of equal
> sign and magnitude and looking at the difference. about the only part
> of a plane-wave DFT calculation that is "conservative" in terms of
> precision without a massive redesign are the FFTs. the loss of
> precision is fairly small when replacing double-precision FFTs with
> single-precision ones. for the many 3d-FFTs required, this is
> particularly beneficial when trying to scale out via MPI, as it cuts
> the number of bytes that need to be sent and copied around in half and
> also reduces the strain on memory bandwidth.
>
> > As you know, the main problem of GTX cards comes down to two
> > important things: one, single precision, and the other, lack of ECC.
>
> ECC is a lesser issue.
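the size of that single- vs. double-precision gap is easy to put a number on: binary32 carries about 2**-24 relative rounding error, binary64 about 2**-53, and a well-implemented FFT only amplifies this by a factor that grows slowly with the transform length. a minimal pure-python sketch (the struct round-trip emulating float32, and the sample value, are just an illustration, not QE code):

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a python float (binary64) to the nearest binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# relative error of storing one generic value in single precision
x = math.pi / 7.0
rel32 = abs(to_f32(x) - x) / x
print(f"float32 relative error: {rel32:.2e}")  # on the order of 2**-24 ~ 6e-8

# binary64 keeps ~2**-53 ~ 1.1e-16 instead, i.e. eight to nine more
# decimal digits of headroom
assert 0 < rel32 < 2**-23
```

losing ~8 digits sounds bad, but an FFT result accurate to ~1e-7 relative error is still far below the convergence thresholds of a typical SCF cycle, which is why the FFTs are the one "safe" place to trade precision for speed.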
> and it is not a problem of single precision, but of lacking
> double-precision performance, because consumer cards have only a
> fraction of the double-precision units. another issue is the lack of
> RAM. also, you have to distinguish between different GTX cards: a few
> of the most high-end consumer cards *do* have the full set of
> double-precision units and a large amount of RAM.
>
> ECC is mostly relevant for people running a large number of GPUs in a
> supercomputer environment.
>
> > It is not necessary to write a stand-alone code. We can test QE-GPU
> > with both TESLA and/or GTX and QE (CPU only), and compare the
> > outputs.
>
> but it is pointless to run on hardware that is not competitive. you'll
> already have a hard time getting a 2x speedup from a top-level tesla
> card vs. an all-CPU run on a decent machine. what would be the point
> of having the GPU _decelerate_ your calculation?
>
> in general, a lot of the GPU stuff is hype and misinformation. the
> following is a bit old, but still worth a read:
>
> http://www.hpcwire.com/2011/12/13/ten_ways_to_fool_the_masses_when_giving_performance_results_on_gpus/
>
> as a consequence of a very smart and successful PR strategy, there is
> now the impression that *any* kind of GPU will result in a *massive*
> speedup. even people with a laptop GPU with 2 SMs and no memory
> bandwidth are now expecting 100x speedups and more. however, except
> for a few corner cases and applications that map very well onto a GPU
> (not very complex) and badly onto a CPU, you will often get more like
> a 2x-5x speedup in a "best effort" comparison of a well-equipped host
> with a high-end GPU. in part, this situation has become worse with
> some choices made by nvidia hardware and software engineers.
> while 5 years back the difference between a consumer and a computing
> GPU was small, the consumer models have been systematically
> "downgraded" (by removing previously supported management features
> from the driver and basing consumer cards on a simplified design that
> mostly makes them mid-level GPUs).
>
> > I tested it for only one case (a rutile 3*3*2 supercell), and saw
> > that the GTX output is similar to the CPU one.
> >
> > However, it needs to be tested for different cases and different
> > clusters to be sure that the lack of ECC and double precision has no
> > effect on results.
>
> sorry, this statement doesn't make any sense. it looks to me like you
> need to spend some time learning what the technical implications of
> ECC and single-vs-double precision are (and the fact that it is the
> software that chooses which precision is used, not the hardware).
>
> this holds whether a card has ECC or not. broken memory is broken
> memory, and if it works, it works. so there is not much to test. if
> you want to find out whether your GPU has broken or borderline memory,
> run the GPU memtest. it is much more effective at finding issues than
> any other application.
>
> where ECC helps is for very long-running calculations, or calculations
> across a very large number of GPUs, when a single bit flip can render
> the entire effort useless and result in a crash. in a dense cluster
> environment or in badly cooled desktops, this is a high risk. in a
> well-set-up machine, it is less of a risk, but you have to keep in
> mind that running without ECC makes you "blind" to those errors. i run
> a cluster with a pile of Tesla GPUs and we have disabled ECC, since
> the machines run very reliably thanks to some hacking around
> restrictions that nvidia engineers placed in their drivers.
> https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness
>
> we also run consumer-level GPUs, particularly in the login nodes,
> since they work fine for development and don't cost as outrageously
> much as the tesla models. for development, however, absolute
> performance is a lesser concern.
>
> > As Filippo said earlier, for GTX cards the output may not be
> > reproducible. However, I think that due to the nature of the SCF
> > algorithm, the code can be used at least
>
> when you have memory corruption due to bad/overheated memory, no SCF
> algorithm will save you. if you go back 10 years, when CPUs didn't
> have all those power-management and automatic self-protection features
> and memory modules in desktops were often of very low quality, people
> experienced a lot of problems. "signal 11" and "segmentation fault"
> were a common topic on many mailing lists for scientific (or other)
> software that caused a high CPU load.
>
> but the indication of broken memory was usually a crash due to a
> segfault, or bad data corruption leading to a massive change in the
> numbers and often to NaNs. once you have a single NaN in your data, it
> will spread like a highly infective virus and render the calculation
> invalid.
>
> a well-set-up consumer-level GPU will run as reliably as a tesla or
> better, only you cannot tell, since the nvidia tools will not show
> you. the main issues are performance and available memory.
>
> > for VC-RELAX, RELAX, and SCF types of calculations with GTX cards.
> > Of course, it should be tested. Thank you for your interest.
>
> you are not making much sense here either. but if it makes you feel
> better to do those tests, don't let me discourage you. sometimes
> people learn best this way.
>
> axel.
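the "NaN spreads like a virus" point above is trivial to demonstrate in any language; a toy python example (the array size and corrupted index are made up):

```python
import math

# one corrupted element -- e.g. from a bit flip in bad memory -- in an
# otherwise healthy array of "forces"
forces = [1.0] * 1000
forces[417] = float('nan')   # hypothetical corruption site

total = sum(forces)
print(total)                 # nan -- the whole reduction is lost
assert math.isnan(total)

# and every later operation that touches it stays nan
assert math.isnan(total * 0.0)
assert math.isnan(total - total)
```

note that not even `total - total` or multiplying by zero recovers a finite number: once a NaN enters the state of an iterative calculation, every subsequent iteration is invalid.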
> > --------------------------------------------
> > On Fri, 6/20/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
> >
> > Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not
> > on CPU-GPU (pw-gpu.x)
> > To: "PWSCF Forum" <pw_forum at pwscf.org>
> > Date: Friday, June 20, 2014, 2:19 PM
> >
> > On Fri, Jun 20, 2014 at 4:22 AM, Reza Behjatmanesh-Ardakani
> > <reza_b_m_a at yahoo.com> wrote:
> > > Dear Filippo
> > >
> > > Due to the nature of QE, which is iterative, I think the lack of
> > > ECC and even of double-precision floating point in gaming cards
> > > (GTX), compared to tesla cards, is not a serious problem for
> > > QE-GPU. Some authors have checked this for the AMBER molecular
> > > dynamics simulation code. See the following site:
> >
> > classical MD is a very different animal than what you do with QE.
> > errors in some properties due to single precision are huge with
> > all-single-precision calculations. computing a force from a distance
> > will not be much affected, but summing up the forces can already be
> > a problem. "good" classical MD codes usually employ a
> > mixed-precision approach, where only the accuracy-insensitive parts
> > are done in single precision. for very large systems, even double
> > precision can show significant floating-point truncation errors.
> > usually you are dependent on error cancellation, too, i.e. when you
> > study a simple homogeneous system (as is quite common in those
> > tests).
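the "summing up the forces can already be a problem" point is the classic absorption failure of single precision: once the accumulator is large, small contributions round away entirely. a sketch, emulating float32 with the struct module (the numbers are illustrative, not taken from any MD code):

```python
import struct

def f32(x: float) -> float:
    """Round a python float (binary64) to the nearest binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

acc = f32(2.0 ** 24)     # 16777216.0: float32 can no longer resolve +1
acc = f32(acc + 1.0)     # the small contribution is rounded away
print(acc)               # 16777216.0 -- unchanged

assert acc == 2.0 ** 24
# binary64 handles the same sum exactly
assert 2.0 ** 24 + 1.0 == 16777217.0
```

this is exactly why mixed-precision MD codes compute individual pair forces in single precision but keep the force *accumulators* (and energies) in double precision, or use fixed-point accumulation.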
> > > http://www.hpcwire.com/2014/03/13/ecc-performance-price-worth-gpus
> > >
> > > and see the following paper:
> > >
> > > www.rosswalker.co.uk/papers/2014_03_ECC_AMBER_Paper_10.1002_cpe.3232.pdf
> > >
> > > I encourage the users of QE-GPU to test it for QE, and report the
> > > difference on the site.
> >
> > it is a waste of time and effort. people have done DFT and HF in
> > (partial) single precision before, and you will only succeed if you
> > write a new code from scratch and have an extremely skilled
> > programmer. have a look at the terachem software out of the group of
> > todd martinez, for example.
> >
> > axel.
> >
> > > PS: to be able to test the results for GTX and TESLA, the QE-GPU
> > > code needs to be run on GTX :-)

--
Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum
