On Thu, Jan 21, 2016 at 4:07 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>
> Jeff Hammond <jeff.scie...@gmail.com> writes:
>
> > Just using Intel compilers, OpenMP and MPI.  Problem solved :-)
> >
> > (I work for Intel and the previous statement should be interpreted as a
> > joke,
>
> Good!
>
> > although Intel OpenMP and MPI interoperate as well as any
> > implementations of which I am aware.)
>
> Better than MPC (not that I've used it)?
>

MPC is a great idea, although it poses some challenges w.r.t. globals and
such (however, see below).  Unfortunately, since "MPC conforms to the POSIX
Threads, OpenMP 3.1 and MPI 1.3 standards"
(http://mpc.hpcframework.paratools.com/), it does not do me much good (I'm
a heavy-duty RMA user).

For those who are interested in MPC, the Intel compilers (on Linux)
support an option that changes how thread-local storage (TLS) is handled
so that MPC works:

-f[no-]mpc_privatize
          Enables privatization of all static data for the MPC
          unified parallel runtime.  This will cause calls to
          extended thread local storage resolution run-time routines
          which are not supported on standard linux distributions.
          This option is only usable in conjunction with the MPC
          unified parallel runtime.  The default is -fno-mpc-privatize.
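
To make the globals issue concrete: the kind of pattern below
(hypothetical code, not taken from MPC or any real application) is
harmless when ranks are processes, but needs privatization when ranks
become threads in one process.

#include <mpi.h>
#include <stdio.h>

/* Per-rank state kept in a file-scope variable.  With rank-per-process
   each rank gets its own copy; with rank-per-thread (as in MPC) every
   rank would share this one copy unless the compiler/runtime privatizes
   it, which is what -fmpc_privatize is for. */
static int my_rank = -1;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("rank %d\n", my_rank);
    MPI_Finalize();
    return 0;
}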

>
> For what it's worth, you have to worry about the batch resource manager
> as well as the MPI, and you may need to ensure you're allocated complete
> nodes.  There are known problems with IMPI and SGE specifically, and
> several times I've made users a lot happier with OMPI/GCC.
>

This is likely because GCC uses one OpenMP thread when the user does not
set OMP_NUM_THREADS, whereas Intel will use one per virtual processor
(divided by MPI processes, but only if it can figure out how many).  Both
behaviors are compliant with the OpenMP standard.  GCC is doing the
conservative thing, whereas Intel is trying to maximize performance in the
case of OpenMP-only applications (more common than you think) and
MPI+OpenMP applications where Intel MPI is used.  Since experienced HPC
users always set OMP_NUM_THREADS (and OMP_PROC_BIND, OMP_WAIT_POLICY, or
their implementation-specific equivalents) explicitly anyway, this should
not be a problem.

As for not getting complete nodes, one is either in the cloud or in a
shared debug queue, and in either case performance is secondary.  But as
always, one should be able to set OMP_NUM_THREADS, OMP_PROC_BIND, and
OMP_WAIT_POLICY to get the right behavior.
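
If there is ever any doubt about what a particular compiler's runtime
does by default, a trivial check like this (standard OpenMP calls only,
nothing implementation-specific) settles it; run it with and without
OMP_NUM_THREADS and OMP_PROC_BIND set.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* The number of threads a parallel region would use here by default,
       and the current binding policy (omp_proc_bind_false == 0, etc.). */
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    printf("omp_get_proc_bind()   = %d\n", (int)omp_get_proc_bind());
    return 0;
}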

My limited experience with SGE has caused me to conclude that any problems
associated with SGE + $X are almost certainly the fault of SGE and not $X.

> >> Or pray the MPI Forum and OpenMP combine and I can just look in a
> >> Standard. :D
> >>
> >>
> > echo "" > $OPENMP_STANDARD # critical step
> > cat $MPI_STANDARD $OPENMP_STANDARD > $HPC_STANDARD
> >
> > More seriously, hybrid programming sucks.  Just use MPI-3 and exploit
> > your coherence domain via MPI_Win_allocate_shared.  That way, you
> > won't have to mix runtimes, suffer mercilessly because of opaque race
> > conditions in thread-unsafe libraries, or reason about a bolt-on
> > pseudo-language that replicates features found in ISO languages
> > without a well-defined interoperability model.
>
> Sure, but the trouble is that "everyone knows" you need the hybrid
> stuff.  Are there good examples of using MPI-3 instead/in comparison?
> I'd be particularly interested in convincing chemists, though as they
> don't believe in deadlock and won't measure things, that's probably a
> lost cause.  Not all chemists, of course.

PETSc:
http://www.orau.gov/hpcor2015/whitepapers/Exascale_Computing_without_Threads-Barry_Smith.pdf
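
To make the MPI_Win_allocate_shared suggestion above concrete, here is a
minimal sketch of the pattern (placeholder array name and size, no error
handling); it is not from PETSc or NWChem, just the bare idea.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ranks in the same coherence domain (node) form one communicator. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    /* Rank 0 on the node allocates the shared array; everyone maps it. */
    const MPI_Aint nelem = 1024;                       /* placeholder size */
    MPI_Aint bytes = (nrank == 0) ? nelem * sizeof(double) : 0;
    double *table;
    MPI_Win win;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node,
                            &table, &win);
    if (nrank != 0) {
        MPI_Aint qsize;
        int qdisp;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &table);
    }

    /* Plain loads/stores into 'table' replace shared-memory threading;
       synchronize with fences (or passive-target sync) as needed. */
    MPI_Win_fence(0, win);
    if (nrank == 0) table[0] = 42.0;
    MPI_Win_fence(0, win);
    printf("node rank %d of %d sees table[0] = %g\n", nrank, nsize, table[0]);

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}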

Quantum chemistry or molecular dynamics?  Parts of quantum chemistry are
so flop-heavy that stupid fork-join MPI+OpenMP is just fine.  I'm doing
this in the NWChem coupled-cluster codes.  I fork-join in every kernel,
even though this is shameful, because my kernels do somewhere between 4
and 40 billion
FMAs and touch between 0.5 and 5 GB of memory.  For methods that aren't
coupled-cluster, OpenMP is not always a good solution, and certainly not
for legacy codes that aren't thread-safe.  OpenMP may be useful within a
core to exploit >1 thread per core (if necessary) and certainly "#pragma
omp simd" should be exploited when appropriate, but scaling OpenMP beyond
~4 threads in most quantum chemistry codes requires an extensive rewrite.
Because of load-balancing issues in atomic integral computations, TBB or
OpenMP tasking may be more appropriate.
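
For what it's worth, the fork-join shape I am describing is nothing
fancier than this (a generic sketch with a placeholder contraction, not
the actual NWChem kernel):

#include <stddef.h>

/* Each flop-heavy kernel opens and closes its own parallel region:
   the runtime forks at the pragma and joins at the end of the loop. */
void kernel(size_t n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        #pragma omp simd reduction(+:sum)
        for (size_t j = 0; j < n; ++j)
            sum += a[i*n + j] * b[j*n + i];  /* placeholder contraction */
        c[i] += sum;
    }
}   /* implicit join here; the next kernel forks again */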

If you want to have a more detailed discussion of programming models for
computational chemistry, I'd be happy to take that discussion offline.

Best,

Jeff



--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
