On Thu, Jan 21, 2016 at 4:07 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>
> Jeff Hammond <jeff.scie...@gmail.com> writes:
>
> > Just using Intel compilers, OpenMP and MPI. Problem solved :-)
> >
> > (I work for Intel and the previous statement should be interpreted as a
> > joke,
>
> Good!
>
> > although Intel OpenMP and MPI interoperate as well as any
> > implementations of which I am aware.)
>
> Better than MPC (not that I've used it)?
>
MPC is a great idea, although it poses some challenges w.r.t. globals and
such (however, see below). Unfortunately, since "MPC conforms to the POSIX
Threads, OpenMP 3.1 and MPI 1.3 standards"
(http://mpc.hpcframework.paratools.com/), it does not do me much good (I'm a
heavy-duty RMA user, and MPI 1.3 has no RMA).

For those who are interested in MPC, the Intel compilers (on Linux) support
an option to change how TLS works so that MPC works:

  -f[no-]mpc_privatize
      Enables privatization of all static data for the MPC unified parallel
      runtime. This will cause calls to extended thread-local-storage
      resolution run-time routines which are not supported on standard Linux
      distributions. This option is only usable in conjunction with the MPC
      unified parallel runtime. The default is -fno-mpc-privatize.

> For what it's worth, you have to worry about the batch resource manager
> as well as the MPI, and you may need to ensure you're allocated complete
> nodes. There are known problems with IMPI and SGE specifically, and
> several times I've made users a lot happier with OMPI/GCC.

This is likely because GCC uses one OpenMP thread when the user does not set
OMP_NUM_THREADS, whereas Intel will use one per virtual processor (divided by
the number of MPI processes, but only if it can figure out how many there
are). Both behaviors are compliant with the OpenMP standard. GCC is doing the
conservative thing, whereas Intel is trying to maximize performance in the
case of OpenMP-only applications (more common than you might think) and
MPI+OpenMP applications where Intel MPI is used.

As experienced HPC users always set OMP_NUM_THREADS (and OMP_PROC_BIND,
OMP_WAIT_POLICY, or their implementation-specific equivalents) explicitly
anyway, this should not be a problem.

As for not getting complete nodes: in that case one is either in the cloud or
in a shared debug queue, and performance is secondary. But as always, one
should be able to set OMP_NUM_THREADS, OMP_PROC_BIND, and OMP_WAIT_POLICY to
get the right behavior.

My limited experience with SGE has caused me to conclude that any problems
associated with SGE + $X are almost certainly the fault of SGE and not $X.

> >> Or pray the MPI Forum and OpenMP combine and I can just look in a
> >> Standard. :D
> >>
> >
> > echo "" > $OPENMP_STANDARD # critical step
> > cat $MPI_STANDARD $OPENMP_STANDARD > $HPC_STANDARD
> >
> > More seriously, hybrid programming sucks. Just use MPI-3 and exploit your
> > coherence domain via MPI_Win_allocate_shared. That way, you won't have to
> > mix runtimes, suffer mercilessly because of opaque race conditions in
> > thread-unsafe libraries, or reason about a bolt-on pseudo-language that
> > replicates features found in ISO languages without a well-defined
> > interoperability model.
>
> Sure, but the trouble is that "everyone knows" you need the hybrid
> stuff. Are there good examples of using MPI-3 instead/in comparison?
> I'd be particularly interested in convincing chemists, though as they
> don't believe in deadlock and won't measure things, that's probably a
> lost cause. Not all chemists, of course.

PETSc is one good example of doing without threads; see Barry Smith's
"Exascale Computing without Threads" position paper:
http://www.orau.gov/hpcor2015/whitepapers/Exascale_Computing_without_Threads-Barry_Smith.pdf

Quantum chemistry or molecular dynamics? Parts of quantum chemistry are so
flop-heavy that stupid fork-join MPI+OpenMP is just fine. I'm doing this in
the NWChem coupled-cluster codes. I fork-join in every kernel even though
this is shameful, because my kernels do somewhere between 4 and 40 billion
FMAs and touch between 0.5 and 5 GB of memory.
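Since you asked for examples, here is roughly what the shared-window pattern
looks like in its simplest form. This is an untested sketch with error
checking omitted; the slice size and the final print are made up for
illustration and aren't taken from any real application. Within the node
everything is plain loads and stores on the shared window; across nodes you
keep using ordinary MPI.

/* Untested sketch: one shared buffer per node, partitioned among the ranks
 * on that node. Error checking omitted; slice_len and the final print are
 * placeholders for illustration only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split COMM_WORLD into per-node (coherence-domain) communicators. */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int nrank;
    MPI_Comm_rank(nodecomm, &nrank);

    /* Each rank contributes one slice of the node-local shared buffer. */
    const MPI_Aint slice_len = 1024; /* doubles per rank (made up) */
    double *mybase;
    MPI_Win win;
    MPI_Win_allocate_shared(slice_len * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win); /* passive-target epoch for load/store use */

    /* Any rank's slice is directly load/store accessible; query rank 0's base. */
    MPI_Aint qsize;
    int qdisp;
    double *base0;
    MPI_Win_shared_query(win, 0, &qsize, &qdisp, &base0);

    /* Write my own slice with plain stores. */
    for (MPI_Aint i = 0; i < slice_len; i++)
        mybase[i] = (double)nrank;

    /* Usual shared-window synchronization: sync, barrier, sync. */
    MPI_Win_sync(win);
    MPI_Barrier(nodecomm);
    MPI_Win_sync(win);

    if (nrank == 1)
        printf("rank 1 reads rank 0's first element: %f\n", base0[0]);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}

The synchronization stays explicit (MPI_Win_sync plus a barrier here), so you
never have to reason about a second runtime's memory model.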
For methods that aren't coupled-cluster, OpenMP is not always a good
solution, and it is certainly not one for legacy codes that aren't
thread-safe. OpenMP may be useful within a core to exploit more than one
thread per core (if necessary), and "#pragma omp simd" should certainly be
exploited when appropriate, but scaling OpenMP beyond ~4 threads in most
quantum chemistry codes requires an intensive rewrite. Because of
load-balancing issues in atomic-integral computation, TBB or OpenMP tasking
may be more appropriate (a toy sketch of what I mean is below my signature).

If you want to have a more detailed discussion of programming models for
computational chemistry, I'd be happy to take that discussion offline.

Best,

Jeff

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
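PS: Here is a toy, untested illustration of the tasking pattern I mean for
irregular integral-style work, where per-unit cost varies by orders of
magnitude and a static worksharing loop balances poorly. The shell count,
compute_block(), and its cost model are all invented for illustration; the
only point is one task per irregular work unit, with the runtime doing the
load balancing.

/* Toy sketch (untested): OpenMP tasks over irregularly sized work units,
 * e.g. shell pairs whose integral cost varies by orders of magnitude.
 * compute_block() and its cost are placeholders, not real code. */
#include <omp.h>
#include <stdio.h>

#define NSHELL 64

/* Placeholder for an expensive, irregular kernel (e.g. an integral block). */
static double compute_block(int i, int j)
{
    double s = 0.0;
    long work = 1000L * (i + 1) * (j + 1); /* wildly varying cost */
    for (long k = 0; k < work; k++)
        s += 1.0 / (double)(k + i + j + 1);
    return s;
}

int main(void)
{
    double total = 0.0;

    #pragma omp parallel
    #pragma omp single          /* one producer thread creates the tasks */
    {
        for (int i = 0; i < NSHELL; i++) {
            for (int j = 0; j <= i; j++) {
                /* One task per shell pair; the runtime load-balances. */
                #pragma omp task firstprivate(i, j) shared(total)
                {
                    double t = compute_block(i, j);
                    #pragma omp atomic
                    total += t;
                }
            }
        }
    } /* all tasks complete at the implicit barrier ending the single */

    printf("total = %f (threads available: %d)\n", total, omp_get_max_threads());
    return 0;
}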