Re: [Meep-discuss] Assorted issues/questions with parallel c++ Meep

2014-02-14 Thread John Ball
Thanks for the responses. I've hit a couple of snags that prevent me
from having all the information, but here is the top output for one of
the 4 nodes on which I just ran the simulation (i.e., the job was run
on 4 nodes, using 3 of the cores on each node). The other 3 nodes
showed basically the same usage.

It appears that the total memory being used according to top is
approximately 75-80 GB (this checks out with the system's utility that
reports memory usage for a job). I have yet to figure out how to run
free on a node where my job is currently running.


  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME  P COMMAND
14847 balljm  25   0 6430m 6.1g 6352 R 99.3 25.8  20:04 22 PARSIM
14848 balljm  25   0 8914m 8.5g 6356 R 99.3 36.1  20:03  9 PARSIM
14846 balljm  25   0 6717m 6.4g 6524 R 97.3 27.0  20:06 15 PARSIM
    1 root    15   0 10344  732  608 S  0.0  0.0   0:02  7 init

Unfortunately, I'm not able to run the job on a 72GB node right now
for comparison, because that queue is currently occupied by another
user. Previously, this same job took ~30 GB in serial mode.



Hello,

Meep uses MPI parallelization, i.e. it assumes distributed memory for each
processor. Thus, when you increase the number of processors, each processor
gets its own chunk of the 3d simulation box. Of course, information must be
able to flow between these chunks, which means the chunks are a little bit
larger than just the subdivision of the space (they have a halo), and at
each time step information is exchanged between the neighboring chunks.
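
If it helps to picture what happens at a chunk boundary, here is a minimal,
generic MPI sketch of one time step with a one-cell halo in a 1d
decomposition. This is only an illustration of the idea, not Meep's actual
chunk code, and the field update is just a placeholder stencil:

// Schematic 1d halo exchange with a one-cell halo on each side.
// Illustration only -- Meep's real chunk/boundary handling is more involved.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1000;                 // interior cells owned by this process
  std::vector<double> f(n + 2, 0.0);  // +2 for the left/right halo cells

  int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

  for (int step = 0; step < 100; ++step) {
    // exchange halo cells with the neighbors before updating
    MPI_Sendrecv(&f[1], 1, MPI_DOUBLE, left, 0,
                 &f[n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[n], 1, MPI_DOUBLE, right, 1,
                 &f[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // local update that needs the neighbor values (placeholder stencil)
    std::vector<double> g(f);
    for (int i = 1; i <= n; ++i)
      f[i] = 0.5 * (g[i - 1] + g[i + 1]);
  }

  MPI_Finalize();
  return 0;
}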

So how much the memory increases with the number of processors depends
largely on the ratio of chunk volume to chunk surface for your problem,
i.e. how many cells are in the simulation box and how many cells end up in
each chunk. You should check with a back-of-envelope calculation whether
the memory scaling you see is consistent with what you would expect.
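
As a concrete sketch of such a back-of-envelope check (the split direction,
halo thickness and bytes-per-cell values below are placeholder assumptions,
not numbers taken from Meep):

// Rough memory estimate for a run split into N chunks along one axis.
// Halo thickness and bytes per cell are placeholder assumptions.
#include <cstdio>

int main() {
  const double nx = 500, ny = 1000, nz = 500;  // total cells per direction
  const int N = 8;                             // number of MPI processes
  const double halo = 6;                       // assumed halo thickness (cells)
  const double bytes_per_cell = 100;           // assumed bytes per cell

  // split along y: each chunk is nx x (ny/N) x nz plus a halo shell
  const double cx = nx, cy = ny / N, cz = nz;
  const double chunk_cells = cx * cy * cz;
  const double halo_cells = 2.0 * (cx * cy + cy * cz + cx * cz) * halo;

  const double serial_GB = nx * ny * nz * bytes_per_cell / 1e9;
  const double parallel_GB = N * (chunk_cells + halo_cells) * bytes_per_cell / 1e9;

  std::printf("serial ~ %.1f GB, parallel ~ %.1f GB (%.0f%% overhead)\n",
              serial_GB, parallel_GB,
              100.0 * (parallel_GB - serial_GB) / serial_GB);
  return 0;
}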

The speedup with respect to the number of processors depends on the same
issues.

It is clear that parallelization always comes at a performance cost and
usually scales quite a bit worse than linearly. Your jobs will take the
least computational time if you submit many long-running single-processor
jobs at the same time rather than multi-core jobs sequentially. If the
simulations just take a few minutes or hours like you wrote, I would do
that; if they take much longer, they probably involve many more cells
(higher resolution) and will be more efficient to parallelize.

Best wishes,
  Georg


Unfortunately, the 10-30 minute simulations are only benchmarking
runs. The total run time is ~35 hours. Even a 3x speed-up is a welcome
improvement, but the memory increase introduces other challenges.

Here's my thinking: if I'm using N cores, and I take the memory used in
serial mode as 100%, then the memory use I'd expect to see in parallel
would be 100% * (1 + N*halo_volume/chunk_volume). So the memory bloat
should be about 100% * N*halo_volume/chunk_volume.

Doing some rounding, my total simulation volume is about 500 x 1000 x
500 cells. A reasonable way to divide this with N=8 (I don't know how
it's actually done) would be to make 8 chunks of 500 x 250 x 250 cells each.

The volume of a chunk's halo should be ~ 2 x (500*250 + 500*250 +
250*250) * halo_thickness. This is ~625k cells for each layer of
cells in the halo.

A chunk's volume will be about 500 x 250 x 250 = 31250k cells. I don't
know how thick the halo really is, but with 8 cores, I should get
about a 16% increase for every layer of cells in the halo.

This would imply that the halo is about 6-7 cells thick (going from ~30 GB
to ~62 GB is a ~107% increase, and 107%/16% is about 6.7). Of course there
is rounding error all over the place here, not to mention that I'm
ignoring the fact that the halo sits outside the nominal boundaries of a
chunk, so it will be slightly larger than what I've estimated here.

Re-running this math using the job I show above instead (N=12, 12
chunks of ~333 x 250 x 250 cells), I'd expect a 26% increase for every
layer of cells, and given the 77 GB figure (a ~157% increase, and
157%/26% is again about 6), I get a halo thickness of about 6 cells.
Consistent! That's nice.
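
For what it's worth, here is that arithmetic as a small throwaway C++
snippet, using the same rounded cell counts and the same bloat formula as
above, so it inherits all of the same assumptions and caveats:

// Re-doing the back-of-envelope halo estimate from the paragraphs above.
// Uses the bloat formula N * halo_volume / chunk_volume per halo layer.
#include <cstdio>

static double implied_halo_thickness(int N, double cx, double cy, double cz,
                                     double serial_GB, double parallel_GB) {
  const double chunk_cells = cx * cy * cz;
  const double halo_cells_per_layer = 2.0 * (cx * cy + cy * cz + cx * cz);
  const double bloat_per_layer = N * halo_cells_per_layer / chunk_cells;
  const double observed_bloat = (parallel_GB - serial_GB) / serial_GB;
  return observed_bloat / bloat_per_layer;  // implied halo layers
}

int main() {
  // N=8 run: chunks of ~500 x 250 x 250 cells, 30 GB serial -> 62 GB parallel
  std::printf("N=8 : halo ~ %.1f cells thick\n",
              implied_halo_thickness(8, 500, 250, 250, 30, 62));
  // N=12 run: chunks of ~333 x 250 x 250 cells, 30 GB serial -> 77 GB parallel
  std::printf("N=12: halo ~ %.1f cells thick\n",
              implied_halo_thickness(12, 333, 250, 250, 30, 77));
  return 0;
}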


Am I on the right track? Does this seem reasonable? I'd love to know
where I'm missing something.


Again, thanks to everyone for the help.



Re: [Meep-discuss] Assorted issues/questions with parallel c++ Meep

2014-02-13 Thread Tran Quyet Thang
John Ball ballman2010@... writes:

 Hello all, I'm trying to run my c++ Meep script in parallel. I've found
 little documentation on the subject, so I'm hoping to make a record of how
 to do it here on the mailing list as well as to clear up some of my own
 confusion and questions about the issue.

 My original, bland, serial c++ compilation command comes straight from the
 Meep c++ tutorial page:

 g++ `pkg-config --cflags meep` main.cpp -o SIM `pkg-config --libs meep`

 where I've used

 export PKG_CONFIG_PATH=/usr/local/apps/meep/lib/pkgconfig

 so that pkg-config knows where in the world the meep.pc file is.

 Then I can simply run the compiled code with: ./SIM
 
 In parallel, the equivalent process I've settled upon using is as follows:
 
 First, I've changed the #include statement at the beginning of main.cpp to
point to the header file from the parallel install (not sure if this is
necessary, but it works):
 
 #include "/usr/local/apps/meep/1.2.1/mpi/lib/include/meep.hpp"
 
 To compile:
 
 mpic++ `pkg-config --cflags meep_mpi` par_main.cpp -o PAR_SIM `pkg-config
--libs meep_mpi`
 
 where I've told pkg-config to instead look for meep_mpi.pc:
 
 export PKG_CONFIG_PATH=/usr/local/apps/meep/1.2.1/mpi/lib/pkgconfig
 
 to run this, I send this command to the job scheduler:
 
 (...)/mpirun    -np $N    ./PAR_SIM
 
 where I choose N depending on the kind of node(s) I'm submitting to.
 
 This runs fine. Now I'm going to talk about performance: When submitting a
particular job to a single 16-core node with 72 GB of memory, if I set N=1,
the memory usage is 30 GB, and the simulation runs at about 8.7 sec/step.
The job took about 35 minutes. When I instead set N=8, the memory usage is
62 GB, and it runs at about 2.8 sec/step. The total simulation takes about
12 minutes. So! Are these numbers to be expected? A ~3x speedup going from
1 to 8 cores is less than I'd hoped for, but perhaps reasonable. What
concerns me more, though, is that while I suspected that I'd see some
memory usage increase, I did not expect to see a twofold increase when I
went from 1 to 8 cores. I want to verify that this behavior is normal and
I'm not misusing the code or screwing up its compilation somehow.
 
 Finally, just asking for some advice: I could feasibly break the job up
and, instead of using a single 16-core, 72 GB node like I mention above,
use, for example, 9 dual-core, 8 GB nodes instead. My guess is that doing
so would increase the overhead due to network communications between the
nodes. However, what about memory usage? Does anyone have experience with
this? Furthermore, are there any tips or best practices for conditioning
the simulation and/or configuration to maximize throughput?
 
 Thanks in advance!

It is well known that FDTD simulation is memory-bandwidth bound: it scales
well with increasing memory bandwidth, in contrast with processing (FLOPS)
power.

As you try to increase the number of cores in an SMP configuration, the
total available memory bandwidth of the system does not increase; the
memory controller(s) are merely utilized (saturated) more effectively. That
is why a cluster with fewer processing cores but more memory controllers
provides better speedup (versus cost) in FDTD than a single multi-core CPU.
In other words, multi-core CPUs are a poor choice for FDTD calculations. In
fact, my current FDTD server is a dual-CPU machine with 4 cores per CPU,
but with ~100 GB/s of memory bandwidth to feed the calculation.
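
As a rough illustration only (the bytes-per-cell-per-step and bandwidth
figures below are order-of-magnitude guesses, not measured Meep values), a
bandwidth-limited lower bound on the time per step can be estimated like
this:

// Crude bandwidth-limited estimate of FDTD time per step.
// Both constants below are order-of-magnitude guesses, not Meep numbers.
#include <cstdio>

int main() {
  const double cells = 500.0 * 1000.0 * 500.0;  // ~2.5e8 cells, as in this thread
  const double bytes_per_cell_per_step = 200.0; // assumed: fields + materials, read+write
  const double bandwidth_GBps = 20.0;           // assumed sustained memory bandwidth

  const double sec_per_step = cells * bytes_per_cell_per_step / (bandwidth_GBps * 1e9);
  std::printf("~%.1f s per time step at %.0f GB/s\n", sec_per_step, bandwidth_GBps);
  // Adding cores on the same socket hardly changes this estimate,
  // because they all share the same memory controllers.
  return 0;
}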

In my experience with mpi-meep, relatively little memory overhead was
encountered, especially in larger simulations; perhaps you (double-)counted
the shared memory space. The Unix free and top commands would provide a
good estimate of the real free memory. Could you post the results of free
and top here?



___
meep-discuss mailing list
meep-discuss@ab-initio.mit.edu
http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/meep-discuss