Thanks for the responses. I've hit a couple of snags that prevent me
from having all the information, but here is the TOP output for 1 of
the 4 nodes on which I just ran the simulation. (i.e., the job was run
on 4 nodes, using 3 of the cores on each node). The other 3 nodes
showed basically the same usage.
It appears that the total memory being used according to TOP is
approximately 75-80GB (this checks out with the system's utility that
reports memory usage for a job). I have yet to figure out how to run
FREE on a node where my job is currently running.
  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM  TIME  P COMMAND
14847 balljm  25   0 6430m 6.1g 6352 R 99.3 25.8 20:04 22 PARSIM
14848 balljm  25   0 8914m 8.5g 6356 R 99.3 36.1 20:03  9 PARSIM
14846 balljm  25   0 6717m 6.4g 6524 R 97.3 27.0 20:06 15 PARSIM
    1 root    15   0 10344  732  608 S  0.0  0.0  0:02  7 init
Unfortunately, I'm not able to run the job on a 72GB node right now
for comparison, because that queue is currently occupied by another
user. Previously, this same job took ~30 GB in serial mode.
Hello,
Meep uses MPI parallelization, i.e. it assumes distributed memory for each
processor. Thus, when you increase the number of processors, each processor
gets its own chunk of the 3D simulation box. Of course, information must be
able to flow between these chunks, which means the chunks are a little
larger than a plain subdivision of the space (they have a halo), and at
each time step information is exchanged between neighboring chunks.
So how much the memory increases with the number of processors depends
largely on the ratio of chunk volume to chunk surface for your problem,
i.e. how many cells are in the simulation box and how many cells you need
per chunk. You should check with a back-of-the-envelope calculation
whether the memory scaling you see is consistent with what you would expect.
The speedup with respect to the number of processors depends on the same
issues.
It is clear that parallelization always comes at a performance cost and
usually scales quite a bit worse than linearly. Your jobs will take the
least total computational time if you submit many long-running
single-processor jobs at the same time rather than running multi-core jobs
sequentially. If the simulations just take a few minutes or hours, as you
wrote, I would do that; if they take much longer, they probably involve
many more cells (higher resolution) and will parallelize more efficiently.
Best wishes,
Georg
Unfortunately, the 10-30 minute simulations are only benchmarking
runs. The total run time is ~35 hours. Even a 3x speed-up is a welcome
improvement, but the memory increase introduces other challenges.
Here's my thinking: if I'm using N cores, and I normally use 100%
memory in serial mode, then the memory use I'd expect to see in
parallel would be 100% * (1 + N*halo_volume/chunk_volume). So the
memory bloat should be about 100% * N*halo_volume/chunk_volume.
Doing some rounding, my total simulation volume is about 500 x 1000 x
500 cells. A reasonable way to divide this with N=8 (I don't know how
it's actually done) would be into 8 chunks of 500 x 250 x 250 cells each.
The volume of a chunk's halo should be ~ 2 x (500*250 + 500*250 +
250*250) * halo_thickness. This is ~625k cells for each layer of
cells in the halo.
A chunk's volume will be about 500 x 250 x 250 = 31250k cells. I don't
know how thick the halo really is, but with 8 cores, I should get
about a 16% increase for every layer of cells in the halo.
This would imply that the halo is about 6 cells thick. Of course there
is rounding error all over the place here, not to mention that I'm
ignoring the fact that the halo will be outside the normal boundaries
of a chunk, so will be slightly larger than what I've estimated here.
Re-running this math for the job I show above instead (N=12, 12
chunks of ~333x250x250 cells), I'd expect a 26% increase for every
layer of cells, and given the 77 GB figure, I again get a halo
thickness of about 6 cells. Consistent! That's nice.
Am I on the right track? Does this seem reasonable? I'd love to know
where I'm missing something.
Again, thanks to everyone for the help.
On Thu, Feb 13, 2014 at 3:42 AM, Tran Quyet Thang
<tranquyetthang3...@gmail.com> wrote:
John Ball ballman2010@... writes:
Hello all,

I'm trying to run my C++ Meep script in parallel. I've found little
documentation on the subject, so I'm hoping to make a record here on the
mailing list of how to do it, as well as to clear up some of my own
confusion and questions about the issue.
My original, bland, serial C++ compilation command comes straight from the
Meep C++ tutorial page:

g++ `pkg-config --cflags meep` main.cpp -o SIM `pkg-config --libs meep`

where I've used

export PKG_CONFIG_PATH=/usr/local/apps/meep/lib/pkgconfig

so that pkg-config knows where in the world the meep.pc file is. Then I
can simply run the compiled code with:

./SIM
In parallel, the