Hello all, I'm trying to run my C++ Meep script in parallel. I've found little documentation on the subject, so I'm hoping to make a record here on the mailing list of how to do it, as well as to clear up some of my own confusion and questions about the issue.
My original, plain serial C++ compilation command comes straight from the Meep C++ tutorial page:

    g++ `pkg-config --cflags meep` main.cpp -o SIM `pkg-config --libs meep`

where I've used

    export PKG_CONFIG_PATH=/usr/local/apps/meep/lib/pkgconfig

so that pkg-config knows where the meep.pc file is. Then I can simply run the compiled code with:

    ./SIM

The equivalent process I've settled on for running in parallel is as follows. First, I changed the #include statement at the beginning of main.cpp to point to the header file from the parallel install (I'm not sure this is necessary, but it works; a sketch of my main.cpp is at the end of this message):

    #include "/usr/local/apps/meep/1.2.1/mpi/lib/include/meep.hpp"

To compile:

    mpic++ `pkg-config --cflags meep_mpi` par_main.cpp -o PAR_SIM `pkg-config --libs meep_mpi`

where I've told pkg-config to instead look for meep_mpi.pc:

    export PKG_CONFIG_PATH=/usr/local/apps/meep/1.2.1/mpi/lib/pkgconfig

To run this, I send the following command to the job scheduler, choosing N depending on the kind of node(s) I'm submitting to:

    (...)/mpirun -np $N ./PAR_SIM

This runs fine. Now for performance. Submitting a particular job to a single 16-core node with 72 GB of memory: with N=1, memory usage is 30 GB, the simulation runs at about 8.7 sec/step, and the job takes about 35 minutes. With N=8, memory usage is 62 GB, it runs at about 2.8 sec/step, and the total simulation takes about 12 minutes.

So: are these numbers to be expected? A ~3x speedup going from 1 to 8 cores is less than I'd hoped for, but perhaps reasonable. What concerns me more is the memory: I expected some increase, but not a twofold increase going from 1 to 8 cores. I'd like to verify that this behavior is normal and that I'm not misusing the code or compiling it incorrectly.

Finally, I'd welcome some advice. Instead of using a single 16-core, 72 GB node as above, I could feasibly break the job up across, for example, nine dual-core, 8 GB nodes. My guess is that doing so would increase the overhead from network communication between the nodes, but what would happen to memory usage? Does anyone have experience with this? Furthermore, are there any tips or best practices for setting up the simulation and/or its configuration to maximize throughput? Thanks in advance!
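For reference, here is roughly the structure of my main.cpp, following the 2d example from the Meep C++ tutorial (the geometry, resolution, and source parameters below are placeholders, not my actual simulation). My understanding is that the "initialize mpi(argc, argv)" line at the top of main() is what hands argc/argv to MPI when the code is linked against the MPI build, and that with the generic #include <meep.hpp> picked up via pkg-config --cflags, the same source should compile against either the serial or the MPI library, so the hard-coded header path above may be unnecessary; I'd appreciate confirmation of that.

    #include <meep.hpp>
    using namespace meep;

    // placeholder permittivity: eps = 12 inside a block, 1 elsewhere
    double eps(const vec &p) {
      if (p.x() < 2 && p.y() < 3) return 12.0;
      return 1.0;
    }

    int main(int argc, char **argv) {
      initialize mpi(argc, argv);  // sets up MPI when built against meep_mpi; harmless for the serial build
      double resolution = 20;                    // pixels per unit distance
      grid_volume v = vol2d(5, 10, resolution);  // 5x10 2d computational cell
      structure s(v, eps, pml(1.0));             // 1-unit-thick PML on all sides
      fields f(&s);

      gaussian_src_time src(0.3, 0.1);           // center frequency 0.3, width 0.1
      f.add_point_source(Ey, src, vec(1.1, 2.3));
      while (f.time() < f.last_source_time()) f.step();

      f.output_hdf5(Hz, v.surroundings());       // write the final Hz field to HDF5
      return 0;
    }

This is compiled exactly as described above, with either the g++/meep or the mpic++/meep_mpi pkg-config pair.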