Hi Ian, Erik, Eloisa,

> I attach a very brief report of some results I obtained in 2015 after
> attending a KNC workshop.
>
>> Conclusions: By using 244 threads, with the domain split into tiles of size
>> 8 × 4 × 4 points, and OpenMP threads assigned one per tile as they become
>> available, the MIC was able to outperform the single CPU by a factor of 1.5.
>> The same tiling strategy was used on the CPU, as it has been found to give
>> good performance there in the past. Since we have not yet optimised the code
>> for the MIC architecture, we believe that further speed improvements will be
>> possible, and that solving the Einstein equations on the MIC architecture
>> should be feasible.
>
> Eloisa, are you using LoopControl? There are tiling parameters which can
> also help with performance on these devices.

How does tiling work with LoopControl? Is it documented somewhere? I naively
thought that the point of tiling was to have chunks of data stored contiguously
in memory...
BTW, at the moment I am using this macro for all of my loop needs:
#define UTILS_LOOP3(NAME,I,SI,EI,J,SJ,EJ,K,SK,EK) \
  _Pragma("omp for collapse(3)") \
  for(int I = SI; I < EI; ++I) \
  for(int J = SJ; J < EJ; ++J) \
  for(int K = SK; K < EK; ++K)
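For reference, a typical call site looks roughly like the routine below (the
function name, arrays and bounds are just for illustration, not code from my
actual thorn). I pass k as the first loop variable so that i, the unit-stride
index, ends up innermost:

/* Illustrative only: add two 3D arrays over the interior using the macro
   above.  The macro does the "omp for collapse(3)" worksharing, so it has
   to be called from inside an "omp parallel" region. */
void add_interior(double *out, const double *a, const double *b,
                  int ni, int nj, int nk)
{
#pragma omp parallel
  {
    UTILS_LOOP3(add, k, 1, nk-1, j, 1, nj-1, i, 1, ni-1)
    {
      const int idx = i + ni*(j + nj*k);
      out[idx] = a[idx] + b[idx];
    }
  }
}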
How would I convert it to something equivalent using LoopControl?
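My guess, from skimming loopcontrol.h, is something like the version below
(same illustrative routine as above). I am assuming that LC_LOOP3 takes
(name, loop variables with i fastest, lower bounds, upper bounds, allocated
array shape, i.e. cctk_ash in a thorn), that it sits inside a plain
"omp parallel" region, and that LoopControl then does the tiling and hands
tiles to the threads itself, so the explicit "omp for collapse(3)" goes away.
Please correct me if any of that is wrong:

#include <loopcontrol.h>

/* Illustrative only: same interior sum as above, written with what I
   believe is the LC_LOOP3 / LC_ENDLOOP3 interface. */
void add_interior_lc(double *out, const double *a, const double *b,
                     int ni, int nj, int nk)
{
#pragma omp parallel
  LC_LOOP3(add_lc,
           i, j, k,            /* loop variables, i unit-stride      */
           1, 1, 1,            /* lower bounds (inclusive)           */
           ni-1, nj-1, nk-1,   /* upper bounds (exclusive)           */
           ni, nj, nk)         /* allocated array shape (cctk_ash)   */
  {
    const int idx = i + ni*(j + nj*k);
    out[idx] = a[idx] + b[idx];
  }
  LC_ENDLOOP3(add_lc);
}

If that is roughly right, is this also where the tiling parameters you
mentioned come in?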
Thanks,
David
PS. Seeing that Eloisa was able to compile bbox.cc with intel-17.0.0 by using
-no-vec, I made a patch that disables vectorization with pragmas inside bbox.cc
(to avoid having to compile it by hand):
https://bitbucket.org/eschnett/carpet/pull-requests/16/carpetlib-fix-compilation-with-intel-1700/diff
