Hi Ian, Erik, Eloisa,

> I attach a very brief report of some results I obtained in 2015 after 
> attending a KNC workshop.
>> Conclusions: By using 244 threads, with the domain split into tiles of size 
>> 8 × 4 × 4 points, and OpenMP threads assigned one per tile as they become 
>> available, the MIC was able to outperform the single CPU by a factor of 1.5. 
>> The same tiling strategy was used on the CPU, as it has been found to give 
>> good performance there in the past. Since we have not yet optimised the code 
>> for the MIC architecture, we believe that further speed improvements will be 
>> possible, and that solving the Einstein equations on the MIC architecture 
>> should be feasible.
> Eloisa, are you using LoopControl?  There are tiling parameters which can 
> also help with performance on these devices.

How does tiling work with LoopControl? Is it documented somewhere? I naively 
thought that the point of tiling was to have chunks of data stored contiguously 
in memory...
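
To make the question concrete, here is a minimal sketch of loop tiling as I 
understand it, written in plain C++ with OpenMP and not with LoopControl. The 
8 x 4 x 4 tile size and the "one tile per thread as they become available" 
scheduling come from the report quoted above; the function name tiled_sum, the 
array layout (i fastest) and the use of schedule(dynamic) are assumptions made 
for illustration and say nothing about how LoopControl does it internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of LoopControl: sums a 3D array stored with
// i as the fastest-varying index, visiting the points tile by tile.
double tiled_sum(const std::vector<double>& u, int ni, int nj, int nk,
                 int ti = 8, int tj = 4, int tk = 4)
{
    assert(u.size() == static_cast<std::size_t>(ni) * nj * nk);
    double sum = 0.0;

    // Each iteration of the collapsed outer loops is one tile.
    // schedule(dynamic) hands the next tile to whichever thread is free.
    // When compiled without OpenMP the pragma is ignored and the loop is serial.
    #pragma omp parallel for collapse(3) schedule(dynamic) reduction(+:sum)
    for (int k0 = 0; k0 < nk; k0 += tk)
    for (int j0 = 0; j0 < nj; j0 += tj)
    for (int i0 = 0; i0 < ni; i0 += ti) {
        // Clip the tile where the domain size is not a multiple of the tile size.
        const int k1 = std::min(k0 + tk, nk);
        const int j1 = std::min(j0 + tj, nj);
        const int i1 = std::min(i0 + ti, ni);

        // Inner loops over the points of this tile only.
        for (int k = k0; k < k1; ++k)
        for (int j = j0; j < j1; ++j)
        for (int i = i0; i < i1; ++i) {
            const std::size_t idx =
                i + static_cast<std::size_t>(ni) * (j + static_cast<std::size_t>(nj) * k);
            sum += u[idx];
        }
    }
    return sum;
}
```

In this sketch only the traversal order changes: the array itself stays in its 
usual layout, and a tile is just the set of points one thread works on at a 
time.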

BTW, at the moment I am using this macro for all of my loop needs:

#define UTILS_LOOP3(NAME,I,SI,EI,J,SJ,EJ,K,SK,EK)                              \
    _Pragma("omp for collapse(3)")                                             \
    for(int I = SI; I < EI; ++I)                                               \
    for(int J = SJ; J < EJ; ++J)                                               \
    for(int K = SK; K < EK; ++K)

How would I convert it to something equivalent using LoopControl?



PS. Since Eloisa was able to compile bbox.cc with intel-17.0.0 using 
-no-vec, I made a patch that disables vectorization with pragmas inside 
bbox.cc (to avoid having to compile it by hand):

