pengzhao-intel commented on a change in pull request #9958: Parallelization for ROIpooling OP
URL: https://github.com/apache/incubator-mxnet/pull/9958#discussion_r172045940
 
 

 ##########
 File path: src/operator/roi_pooling.cc
 ##########
 @@ -74,7 +79,13 @@ inline void ROIPoolForward(const Tensor<cpu, 4, Dtype> &out,
 
     const Dtype* batch_data = bottom_data + data_size * roi_batch_ind;
 
+    #pragma omp parallel for firstprivate(batch_data, top_data, argmax_data)
     for (int c = 0; c < channels_; ++c) {
+      // Increment all data pointers
+      const Dtype* batch_data_c = batch_data + c * data_size_c;
+      Dtype* top_data_c = top_data + c * out_size_c;
+      Dtype* argmax_data_c = argmax_data + c * max_idx_size_c;
+
 
 Review comment:
   @cjolivier01 Thanks for the great comments to Xinyu.
   Regarding replacing incremental addition (+=) with multiplication (FMA), two points from my view:
   
   1) The incremental addition is more concise and logical than recomputing the index from the start each time by multiplication. But it has a strong compute dependency: we **CAN'T** start iteration N+1 before iteration N finishes.
   Thus, we have to change to the multiplication style for the parallelization.
   
   2) The efficiency of multiplication (FMA) and addition (ADD) is the same on the latest hardware (Intel Skylake).
   Take SSE instructions as an example:
   ADD: latency 4, CPI 0.5;
   
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=add&techs=SSE&expand=127,3673,127
   FMA (MUL+ADD): latency 4, CPI 0.5;
   
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=127,3673,2395,3673,2395,2407&text=mul&techs=FMA,Other
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
