pengzhao-intel commented on a change in pull request #9958: Parallelization for ROIpooling OP
URL: https://github.com/apache/incubator-mxnet/pull/9958#discussion_r172045940
########## File path: src/operator/roi_pooling.cc ##########
@@ -74,7 +79,13 @@ inline void ROIPoolForward(const Tensor<cpu, 4, Dtype> &out,
   const Dtype* batch_data = bottom_data + data_size * roi_batch_ind;
+  #pragma omp parallel for firstprivate(batch_data, top_data, argmax_data)
   for (int c = 0; c < channels_; ++c) {
+    // Increment all data pointers
+    const Dtype* batch_data_c = batch_data + c * data_size_c;
+    Dtype* top_data_c = top_data + c * out_size_c;
+    Dtype* argmax_data_c = argmax_data + c * max_idx_size_c;
+

Review comment:
@cjolivier01 Thanks for the great comments to Xinyu. Regarding replacing multiplication (FMA) with incremental addition (+=), two points from my view:

1) Incremental addition is more concise and logical than recomputing each index from the start pointer by multiplication. However, it creates a strong compute dependency: iteration N+1 **CANNOT** start before iteration N finishes. We therefore have to switch to the multiplication style so the loop can be parallelized.

2) The efficiency of multiplication (FMA) and addition (ADD) is the same on the latest hardware (Intel Skylake). Taking SSE instructions as an example:
ADD: latency 4, CPI 0.5
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=add&techs=SSE&expand=127,3673,127
FMA (MUL+ADD): latency 4, CPI 0.5
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=127,3673,2395,3673,2395,2407&text=mul&techs=FMA,Other
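To illustrate point 1, here is a minimal sketch (not the actual MXNet code; the function names and the trivial per-channel work are made up for illustration) contrasting the two styles. The incremental version advances a pointer each iteration, so iteration `c+1` depends on iteration `c`; the multiplicative version computes each channel's offset independently from `c`, which is what allows the `#pragma omp parallel for` in the diff above:

```cpp
#include <cassert>
#include <vector>

// Incremental style: the pointer advance `p += data_size_c` is a
// loop-carried dependency, so the iterations cannot run in parallel.
float sum_incremental(const float* data, int channels, int data_size_c) {
  float total = 0.f;
  const float* p = data;
  for (int c = 0; c < channels; ++c) {
    total += p[0];       // touch the first element of each channel
    p += data_size_c;    // iteration c+1 cannot start before iteration c
  }
  return total;
}

// Multiplicative style: each iteration derives its pointer from `c`
// alone, so OpenMP is free to split the loop across threads.
float sum_multiplicative(const float* data, int channels, int data_size_c) {
  float total = 0.f;
  #pragma omp parallel for reduction(+ : total)
  for (int c = 0; c < channels; ++c) {
    const float* p = data + c * data_size_c;  // independent per iteration
    total += p[0];
  }
  return total;
}
```

Both functions compute the same result; only the multiplicative form exposes the iterations as independent work, which is the trade-off the review comment describes (and point 2 argues the extra multiply costs nothing on Skylake-class hardware).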