On 01/25/2018 15:37 UTC, Wuweijia wrote:
Function1:
bool CDynamicScheduling::GetProcLoop( int& nBegin, int& nEndPlusOne)
{
    int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
How large is 'm_nStep'? [Are you sure?] Making m_nStep as large as possible would reduce the overhead of switching threads under valgrind. It looks like the code in Function2 would produce the same values regardless of m_nStep.
    if (curr > m_nEnd)
    {
        return false;
    }
    nBegin = curr;
    int limit = m_nEnd + 1;
Local variable 'limit' is unused. By itself this is unimportant, but it might be a clue to something that is not shown here.
    nEndPlusOne = curr + m_nStep;
    return true;
}

Function2:
    ....
    int beginY, endY;
    while (pDS->GetProcLoop(beginY, endY)){
        for (y = beginY; y < endY; y++){
            for(x = 0; x < dstWDiv2-7; x+=8){
                vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
                vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
I hope the actual source contains a comment such as:
    Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
                vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
            }
            for(; x < dstWDiv2; x++){
                pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] + pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
            }
        }
    }
    return;
}