Hi:

> How large is 'm_nStep'?  [Are you sure?]
The source is below; all of the members are integers. Do you need the actual value?

class CDynamicScheduling
{
public:
    static const int m_nDefaultStepUnit;
    static const int m_nDefaultStepFactor;

private:
    int m_nBegin;
    int m_nEnd;
    int m_nStep;

#if defined(_MSC_VER)
    std::atomic<int> m_nCurrent;
#else
    int m_nCurrent;
#endif

> I hope the actual source contains a comment such as:
> Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].

Yes, you are right. It just computes the average of 2x2 blocks.
I showed you only the AArch64 NEON code. This is the same function, implemented for x86:

    UINT16 *pDstL;
    UINT16 *pSrcL;
    INT32 dstWDiv2 = srcW >> 1;
    // INT32 dstHDiv2 = srcH >> 1;
    INT32 x, y;
    INT32 posDst, posSrc;

    pSrcL = pSrc;
    pDstL = pDst;

    int beginY, endY;
    while (pDS->GetProcLoop(beginY, endY))
    {
        // for (y = 0; y < dstHDiv2; y++)
        for (y = beginY; y < endY; y++)
        {
            for (x = 0; x < dstWDiv2; x++)
            {
                posDst = y*dstStride + x;
                posSrc = (y<<1)*srcStride + (x<<1);
                pDstL[posDst] = (pSrcL[posSrc] + pSrcL[posSrc + 1] +
                                 pSrcL[posSrc+srcStride] + pSrcL[posSrc+srcStride + 1] + 2) >> 2;
            }
        }
    }

pSrc is the image buffer, about 11M.
Width: 3968
Height: 2976
srcStride: 3968

This means that four threads compute the averages of the 2x2 blocks. pSrc is divided into many small pieces, and the threads compute the averages for each piece. How the pieces are shared out is not fixed by design; it depends on the state of the running threads. A thread that holds the CPU computes more pieces, and a thread that does not hold the CPU computes fewer.

BR
Owen

-----Original Message-----
From: John Reiser [mailto:jrei...@bitwagon.com]
Sent: January 26, 2018 12:44
To: valgrind-users@lists.sourceforge.net
Subject: Re: [Valgrind-users] Re: Re: Re: [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
On 01/25/2018 15:37 UTC, Wuweijia wrote:

> Function1:
> bool CDynamicScheduling::GetProcLoop(
>     int& nBegin,
>     int& nEndPlusOne)
> {
>     int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);

How large is 'm_nStep'?  [Are you sure?]  The overhead expense of switching threads
in valgrind would be reduced by making m_nStep as large as possible.  It looks like
the code in Function2 would produce the same values regardless.

>     if (curr > m_nEnd)
>     {
>         return false;
>     }
>
>     nBegin = curr;
>     int limit = m_nEnd + 1;

Local variable 'limit' is unused.  By itself this is unimportant, but it might be
a clue to something that is not shown here.

>     nEndPlusOne = curr + m_nStep;
>     return true;
> }
>
>
> Function2:
> ....
>     int beginY, endY;
>     while (pDS->GetProcLoop(beginY, endY)){
>         for (y = beginY; y < endY; y++){
>             for(x = 0; x < dstWDiv2-7; x+=8){
>                 vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
>                 vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);

I hope the actual source contains a comment such as:
    Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].

>                 vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] +
> vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
>             }
>             for(; x < dstWDiv2; x++){
>                 pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] +
> pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] +
> pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
>             }
>         }
>     }
>
>     return;
> }

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org!
http://sdm.link/slashdot
_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users
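
The chunked scheduling discussed in this thread can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the poster's actual code: the names ChunkDispenser, next, downsample2x2 and runWorker are hypothetical, the row range is assumed to be half-open, and the end of the last chunk is clamped (the quoted GetProcLoop computes 'limit' but never uses it, so whether the original clamps is unknown). The scalar loop follows the x86 version quoted in the thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for CDynamicScheduling: hands out half-open
// row ranges [begin, endPlusOne) of at most `step` rows each.
class ChunkDispenser {
public:
    ChunkDispenser(int begin, int endPlusOne, int step)
        : current_(begin), endPlusOne_(endPlusOne), step_(step) {
        assert(step > 0);
    }

    // Returns false once every row has been handed out.
    bool next(int& begin, int& endPlusOne) {
        // fetch_add returns the previous value, so concurrent callers
        // always receive distinct starting rows.
        int curr = current_.fetch_add(step_, std::memory_order_relaxed);
        if (curr >= endPlusOne_) {
            return false;
        }
        begin = curr;
        // Assumption: clamp so the last chunk never runs past the image.
        endPlusOne = (curr + step_ < endPlusOne_) ? curr + step_ : endPlusOne_;
        return true;
    }

private:
    std::atomic<int> current_;
    const int endPlusOne_;
    const int step_;
};

// Compute dst[] as the rounded average of non-overlapping 2x2 blocks of
// pixels in src[], for destination rows [beginY, endY).
void downsample2x2(const std::uint16_t* src, int srcStride,
                   std::uint16_t* dst, int dstStride,
                   int dstW, int beginY, int endY) {
    for (int y = beginY; y < endY; ++y) {
        const std::uint16_t* row0 = src + (2 * y) * srcStride;
        const std::uint16_t* row1 = row0 + srcStride;
        for (int x = 0; x < dstW; ++x) {
            // Operands promote to int, so the four-pixel sum cannot overflow.
            unsigned sum = row0[2 * x] + row0[2 * x + 1] +
                           row1[2 * x] + row1[2 * x + 1] + 2u;
            dst[y * dstStride + x] = static_cast<std::uint16_t>(sum >> 2);
        }
    }
}

// Body that each worker thread would run: keep taking chunks until none remain.
void runWorker(ChunkDispenser& ds,
               const std::uint16_t* src, int srcStride,
               std::uint16_t* dst, int dstStride, int dstW) {
    int beginY = 0, endY = 0;
    while (ds.next(beginY, endY)) {
        downsample2x2(src, srcStride, dst, dstStride, dstW, beginY, endY);
    }
}
```

Each call to next costs one atomic operation, so a larger step means fewer calls and fewer points at which threads contend. For the image in this thread (2976 / 2 = 1488 output rows), a step of 372 would give each of four threads a single chunk, at the cost of the load balancing that smaller chunks provide.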