Re: [x265] [PATCH] SAO: no need to unroll chroma to avoid increased code size

chen Wed, 17 Feb 2016 01:41:15 -0800


在 2016-02-17 17:32:23，"Ashok Kumar Mishra" <as...@multicorewareinc.com> 写道：





On Wed, Feb 17, 2016 at 2:42 PM, chen <chenm...@163.com> wrote:




在 2016-02-17 16:40:47，"Ashok Kumar Mishra" <as...@multicorewareinc.com> 写道：

On Wed, Feb 17, 2016 at 9:28 AM, chen <chenm...@163.com> wrote:



At 2016-02-16 23:12:02,as...@multicorewareinc.com wrote:
># HG changeset patch
># User Ashok Kumar Mishra<as...@multicorewareinc.com>
># Date 1455633570 -19800
>#      Tue Feb 16 20:09:30 2016 +0530
># Node ID 36751a3dce37e4f506f4bdec12e20ef665b42012
># Parent  33b02e2af5a4b38cd54d3f94f163aae640855dbf
>SAO: no need to unroll chroma to avoid increased code size
>
>diff -r 33b02e2af5a4 -r 36751a3dce37 source/encoder/sao.cpp
>--- a/source/encoder/sao.cpp   Tue Feb 16 20:09:30 2016 +0530
>+++ b/source/encoder/sao.cpp   Tue Feb 16 20:09:30 2016 +0530
>@@ -113,7 +113,6 @@
>         m_clipTable = &(m_clipTableBase[rangeExt]);
> 
>         // Share with fast clip lookup table
>-
>         for (int i = 0; i < rangeExt; i++)
>             m_clipTableBase[i] = 0;
> 
>@@ -638,13 +637,8 @@
> {
>     PicYuv* reconPic = m_frame->m_reconPic;
>     intptr_t stride = reconPic->m_strideC;
>-    int ctuWidth  = g_maxCUSize;
>-    int ctuHeight = g_maxCUSize;
>-
>-    {
>-        ctuWidth  >>= m_hChromaShift;
>-        ctuHeight >>= m_vChromaShift;
>-    }
>+    int ctuWidth  = g_maxCUSize >> m_hChromaShift;
>+    int ctuHeight = g_maxCUSize >> m_vChromaShift;
> 
>     int addr = idxY * m_numCuInWidth + idxX;
>     pixel* recCb = reconPic->getCbAddr(addr);
>@@ -652,88 +646,53 @@
> 
>     if (idxX == 0)
>     {
>-        for (int i = 0; i < ctuHeight + 1; i++)
>+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += 
>stride)


this style is not good to readable

First of all I couldn't understand why it is not good to readable. I believe in 
two things, the code should be good to readable, i.e., not complex so that in 
future code maintenance  will be easier
and at the same time the code should be compact. We should look into the code 
size since its impact is much larger in embedded systems.
Since we are working in x86, we may not be feeling it's impact. Yes I agree 
that there are certain cases like unrolling a loop to do same operations will 
give good performance. But here this is not the case.
Here I feel it is completely wrong to write separate code for Cb and Cr. It is 
increasing the code size, which is avoidable.


Still there are some unnecessary copy operations are there below if(typeIdx >= 
0) condition check.
I will clean this and other parts in code in my next patches.



In here, the for() mixed with control code, it affect readable on code.

For embedded system, code size is not a big problem, the performance depends on 
CPU, e.g. TI DSP jump instruction need 6 cycles, we may generate up to 48 
instructions during that period.; the new ARM cortex have 13-stages pipeline 
with out-of-order execute feature,  unroll loop may hidden another Chroma 
operators in these cycles.
Of course, I don't care change to loop, the Chroma is not bottleneck.



Yes code size is a problem as per my understanding, since the on-chip program 
memory size is very limited on TI DSP as well as other processors.

We should not unroll the code and make it reasonably high where it is not 
necessary at all as it is in this part of code where we are not going
to achieve any performance by simply unrolling the code. Just imagine one 
scenario if we write separate code for each components  Y, Cb and Cr in each
and every part of encoder, how much the code size it be !!!
unless you may make encoder code & data less than 32KB, otherwise we have to 
use external memory to store code.



>         { >             m_tmpL1[1][i] = recCb[0]; >             m_tmpL1[2][i] 
> = recCr[0]; >-            recCb += stride; >-            recCr += stride; >   
>       } >     } >  >-    bool mergeLeftFlagCb = (ctuParam[1][addr].mergeMode 
> == SAO_MERGE_LEFT); >-    int typeIdxCb = ctuParam[1][addr].typeIdx;

in here, no multiplication  operator to access typeIdx, see below

>-
>-    bool mergeLeftFlagCr = (ctuParam[2][addr].mergeMode == SAO_MERGE_LEFT);
>-    int typeIdxCr = ctuParam[2][addr].typeIdx;
>-
>     if (idxX != (m_numCuInWidth - 1))
>     {
>         recCb = reconPic->getCbAddr(addr);
>         recCr = reconPic->getCrAddr(addr);
>-        for (int i = 0; i < ctuHeight + 1; i++)
>+        for (int i = 0; i < ctuHeight + 1; i++, recCb += stride, recCr += 
>stride)
>         {
>             m_tmpL2[1][i] = recCb[ctuWidth - 1];
>             m_tmpL2[2][i] = recCr[ctuWidth - 1];
>-            recCb += stride;
>-            recCr += stride;
>         }
>     }
> 
>-    // Process U
>-    if (typeIdxCb >= 0)
>+    for (int plane = 1; plane < 3; plane++)
>     {
>-        if (!mergeLeftFlagCb)
>+        int typeIdx = ctuParam[plane][addr].typeIdx;
here necessary multiplication on array index access


>+        if (typeIdx >= 0)
>         {
>-            if (typeIdxCb == SAO_BO)
>+            if (ctuParam[plane][addr].mergeMode != SAO_MERGE_LEFT)
>             {
>-                memset(m_offsetBo[1], 0, sizeof(m_offsetBo[0]));
>+                if (typeIdx == SAO_BO)
>+                {
>+                    memset(m_offsetBo[plane], 0, sizeof(m_offsetBo[0]));
> 
>-                for (int i = 0; i < SAO_NUM_OFFSET; i++)
>-                    m_offsetBo[1][((ctuParam[1][addr].bandPos + i) & 
>(SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[1][addr].offset[i] << 
>SAO_BIT_INC);
>-            }
>-            else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx 
>== SAO_EO_2 || typeIdx == SAO_EO_3)
>-            {
>-                int offset[NUM_EDGETYPE];
>-                offset[0] = 0;
>-                for (int i = 0; i < SAO_NUM_OFFSET; i++)
>-                    offset[i + 1] = ctuParam[1][addr].offset[i] << 
>SAO_BIT_INC;
>+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)
>+                        m_offsetBo[plane][((ctuParam[plane][addr].bandPos + 
>i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[plane][addr].offset[i] << 
>SAO_BIT_INC);
>+                }
>+                else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || 
>typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
>+                {
>+                    int offset[NUM_EDGETYPE];
>+                    offset[0] = 0;
>+                    for (int i = 0; i < SAO_NUM_OFFSET; i++)
>+                        offset[i + 1] = ctuParam[plane][addr].offset[i] << 
>SAO_BIT_INC;

the loop just 4 times, we may merge with below loop

> 
>-                for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++)
>-                    m_offsetEo[1][edgeType] = 
>(int8_t)offset[s_eoTable[edgeType]];
>+                    for (int edgeType = 0; edgeType < NUM_EDGETYPE; 
>edgeType++)
>+                        m_offsetEo[plane][edgeType] = 
>(int8_t)offset[s_eoTable[edgeType]];
>+                }
>             }
>         }
>-        processSaoCu(addr, typeIdxCb, 1);
>+        processSaoCu(addr, typeIdx, plane);
>+        std::swap(m_tmpL1[plane], m_tmpL2[plane]);
>     }



_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel





_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel

_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel

Re: [x265] [PATCH] SAO: no need to unroll chroma to avoid increased code size

Reply via email to