Re: [x265] [PATCH] encoder: Do not include CLL SEI message if empty

2018-11-06 Thread Praveen Tiwari
Hello Vittorio,

Sorry for the late reply; all of us were on leave due to the Diwali
festival in India.

Thanks for the patch. I will run some basic tests and then push it.

Regards,
Praveen
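
The quoted patch guards the CLL SEI with m_emitCLLSEI but does not show how
that flag is computed. A minimal sketch of one plausible rule, assuming the
message counts as "empty" when both light-level values are zero. The struct
and function names are hypothetical and are not part of x265:

```cpp
#include <cstdint>

// Hypothetical stand-in for the two x265_param fields used in the patch.
struct HdrLightLevels
{
    uint16_t maxCLL;   // maximum content light level
    uint16_t maxFALL;  // maximum frame-average light level
};

// Assumed rule (not shown in the patch): emit the CLL SEI only when at
// least one of the two values has been set.
static bool shouldEmitCllSei(const HdrLightLevels& levels)
{
    return levels.maxCLL != 0 || levels.maxFALL != 0;
}
```

With such a rule, a stream encoded without any light-level values would carry
no CLL SEI at all instead of one filled with zeros.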


On Wed, Nov 7, 2018 at 12:35 AM Vittorio Giovara 
wrote:

>
>
> On Thu, Nov 1, 2018 at 5:34 PM Vittorio Giovara <
> vittorio.giov...@gmail.com> wrote:
>
>> Some devices render out-of-luminance pixels incorrectly otherwise.
>>
>> ---
>>  source/encoder/encoder.cpp | 11 +++
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff -r fd517ae68f93 source/encoder/encoder.cpp
>> --- a/source/encoder/encoder.cpp Tue Sep 25 16:02:31 2018 +0530
>> +++ b/source/encoder/encoder.cpp Thu Nov 01 17:27:51 2018 -0400
>> @@ -2381,10 +2381,13 @@
>>
>>  if (m_param->bEmitHDRSEI)
>>  {
>> -SEIContentLightLevel cllsei;
>> -cllsei.max_content_light_level = m_param->maxCLL;
>> -cllsei.max_pic_average_light_level = m_param->maxFALL;
>> -cllsei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI, list,
>> m_param->bSingleSeiNal);
>> +if (m_emitCLLSEI)
>> +{
>> +SEIContentLightLevel cllsei;
>> +cllsei.max_content_light_level = m_param->maxCLL;
>> +cllsei.max_pic_average_light_level = m_param->maxFALL;
>> +cllsei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI,
>> list, m_param->bSingleSeiNal);
>> +}
>>
>>  if (m_param->masteringDisplayColorVolume)
>>  {
>> --
>> Vittorio
>>
>
> ping
> --
> Vittorio
> ___
> x265-devel mailing list
> x265-devel@videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>


Re: [x265] [PATCH] fix Issue #442: linking issue on non x86 platform

2018-10-31 Thread Praveen Tiwari
Thanks! I messed up the syntax.

On Wed, Oct 31, 2018 at 5:45 PM Andrey Semashev 
wrote:

> On 10/31/18 2:33 PM, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari 
> > # Date 1540983948 -19800
> > #  Wed Oct 31 16:35:48 2018 +0530
> > # Node ID 1c878790edea64186edabcd40fb3df121f536311
> > # Parent  fd517ae68f93dbfdd1bff45a9dd8e626523542b6
> > fix Issue #442: linking issue on non x86 platform
> >
> > diff -r fd517ae68f93 -r 1c878790edea source/common/cpu.cpp
> > --- a/source/common/cpu.cpp   Tue Sep 25 16:02:31 2018 +0530
> > +++ b/source/common/cpu.cpp   Wed Oct 31 16:35:48 2018 +0530
> > @@ -127,6 +127,7 @@
> >   {
> >   return(enable512);
> >   }
> > +
> >   uint32_t cpu_detect(bool benableavx512 )
> >   {
> >
> > diff -r fd517ae68f93 -r 1c878790edea source/common/quant.cpp
> > --- a/source/common/quant.cpp Tue Sep 25 16:02:31 2018 +0530
> > +++ b/source/common/quant.cpp Wed Oct 31 16:35:48 2018 +0530
> > @@ -723,6 +723,7 @@
> >   X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff
> failure\n");
> >   uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
> >   uint32_t blkPos  = codeParams.scan[scanPosBase];
> > +#if X265_ARCH_X86
> >   bool enable512 = detect512();
> >   if (enable512)
> >   primitives.cu[log2TrSize -
> 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded,
> , , , blkPos);
> > @@ -731,6 +732,10 @@
> >   primitives.cu[log2TrSize -
> 2].psyRdoQuant_1p(m_resiDctCoeff,  costUncoded, ,
> ,blkPos);
> >   primitives.cu[log2TrSize -
> 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded,
> , , , blkPos);
> >   }
> > +#elif
>
> #else? Everywhere else, too.
>
> > +primitives.cu[log2TrSize -
> 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, ,
> , blkPos);
> > +primitives.cu[log2TrSize -
> 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded,
> , , , blkPos);
> > +#endif
> >   }
> >   }
> >   else
> > @@ -805,8 +810,8 @@
> >   uint32_t blkPos = codeParams.scan[scanPosBase];
> >   if (usePsyMask)
> >   {
> > +#if X265_ARCH_X86
> >   bool enable512 = detect512();
> > -
> >   if (enable512)
> >   primitives.cu[log2TrSize -
> 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded,
> , , , blkPos);
> >   else
> > @@ -814,6 +819,10 @@
> >   primitives.cu[log2TrSize -
> 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, ,
> , blkPos);
> >   primitives.cu[log2TrSize -
> 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded,
> , , , blkPos);
> >   }
> > +#elif
> > +primitives.cu[log2TrSize -
> 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, ,
> , blkPos);
> > +primitives.cu[log2TrSize -
> 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded,
> , , , blkPos);
> > +#endif
> >   blkPos = codeParams.scan[scanPosBase];
> >   for (int y = 0; y < MLS_CG_SIZE; y++)
> >   {
> >
> >
> >
>


Re: [x265] Original C++ code used for sad functions' assembly code in COST_MV?

2018-09-05 Thread Praveen Tiwari
Hello Jeffrey,

You can find all the C primitives in the source/common folder.

The SAD C primitives are in source/common/pixel.cpp.


Thanks,
Praveen
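
For readers porting to a platform without the assembly kernels, a C reference
for SAD (sum of absolute differences) generally looks like the sketch below.
This is written from the definition of SAD, not copied from pixel.cpp; the
name sadRef and the 8-bit pixel typedef are assumptions:

```cpp
#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel;  // assumption: 8-bit build

// Sum of absolute differences over an lx-by-ly block (plain C++ reference).
template<int lx, int ly>
int sadRef(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
            sum += std::abs(pix1[x] - pix2[x]);  // operands promote to int
        pix1 += stride1;  // advance both blocks by one row
        pix2 += stride2;
    }
    return sum;
}
```

Instantiating the template per block size (for example sadRef<8, 8>) lets the
compiler unroll the loops, which is the usual reason for making the block
dimensions template parameters.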

On Wed, Sep 5, 2018 at 12:23 PM, Mario *LigH* Rohkrämer 
wrote:

> Jeffrey Chen wrote on 04.09.2018 at 23:57:
>
>> Hi, I would like to configure the sad function in COST_MV for another
>> platform. However, the assembly code would not be supported on the other
>> platform. Where can I find the original programming language code that was
>> made into the assembly language code?
>>
>
> Hi Jeffrey.
>
> I'm not a developer, just guessing:
>
> source/encoder/motion.cpp line 234 #defines a loop.


Re: [x265] Code performance issue

2018-06-04 Thread Praveen Tiwari
Hello Min,

Thanks for the suggestion, we will run some tests and let you know if any
change is required here. Thanks.


Regards,
Praveen Tiwari
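
The suggestion quoted below amounts to replacing a floating-point pow() call
with an integer multiply. A minimal sketch, assuming outOfBound is an
unsigned 32-bit value (its real type is not shown in the thread) and using a
hypothetical helper name:

```cpp
#include <cstdint>

// Integer-only replacement for (uint32_t)pow((outOfBound >> 2), 2).
static inline uint32_t squareOfQuarter(uint32_t outOfBound)
{
    const uint32_t quarter = outOfBound >> 2;
    return quarter * quarter;  // wraps modulo 2^32 for very large inputs
}
```

This avoids the int-to-double conversion, the library call and the conversion
back, at the cost of different overflow behaviour for inputs whose square does
not fit in 32 bits.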



On Sat, Jun 2, 2018 at 9:18 AM, chen  wrote:

> There is a series of performance issues, for example:
>
> uint32_t sum = (uint32_t)pow((outOfBound >> 2), 2);
>
> Do you just want the square of a small integer?
>
>


Re: [x265] [PATCH] threadpool.cpp: use WIN system call for popcount

2018-05-03 Thread Praveen Tiwari
It is only counting cpusPerNode, so a 64-bit number is not required; but yes,
I missed the fact that the instruction is not supported on all CPUs. A
lookup-table-based implementation could have been the fastest thanks to better
caching, but this code is not called frequently, so we can keep it as it is. Thanks.
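
To make the 64-bit concern raised in the quoted review concrete, here is a
portable sketch of the bit-trick popcount that the patch removes, together
with a helper showing what a 32-bit count sees once the mask has been
narrowed. The helper name popCountLow32 is hypothetical, and the constants are
the standard ones for this algorithm:

```cpp
#include <cstdint>

// Portable 64-bit population count (the classic parallel bit-summing trick).
static int popCount64(uint64_t x)
{
    const uint64_t m1  = 0x5555555555555555ULL; // binary: 0101...
    const uint64_t m2  = 0x3333333333333333ULL; // binary: 00110011...
    const uint64_t m4  = 0x0f0f0f0f0f0f0f0fULL; // binary: 4 zeros, 4 ones...
    const uint64_t h01 = 0x0101010101010101ULL; // multiply sums all bytes
    x -= (x >> 1) & m1;              // count of each 2 bits
    x = (x & m2) + ((x >> 2) & m2);  // count of each 4 bits
    x = (x + (x >> 4)) & m4;         // count of each 8 bits
    return static_cast<int>((x * h01) >> 56);
}

// What a 32-bit popcount reports after the affinity mask is truncated.
static int popCountLow32(uint64_t mask)
{
    return popCount64(static_cast<uint32_t>(mask));
}
```

For a node whose mask has bits set above bit 31, the two functions disagree,
which is the situation the review warns about on 64-bit Windows.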

On Thu, May 3, 2018 at 11:24 PM, Andrey Semashev <andrey.semas...@gmail.com>
wrote:

> On Thu, May 3, 2018 at 7:37 PM, Pradeep Ramachandran
> <prad...@multicorewareinc.com> wrote:
> >
> > On Thu, May 3, 2018 at 2:23 PM, <prav...@multicorewareinc.com> wrote:
> >>
> >> # HG changeset patch
> >> # User Praveen Tiwari <prav...@multicorewareinc.com>
> >> # Date 1525328839 -19800
> >> #  Thu May 03 11:57:19 2018 +0530
> >> # Branch stable
> >> # Node ID 9cbb2aadcca3a2f7a308ea1dc792fb817bcc5b51
> >> # Parent  69aafa6d70ad4e151f4590766c6b125621c5d007
> >> threadpool.cpp: use WIN system call for popcount
> >
> >
> > Unless this fixes a known bug, I don't want to push this directly into
> > stable. Syscalls are notorious especially when working with older
> versions
> > of the OS.
> > I would rather push this into default and allow users to test that this
> > works with all kinds of systems and then merge with stable once the
> answer
> > is known.
> > Does this fix a specific issue on some platform, or improve performance?
>
> The comment is not quite right, __popcnt is not a syscall but an
> MSVC-specific intrinsic.
>
> https://msdn.microsoft.com/en-us/library/bb385231.aspx
>
> The equivalent gcc intrinsic is __builtin_popcount and friends.
>
> I think, the patch is buggy because the relevant field is a 64-bit
> integer on 64-bit Windows and __popcnt is 32-bit.
>
> Note also that the popcount instruction is only available in the ABM ISA
> extension. In Intel CPUs it has been available since Nehalem.
>
> >> diff -r 69aafa6d70ad -r 9cbb2aadcca3 source/common/threadpool.cpp
> >> --- a/source/common/threadpool.cpp  Wed May 02 15:15:05 2018 +0530
> >> +++ b/source/common/threadpool.cpp  Thu May 03 11:57:19 2018 +0530
> >> @@ -71,21 +71,6 @@
> >>  # define strcasecmp _stricmp
> >>  #endif
> >>
> >> -#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
> >> -const uint64_t m1 = 0x5555555555555555; //binary: 0101...
> >> -const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
> >> -const uint64_t m3 = 0x0f0f0f0f0f0f0f0f; //binary:  4 zeros,  4 ones ...
> >> -const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...
> >> -
> >> -static int popCount(uint64_t x)
> >> -{
> >> -x -= (x >> 1) & m1;
> >> -x = (x & m2) + ((x >> 2) & m2);
> >> -x = (x + (x >> 4)) & m3;
> >> -return (x * h01) >> 56;
> >> -}
> >> -#endif
> >> -
> >>  namespace X265_NS {
> >>  // x265 private namespace
> >>
> >> @@ -274,7 +259,7 @@
> >>  for (int i = 0; i < numNumaNodes; i++)
> >>  {
> >>  GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer);
> >> -cpusPerNode[i] = popCount(groupAffinityPointer->Mask);
> >> +cpusPerNode[i] = __popcnt(static_cast<unsigned int>(groupAffinityPointer->Mask));
> >>  }
> >>  delete groupAffinityPointer;
> >>  #elif HAVE_LIBNUMA
> >> @@ -623,7 +608,7 @@
> >>  for (int i = 0; i < numNumaNodes; i++)
> >>  {
> >>  GetNumaNodeProcessorMaskEx((UCHAR)i, &groupAffinity);
> >> -cpus += popCount(groupAffinity.Mask);
> >> +cpus += __popcnt(static_cast<unsigned int>(groupAffinity.Mask));
> >>  }
> >>  return cpus;
> >>  #elif _WIN32


Re: [x265] [PATCH 000 of 307 ] AVX-512 implementation in x265: breaks 32-bit compilation

2018-04-11 Thread Praveen Tiwari
Thanks for reporting. We are looking into the issue and will send a fix soon.

Regards,
Praveen Tiwari

On Thu, Apr 12, 2018 at 2:31 AM, Mario Rohkrämer <cont...@ligh.de> wrote:

> On 07.04.2018 at 04:29, <mythr...@multicorewareinc.com> wrote:
>
>> This series of patches enables AVX-512 in x265. Use the CLI option --asm
>> avx512 to enable AVX-512 kernels.
>>
>
>
> Compiling x265 for Win32 target (here in MSYS2/MinGW32) is not possible
> anymore.
>
> Assembler code was still available for 8-bit depth core, at least. But:
>
> +
> [ 13%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/pixel-util8.asm.obj
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1867: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1880: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1880: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1880: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1880: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1941: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> H:/development/media-autobuild_suite-master/build/x265-hg/
> source/common/x86/pixel-util8.asm:1954: error: invalid combination of
> opcode and operands
> make[2]: *** [common/CMakeFiles/common.dir/build.make:159:
> common/CMakeFiles/common.dir/x86/pixel-util8.asm.obj] Error 1
> make[1]: *** [CMakeFiles/Makefile2:449: common/CMakeFiles/common.dir/all]
> Error 2
> make: *** [Makefile:130: all] Error 2
> +
>
> Trying to compile AVX-512 instructions may have to be avoided in 32-bit
> architecture mode (because there is surely no 32-bit only CPU supporting
> this instruction set extension).
>
> --
>
> Fun and success!
>

Re: [x265] [PATCH 000 of 307 ] AVX-512 implementation in x265

2018-04-06 Thread Praveen Tiwari
We are working on your request and will share the performance details
soon. Thanks.

Regards,
Praveen Tiwari

On Fri, Apr 6, 2018 at 9:36 PM, Vittorio Giovara <vittorio.giov...@gmail.com> wrote:

> just curious, what kind of general speed improvement does this give?
> I could have missed them in the series, but it would be nice to have some
> sort of benchmarks
> thanks
> Vittorio
>
> On Sat, Apr 7, 2018 at 4:29 AM, <mythr...@multicorewareinc.com> wrote:
>
>> This series of patches enables AVX-512 in x265. Use the CLI option --asm
>> avx512 to enable AVX-512 kernels.
>>
>
>
>
> --
> Vittorio
>


Re: [x265] [PATCH] quant.cpp: 'rdoQuant_c' primitive for SIMD optimization

2017-11-27 Thread Praveen Tiwari
Please ignore this patch; I messed up an update. I will resend it soon. Thanks.

On Mon, Nov 27, 2017 at 5:11 PM, <prav...@multicorewareinc.com> wrote:

> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1511167656 -19800
> #  Mon Nov 20 14:17:36 2017 +0530
> # Node ID dffb056e5ad0e2298b0dd65d048f4f16d8508566
> # Parent  b24454f3ff6de650aab6835e291837fc4e2a4466
> quant.cpp: 'rdoQuant_c' primitive for SIMD optimization
>
> This particular section of code appears to be a bottleneck in many profiles, as it
> involves 64-bit multiplication operations. For SIMD optimization we need to convert
> a few buffers/variables to double.
>
> diff -r b24454f3ff6d -r dffb056e5ad0 source/common/dct.cpp
> --- a/source/common/dct.cpp Wed Nov 22 22:00:48 2017 +0530
> +++ b/source/common/dct.cpp Mon Nov 20 14:17:36 2017 +0530
> @@ -984,6 +984,32 @@
>  return (sum & 0x00FF) + (c1 << 26) + (firstC2Idx << 28);
>  }
>
> +void rdoQuant_c(int16_t* m_resiDctCoeff, int16_t* m_fencDctCoeff, double*
> costUncoded, double* totalUncodedCost, double* totalRdCost, int64_t
> psyScale, uint32_t blkPos, uint32_t log2TrSize)
> +{
> +const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH -
> log2TrSize; /* Represents scaling through forward transform */
> +const int scaleBits = SCALE_BITS - 2 * transformShift;
> +const uint32_t trSize = 1 << log2TrSize;
> +int max = X265_MAX(0, (2 * transformShift + 1));
> +
> +for (int y = 0; y < MLS_CG_SIZE; y++)
> +{
> +for (int x = 0; x < MLS_CG_SIZE; x++)
> +{
> +int64_t signCoef = m_resiDctCoeff[blkPos + x];/*
> pre-quantization DCT coeff */
> +int64_t predictedCoef = m_fencDctCoeff[blkPos + x] -
> signCoef; /* predicted DCT = source DCT - residual DCT*/
> +
> +costUncoded[blkPos + x] = static_cast<double>((signCoef * signCoef) << scaleBits);
> +
> +/* when no residual coefficient is coded, predicted coef == recon coef */
> +costUncoded[blkPos + x] -= static_cast<double>((psyScale * (predictedCoef)) >> max);
> +
> +*totalUncodedCost += costUncoded[blkPos + x];
> +*totalRdCost += costUncoded[blkPos + x];
> +}
> +blkPos += trSize;
> +}
> +}
> +
>  namespace X265_NS {
>  // x265 private namespace
>
> @@ -993,6 +1019,7 @@
>  p.dequant_normal = dequant_normal_c;
>  p.quant = quant_c;
>  p.nquant = nquant_c;
> +p.rdoQuant = rdoQuant_c;
>  p.dst4x4 = dst4_c;
>  p.cu[BLOCK_4x4].dct   = dct4_c;
>  p.cu[BLOCK_8x8].dct   = dct8_c;
> diff -r b24454f3ff6d -r dffb056e5ad0 source/common/primitives.h
> --- a/source/common/primitives.h Wed Nov 22 22:00:48 2017 +0530
> +++ b/source/common/primitives.h Mon Nov 20 14:17:36 2017 +0530
> @@ -216,6 +216,7 @@
>
>  typedef void (*integralv_t)(uint32_t *sum, intptr_t stride);
>  typedef void (*integralh_t)(uint32_t *sum, pixel *pix, intptr_t stride);
> +typedef void (*rdoQuant_t)(int16_t* m_resiDctCoeff, int16_t*
> m_fencDctCoeff, double* costUncoded, double* totalUncodedCost, double*
> totalRdCost, int64_t psyScale, uint32_t blkPos, uint32_t log2TrSize);
>
>  /* Function pointers to optimized encoder primitives. Each pointer can
> reference
>   * either an assembly routine, a SIMD intrinsic primitive, or a C
> function */
> @@ -304,6 +305,7 @@
>
>  quant_t   quant;
>  nquant_t  nquant;
> +rdoQuant_t rdoQuant;
>  dequant_scaling_t dequant_scaling;
>  dequant_normal_t  dequant_normal;
>  denoiseDct_t  denoiseDct;
> diff -r b24454f3ff6d -r dffb056e5ad0 source/common/quant.cpp
> --- a/source/common/quant.cpp   Wed Nov 22 22:00:48 2017 +0530
> +++ b/source/common/quant.cpp   Mon Nov 20 14:17:36 2017 +0530
> @@ -663,7 +663,7 @@
>  #define PSYVALUE(rec)   ((psyScale * (rec)) >> X265_MAX(0, (2 *
> transformShift + 1)))
>
>  int64_t costCoeff[trSize * trSize];   /* d*d + lambda * bits */
> -int64_t costUncoded[trSize * trSize]; /* d*d + lambda * 0*/
> +double costUncoded[trSize * trSize]; /* d*d + lambda * 0*/
>  int64_t costSig[trSize * trSize]; /* lambda * bits   */
>
>  int rateIncUp[trSize * trSize];  /* signal overhead of increasing
> level */
> @@ -677,12 +677,12 @@
>  bool bIsLuma = ttype == TEXT_LUMA;
>
>  /* total rate distortion cost of transform block, as CBF=0 */
> -int64_t totalUncodedCost = 0;
> +double totalUncodedCost = 0;
>
>  /* Total rate distortion cost of this transform block, counting te
> di

Re: [x265] [PATCH 2 of 2] x86: Change assembler from YASM to NASM

2017-11-21 Thread Praveen Tiwari
Yes, that's true. Looking at future prospects, we have decided to move
support to NASM. It comes with additional advantages, as Andrey mentioned
above, and we understand the concern about changing the supported assembler;
we will make the transition as smooth as possible. Thanks.

Regards,
Praveen Tiwari


[x265] Fwd: [PATCH] intra: sse4 version of strong intra smoothing

2017-11-20 Thread Praveen Tiwari
-- Forwarded message --
From: chen 
Date: Tue, Nov 21, 2017 at 10:07 AM
Subject: Re: [x265] [PATCH] intra: sse4 version of strong intra smoothing
To: Development for x265 


>diff -r a7c2f80c18af -r 973560d58dfb source/common/x86/intrapred8.asm
>--- a/source/common/x86/intrapred8.asm Mon Nov 20 14:31:22 2017 +0530
>+++ b/source/common/x86/intrapred8.asm Tue Nov 21 03:10:14 2017 +0800
>@@ -22313,11 +22313,144 @@
> mov [r1 + 64], r3b  ; LeftLast
> RET
>
>-INIT_XMM sse4
>-cglobal intra_filter_32x32, 2,4,6
>-mov r2b, byte [r0 +  64]; topLast
>-mov r3b, byte [r0 + 128]; LeftLast
>-
>+; this function add strong intra filter
>+
>+INIT_XMM sse4
>+cglobal intra_filter_32x32, 3,8,7
>+xor r3d, r3d ; R9
>+xor r4d, r4d ; R10
>+mov r3b, byte [r0 +  64] ; topLast
>+mov r4b, byte [r0 + 128] ; LeftLast

xor + mov = movzx. The xor (clear to zero) does not cost a cycle, but it
affects the instruction decode rate.

>+
>+; strong intra filter is disabled
>+cmp r2m, byte 0
>+jz  .normal_filter32
>+; decide to do strong intra filter
>+xor r5d, r5d ; R11
>+xor r6d, r6d ; RAX
>+xor r7d, r7d ; RDI
>+mov r5b, byte [r0]   ; topLeft
>+mov r6b, byte [r0 + 96]  ; leftMiddle
>+mov r7b, byte [r0 + 32]  ; topMiddle
>+
>+; threshold = 8
>+mov r2d, r3d ; R8
>+add r2d, r5d ; (topLast + topLeft)
>+shl r7d, 1   ; 2 * topMiddle
>+sub r2d, r7d
(A+B) - 2 * C  <==> (A-C) + (B-C)

>+mov r7d, r2d ; backup r2d
>+sar r7d, 31
>+xor r2d, r7d
>+sub r2d, r7d ; abs(r2d)
>+cmp r2d, 8
; how about this or instruction cdq?
; abs(x-y)
mov eax, X
sub eax, Y
sub Y, X
cmovg eax, Y


>+; bilinearAbove is false
>+jns .normal_filter32
>+
>+mov r2d, r5d
>+add r2d, r4d
>+shl r6d, 1
>+sub r2d, r6d
>+mov r6d, r2d
>+sar r6d, 31
>+xor r2d, r6d
>+sub r2d, r6d
>+cmp r2d, 8
>+; bilinearLeft is false
>+jns .normal_filter32
>+
>+; do strong intra filter shift = 6
>+mov r2d, r5d
>+shl r2d, 6
>+add r2d, 32  ; init
>+mov r6d, r4d
>+sub r6w, r5w ; deltaL size is word
a partial-register stall may occur here

>+mov r7d, r3d
>+sub r7w, r5w ; deltaR size is word
>+movd xmm0, r2d
>+
>+vpbroadcastw xmm0, xmm0
SSE4?
This is an AVX2 instruction, so the initialization at the top is wrong. We generally
don't prefix registers with xmm or ymm; the native names m0, m1 would be better.


>+mova xmm4, xmm0
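
The two arithmetic points in the review above, the branchless absolute value
(the sar/xor/sub sequence in the patch) and the identity
(A+B) - 2*C = (A-C) + (B-C), can be checked with a small C++ sketch. The
function names and the default threshold parameter are assumptions for
illustration; the threshold of 8 is taken from the patch comment:

```cpp
#include <cstdint>

// Branchless |x|, mirroring the sar/xor/sub sequence in the patch.
// Assumes >> on a negative int32_t is an arithmetic shift (guaranteed since
// C++20, and the behaviour of mainstream compilers before that).
static inline int32_t absBranchless(int32_t x)
{
    const int32_t sign = x >> 31;   // 0 for x >= 0, -1 for x < 0
    return (x ^ sign) - sign;
}

// Flatness test in the rewritten form (A - C) + (B - C) == (A + B) - 2 * C.
static inline bool isFlat(int first, int middle, int last, int threshold = 8)
{
    return absBranchless((first - middle) + (last - middle)) < threshold;
}
```

Strong filtering is chosen only when this test holds for both the top and the
left edge, which matches the two jns .normal_filter32 exits in the quoted
assembly.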
>+





[x265] Fwd: [PATCH 3 of 3] SEA motion search:integralv functions avx2 implementation

2017-05-02 Thread Praveen Tiwari
-- Forwarded message --
From: 
Date: Tue, May 2, 2017 at 3:16 PM
Subject: [x265] [PATCH 3 of 3] SEA motion search:integralv functions avx2
implementation
To: x265-devel@videolan.org


# HG changeset patch
# User Vignesh Vijayakumar
# Date 1493121121 -19800
#  Tue Apr 25 17:22:01 2017 +0530
# Node ID e5ee88d08fcedee83efa63869a5a346c711a0e3d
# Parent  1afc127e62b4502c8f052ee989843c64b45ffc56
SEA motion search:integralv functions avx2 implementation

diff -r 1afc127e62b4 -r e5ee88d08fce source/common/CMakeLists.txt
--- a/source/common/CMakeLists.txt  Fri Apr 28 11:22:29 2017 +0530
+++ b/source/common/CMakeLists.txt  Tue Apr 25 17:22:01 2017 +0530
@@ -57,10 +57,10 @@
 set(VEC_PRIMITIVES vec/vec-primitives.cpp ${PRIMITIVES})
 source_group(Intrinsics FILES ${VEC_PRIMITIVES})

-set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h
dct8.h loopfilter.h)
+set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h
dct8.h loopfilter.h seaintegral.h)
 set(A_SRCS pixel-a.asm const-a.asm cpu-a.asm ssd-a.asm mc-a.asm
mc-a2.asm pixel-util8.asm blockcopy8.asm
-   pixeladd8.asm dct8.asm)
+   pixeladd8.asm dct8.asm seaintegral.asm)
 if(HIGH_BIT_DEPTH)
 set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm
loopfilter.asm)
 else()
diff -r 1afc127e62b4 -r e5ee88d08fce source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Fri Apr 28 11:22:29 2017
+0530
+++ b/source/common/x86/asm-primitives.cpp  Tue Apr 25 17:22:01 2017
+0530
@@ -2158,6 +2158,13 @@
 p.fix8Unpack = PFX(cutree_fix8_unpack_avx2);
 p.fix8Pack = PFX(cutree_fix8_pack_avx2);

+p.integral_init4v = PFX(integral4v_avx2);
+p.integral_init8v = PFX(integral8v_avx2);
+p.integral_init12v = PFX(integral12v_avx2);
+p.integral_init16v = PFX(integral16v_avx2);
+p.integral_init24v = PFX(integral24v_avx2);
+p.integral_init32v = PFX(integral32v_avx2);
+
 /* TODO: This kernel needs to be modified to work with
HIGH_BIT_DEPTH only
 p.planeClipAndMax = PFX(planeClipAndMax_avx2); */

@@ -2178,6 +2185,7 @@
 p.costCoeffNxN = PFX(costCoeffNxN_avx2_bmi2);
 }
 }
+
 }
 #else // if HIGH_BIT_DEPTH

@@ -3696,6 +3704,13 @@
 p.fix8Unpack = PFX(cutree_fix8_unpack_avx2);
 p.fix8Pack = PFX(cutree_fix8_pack_avx2);

+p.integral_init4v = PFX(integral4v_avx2);
+p.integral_init8v = PFX(integral8v_avx2);
+p.integral_init12v = PFX(integral12v_avx2);
+p.integral_init16v = PFX(integral16v_avx2);
+p.integral_init24v = PFX(integral24v_avx2);
+p.integral_init32v = PFX(integral32v_avx2);
+
 }
 #endif
 }
diff -r 1afc127e62b4 -r e5ee88d08fce source/common/x86/seaintegral.asm
--- /dev/null   Thu Jan 01 00:00:00 1970 +
+++ b/source/common/x86/seaintegral.asm Tue Apr 25 17:22:01 2017 +0530
@@ -0,0 +1,155 @@
+;*****************************************************************************
+;* Copyright (C) 2013-2017 MulticoreWare, Inc
+;*
+;* Authors: Jayashri Murugan 
+;*  Vignesh V Menon 
+;*
+;* This program is free software; you can redistribute it and/or modify
+;* it under the terms of the GNU General Public License as published by
+;* the Free Software Foundation; either version 2 of the License, or
+;* (at your option) any later version.
+;*
+;* This program is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;* GNU General Public License for more details.
+;*
+;* You should have received a copy of the GNU General Public License
+;* along with this program; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+;*
+;* This program is also available under a commercial proprietary license.
+;* For more information, contact us at license @ x265.com.
+;*****************************************************************************/
+
+%include "x86inc.asm"
+%include "x86util.asm"
+
+SECTION .text
+
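+;-----------------------------------------------------------------------------
+;void integral_init4v_c(uint32_t *sum4, intptr_t stride)
+;-----------------------------------------------------------------------------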
+INIT_YMM avx2
+cglobal integral4v, 2, 4, 2
+
+mov r2, 0

xor is a faster way of clearing a register.


+mov r3, r1

What are the possible values of stride here? Is it an arbitrary number or a
multiple of a specific number?


+shl r3, 4
+
+.loop:
+movu m0, [r0]
+movu m1, [r0 + r3]
+psubd   m0, m1, m0
+movu [r0], m0
+add r2, 8
+add r0, 32
+cmp r2, r1
+jl  .loop
+RET
+
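
For reference, the loop above can be read as the following scalar C++, derived
here from the assembly itself: r3 is stride * 16 bytes, i.e. four rows of
32-bit values, and each element is replaced by the value four rows below minus
itself. The names integral4vRef and demoIntegral4v are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: each entry becomes (value four rows down) - (its own value).
static void integral4vRef(uint32_t* sum4, intptr_t stride)
{
    for (intptr_t x = 0; x < stride; x++)
        sum4[x] = sum4[x + 4 * stride] - sum4[x];
}

// Builds a 5-row buffer where row r holds r * 10 + x, then runs the
// reference on row 0 and returns the whole buffer.
static std::vector<uint32_t> demoIntegral4v(intptr_t stride)
{
    std::vector<uint32_t> buf(static_cast<size_t>(5 * stride));
    for (intptr_t r = 0; r < 5; r++)
        for (intptr_t x = 0; x < stride; x++)
            buf[static_cast<size_t>(r * stride + x)] = static_cast<uint32_t>(r * 10 + x);
    integral4vRef(buf.data(), stride);
    return buf;
}
```

The AVX2 loop handles eight 32-bit values per iteration, so with a stride that
is not a multiple of 8 it would touch elements past the end of the row. That
appears to be the reason for the question about possible stride values.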

[x265] Fwd: [PATCH 2 of 3] SEA motion search:Add testbench for integralv functions

2017-05-02 Thread Praveen Tiwari
-- Forwarded message --
From: 
Date: 2017-05-02 15:16 GMT+05:30
Subject: [x265] [PATCH 2 of 3] SEA motion search:Add testbench for
integralv functions
To: x265-devel@videolan.org


# HG changeset patch
# User Vignesh Vijayakumar
# Date 1493358749 -19800
#  Fri Apr 28 11:22:29 2017 +0530
# Node ID 1afc127e62b4502c8f052ee989843c64b45ffc56
# Parent  cb67dffd0e2a596c8d3c6d042b8e6c532487d427
SEA motion search:Add testbench for integralv functions

diff -r cb67dffd0e2a -r 1afc127e62b4 source/test/pixelharness.cpp
--- a/source/test/pixelharness.cpp  Tue May 02 09:58:13 2017 +0530
+++ b/source/test/pixelharness.cpp  Fri Apr 28 11:22:29 2017 +0530
@@ -2003,6 +2003,228 @@
 return true;
 }

+bool PixelHarness::check_integral_init4v(integral4v_t ref, integral4v_t
opt)
+{
+intptr_t srcStep = 64;
+int j = 0;
+uint32_t sum_ans[BUFFSIZE] = { 0 };
+uint32_t sum_ans1[BUFFSIZE] = { 0 };

Better names please; check the existing naming conventions.

+
+for (int i = 0; i < 64; i++)
+{
+sum_ans[i] = pixel_test_buff[0][i];
+sum_ans1[i] = pixel_test_buff[0][i];
+}
+for (int i = 0, k = 0; i < BUFFSIZE; i++)
+{
+if (i % 64 == 0)
+k++;
+sum_ans[i] = sum_ans[i % 64] + k;
+sum_ans1[i] = sum_ans1[i % 64] + k;
+}
+int padx = 4;
+int pady = 4;
+uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx;
+uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx;
+for (int i = 0; i < ITERS; i++)
+{
+ref(sum_ans_ptr, srcStep);
+checked(opt, sum_ans1_ptr, srcStep);
+
+if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * BUFFSIZE))
+return false;
+
+reportfail()
+j += INCR;
+}
+return true;
+}
+
+bool PixelHarness::check_integral_init8v(integral8v_t ref, integral8v_t
opt)
+ {
+intptr_t srcStep = 64;
+int j = 0;
+uint32_t sum_ans[BUFFSIZE] = { 0 };
+uint32_t sum_ans1[BUFFSIZE] = { 0 };
+
+for (int i = 0; i < 64; i++)
+{
+sum_ans[i] = pixel_test_buff[0][i];
+sum_ans1[i] = pixel_test_buff[0][i];
+}
+for (int i = 0, k = 0; i < BUFFSIZE; i++)
+{
+if (i % 64 == 0)
+k++;
+sum_ans[i] = sum_ans[i % 64] + k;
+sum_ans1[i] = sum_ans1[i % 64] + k;
+}
+int padx = 4;
+int pady = 4;
+uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx;
+uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx;
+for (int i = 0; i < ITERS; i++)
+{
+ref(sum_ans_ptr, srcStep);
+checked(opt, sum_ans1_ptr, srcStep);
+
+if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * BUFFSIZE))
+return false;
+
+reportfail()
+j += INCR;
+}
+return true;
+}
+
+bool PixelHarness::check_integral_init12v(integral12v_t ref, integral12v_t
opt)
+ {
+intptr_t srcStep = 64;
+int j = 0;
+uint32_t sum_ans[BUFFSIZE] = { 0 };
+uint32_t sum_ans1[BUFFSIZE] = { 0 };
+
+for (int i = 0; i < 64; i++)
+{
+sum_ans[i] = pixel_test_buff[0][i];
+sum_ans1[i] = pixel_test_buff[0][i];
+}
+for (int i = 0, k = 0; i < BUFFSIZE; i++)
+{
+if (i % 64 == 0)
+k++;
+sum_ans[i] = sum_ans[i % 64] + k;
+sum_ans1[i] = sum_ans1[i % 64] + k;
+}
+int padx = 4;
+int pady = 4;
+uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx;
+uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx;
+for (int i = 0; i < ITERS; i++)
+{
+ref(sum_ans_ptr, srcStep);
+checked(opt, sum_ans1_ptr, srcStep);
+
+if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * BUFFSIZE))
+return false;
+
+reportfail()
+j += INCR;
+}
+return true;
+}
+
+bool PixelHarness::check_integral_init16v(integral16v_t ref, integral16v_t
opt)
+{
+intptr_t srcStep = 64;
+int j = 0;
+uint32_t sum_ans[BUFFSIZE] = { 0 };
+uint32_t sum_ans1[BUFFSIZE] = { 0 };
+
+for (int i = 0; i < 64; i++)
+{
+sum_ans[i] = pixel_test_buff[0][i];
+sum_ans1[i] = pixel_test_buff[0][i];
+}
+for (int i = 0, k = 0; i < BUFFSIZE; i++)
+{
+if (i % 64 == 0)
+k++;
+sum_ans[i] = sum_ans[i % 64] + k;
+sum_ans1[i] = sum_ans1[i % 64] + k;
+}
+int padx = 4;
+int pady = 4;
+uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx;
+uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx;
+for (int i = 0; i < ITERS; i++)
+{
+ref(sum_ans_ptr, srcStep);
+checked(opt, sum_ans1_ptr, srcStep);
+
+if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * BUFFSIZE))
+return false;
+
+reportfail()
+j += INCR;
+}
+return true;
+}
+
+bool PixelHarness::check_integral_init24v(integral24v_t ref, integral24v_t
opt)
+{
+intptr_t srcStep = 64;
+int j = 0;
+uint32_t 

[x265] Fwd: [PATCH 1 of 3] SEA motion search:Setup asm primitives for integral calculation

2017-05-02 Thread Praveen Tiwari
-- Forwarded message --
From: 
Date: Tue, May 2, 2017 at 3:16 PM
Subject: [x265] [PATCH 1 of 3] SEA motion search:Setup asm primitives for
integral calculation
To: x265-devel@videolan.org


# HG changeset patch
# User Vignesh Vijayakumar
# Date 1493699293 -19800
#  Tue May 02 09:58:13 2017 +0530
# Node ID cb67dffd0e2a596c8d3c6d042b8e6c532487d427
# Parent  5bc5e73760cdb61d2674e74cc52149fa0603af8a
SEA motion search:Setup asm primitives for integral calculation

diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/primitives.cpp
--- a/source/common/primitives.cpp  Sat Apr 22 17:00:28 2017 -0700
+++ b/source/common/primitives.cpp  Tue May 02 09:58:13 2017 +0530
@@ -57,6 +57,7 @@
 void setupIntraPrimitives_c(EncoderPrimitives &p);
 void setupLoopFilterPrimitives_c(EncoderPrimitives &p);
 void setupSaoPrimitives_c(EncoderPrimitives &p);
+void setupSeaIntegralPrimitives_c(EncoderPrimitives &p);

 void setupCPrimitives(EncoderPrimitives &p)
 {
@@ -66,6 +67,7 @@
 setupIntraPrimitives_c(p);  // intrapred.cpp
 setupLoopFilterPrimitives_c(p); // loopfilter.cpp
 setupSaoPrimitives_c(p);// sao.cpp
+setupSeaIntegralPrimitives_c(p);  // framefilter.cpp
 }

 void setupAliasPrimitives(EncoderPrimitives &p)
diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/primitives.h
--- a/source/common/primitives.hSat Apr 22 17:00:28 2017 -0700
+++ b/source/common/primitives.hTue May 02 09:58:13 2017 +0530
@@ -202,6 +202,18 @@

 typedef void (*pelFilterLumaStrong_t)(pixel* src, intptr_t srcStep,
intptr_t offset, int32_t tcP, int32_t tcQ);
 typedef void (*pelFilterChroma_t)(pixel* src, intptr_t srcStep, intptr_t
offset, int32_t tc, int32_t maskP, int32_t maskQ);
+
+typedef void(*integral4h_t)(uint32_t *sum, pixel *pix, intptr_t stride);
+typedef void(*integral8h_t)(uint32_t *sum, pixel *pix, intptr_t stride);
+typedef void(*integral12h_t)(uint32_t *sum, pixel *pix, intptr_t stride);
+typedef void(*integral16h_t)(uint32_t *sum, pixel *pix, intptr_t stride);
+typedef void(*integral24h_t)(uint32_t *sum, pixel *pix, intptr_t stride);
+typedef void(*integral32h_t)(uint32_t *sum, pixel *pix, intptr_t stride);
+
+typedef void(*integral4v_t)(uint32_t *sum, intptr_t stride);
+typedef void(*integral8v_t)(uint32_t *sum, intptr_t stride);
+typedef void(*integral12v_t)(uint32_t *sum, intptr_t stride);
+typedef void(*integral16v_t)(uint32_t *sum, intptr_t stride);
+typedef void(*integral24v_t)(uint32_t *sum, intptr_t stride);
+typedef void(*integral32v_t)(uint32_t *sum, intptr_t stride);

Only two typedefs are needed here, one for horizontal and one for vertical;
the rest of the typedefs are redundant.
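
A minimal sketch of what this review comment seems to ask for, assuming an 8-bit build. All twelve typedefs in the patch share one of two signatures, so one typedef per direction plus a small table indexed by block width would be enough. The names IntegralSize, IntegralPrimitives and integralSizeIndex are hypothetical and are not taken from the x265 tree.

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// One signature per direction covers every block width.
typedef void (*integralh_t)(uint32_t* sum, pixel* pix, intptr_t stride);
typedef void (*integralv_t)(uint32_t* sum, intptr_t stride);

// Hypothetical index names, one per width used in the patch.
enum IntegralSize
{
    INTEGRAL_4, INTEGRAL_8, INTEGRAL_12, INTEGRAL_16, INTEGRAL_24, INTEGRAL_32,
    NUM_INTEGRAL_SIZE
};

// Hypothetical stand-in for the relevant part of EncoderPrimitives.
struct IntegralPrimitives
{
    integralh_t integral_inith[NUM_INTEGRAL_SIZE];
    integralv_t integral_initv[NUM_INTEGRAL_SIZE];
};

// Map a block width to its table index; returns -1 for unsupported widths.
inline int integralSizeIndex(int width)
{
    switch (width)
    {
    case 4:  return INTEGRAL_4;
    case 8:  return INTEGRAL_8;
    case 12: return INTEGRAL_12;
    case 16: return INTEGRAL_16;
    case 24: return INTEGRAL_24;
    case 32: return INTEGRAL_32;
    default: return -1;
    }
}
```

With this layout a caller would look up p.integral_initv[integralSizeIndex(12)] instead of a dedicated integral_init12v member.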

 /* Function pointers to optimized encoder primitives. Each pointer can
reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function
*/
@@ -342,6 +354,19 @@
 pelFilterLumaStrong_t pelFilterLumaStrong[2]; // EDGE_VER = 0,
EDGE_HOR = 1
 pelFilterChroma_t pelFilterChroma[2]; // EDGE_VER = 0,
EDGE_HOR = 1

+integral4h_t integral_init4h;
+integral8h_t integral_init8h;
+integral12h_t integral_init12h;
+integral16h_t integral_init16h;
+integral24h_t integral_init24h;
+integral32h_t integral_init32h;
+integral4v_t integral_init4v;
+integral8v_t integral_init8v;
+integral12v_t integral_init12v;
+integral16v_t integral_init16v;
+integral24v_t integral_init24v;
+integral32v_t integral_init32v;
+

Use one array of appropriate size for horizontal and another for vertical.


 /* There is one set of chroma primitives per color space. An encoder
will
  * have just a single color space and thus it will only ever use one
entry
  * in this array. However we always fill all entries in the array in
case
diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Sat Apr 22 17:00:28 2017
-0700
+++ b/source/common/x86/asm-primitives.cpp  Tue May 02 09:58:13 2017
+0530
@@ -114,6 +114,7 @@
 #include "blockcopy8.h"
 #include "intrapred.h"
 #include "dct8.h"
+#include "seaintegral.h"
 }

 #define ALL_LUMA_CU_TYPED(prim, fncdef, fname, cpu) \
diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/x86/seaintegral.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +
+++ b/source/common/x86/seaintegral.h   Tue May 02 09:58:13 2017 +0530
@@ -0,0 +1,41 @@
+/**
***
+* Copyright (C) 2013-2017 MulticoreWare, Inc
+*
+* Authors: Vignesh V Menon 
+*  Jayashri Murugan 
+*
+* This program is free software; you can redistribute it and/or modify
+* it under the 

Re: [x265] Interested in fast popcnt substitute below SSE4.2?

2017-03-01 Thread Praveen Tiwari
Hi Mario,

Sorry for the late reply. You have shared interesting and useful
information. We are currently doing some experimental refactoring of the
ASM code base, so it might take some time. We hope to receive more posts
like this.

Regards,
Praveen Tiwari

On Wed, Mar 1, 2017 at 8:21 PM, Mario *LigH* Rohkrämer <cont...@ligh.de>
wrote:

> Apparently not interesting...
>
>
>
> On 23.02.2017 at 10:05, Mario *LigH* Rohkrämer <cont...@ligh.de>
> wrote:
>
> Another point of view on this matter:
>>
>> http://danluu.com/assembly-intrinsics/
>>
>> It seems to put the impact into perspective.
>>
>> I don't know if you already knew about all this before...
>>
>>
>> On 22.02.2017 at 13:39, Mario *LigH* Rohkrämer <cont...@ligh.de>
>> wrote:
>>
>> http://wm.ite.pl/articles/sse-popcount.html
>>>
>>> May even be faster than the popcnt instruction implemented in a
>>> supporting CPU!
>>>
>>> Found via a German "conspiracy news" blog (no, that's not at all meant
>>> seriously) which sometimes also mentions computer security issues and
>>> interesting programming challenges: https://blog.fefe.de/?ts=a653b91f
>>>
>>>
>>
>>
>
> --
>
> Fun and success!
> Mario *LigH* Rohkrämer
> mailto:cont...@ligh.de
>
> ___
> x265-devel mailing list
> x265-devel@videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>


Re: [x265] [PATCH 1 of 9] pcs: update design to have 'm_achivedFps' for every PCS Instance

2016-11-17 Thread Praveen Tiwari
Please ignore this patch. Thanks.


On Thu, Nov 17, 2016 at 8:51 PM, <prav...@multicorewareinc.com> wrote:

> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1479128885 -19800
> #  Mon Nov 14 18:38:05 2016 +0530
> # Branch stable
> # Node ID 8defd4e7b2e4875247e4ec95e0dd3b9630983526
> # Parent  bdf273f9521784ceeda868222d415303a0bcf58b
> pcs: update design to have 'm_achivedFps' for every PCS Instance
>
> diff -r bdf273f95217 -r 8defd4e7b2e4 source/api-uhdkit.cpp
> --- a/source/api-uhdkit.cpp Tue Nov 08 14:20:24 2016 +0530
> +++ b/source/api-uhdkit.cpp Mon Nov 14 18:38:05 2016 +0530
> @@ -206,8 +206,6 @@
>  return -1;
>  if (numEncoded > 0)
>  {
> -uhdkitEnc->m_achievedFps = numEncoded * 100.0 /
> (double)(endTime - startTime);
> -uhdkitEnc->m_achievedFps = uhdkitEnc->m_achievedFps /
> uhdkitEnc->m_param->gops; // Achieved fps for each gop encoder
>  uhdkitEnc->m_encodedFrameCount += numEncoded;
>  controllerIndex = ((uhdkitEnc->m_encodedFrameCount - 1) /
> uhdkitEnc->m_param->x265Param->keyframeMax) % uhdkitEnc->m_param->gops;
>  X265_CHECK(controllerIndex >= 0 && controllerIndex <
> uhdkitEnc->m_param->gops, "Invalid controllerIndex: %d, must be between 0
> and %d\n", controllerIndex, uhdkitEnc->m_param->gops);
> diff -r bdf273f95217 -r 8defd4e7b2e4 source/pcs/api-pcs.cpp
> --- a/source/pcs/api-pcs.cppTue Nov 08 14:20:24 2016 +0530
> +++ b/source/pcs/api-pcs.cppMon Nov 14 18:38:05 2016 +0530
> @@ -211,6 +211,7 @@
>  m_pcsParam->statusPrintInterval  = param->statusPrintInterval;
>  m_curTimeStamp = m_lastTimeStamp = X265_NS::x265_mdate();
>  m_framesWindow = 1;
> +m_achievedFps = 0.0;
>  m_outFrameCountOfLastAccumulatorReset = 0;
>  time(_lastStatusOutputTime);
>
> @@ -289,11 +290,11 @@
>  int64_t elapsedEncTime = m_curTimeStamp - m_lastTimeStamp;
>  int controllerIndex = ((uhdkitEnc->m_encodedFrameCount - 1) /
> uhdkitEnc->m_param->x265Param->keyframeMax) % uhdkitEnc->m_param->gops;
>  X265_CHECK(controllerIndex >= 0 && controllerIndex <
> uhdkitEnc->m_param->gops, "Invalid controllerIndex: %d, must be between 0
> and %d\n", controllerIndex, uhdkitEnc->m_param->gops);
> -if (((m_bScenecut == 1) && elapsedEncTime > 0) || elapsedEncTime
> >= 30 || uhdkitEnc->m_achievedFps < m_pcsParam->fpsSetPoint)
> +if (((m_bScenecut == 1) && elapsedEncTime > 0) || elapsedEncTime
> >= 30 || m_achievedFps < m_pcsParam->fpsSetPoint)
>  {
>  // Don't allow outrageously high frame rate measurements to
> skew the controller.
> -uhdkitEnc->m_achievedFps = X265_MIN(uhdkitEnc->m_achievedFps,
> 4 * m_pcsParam->fpsSetPoint);
> -error = (m_pcsParam->fpsSetPoint - uhdkitEnc->m_achievedFps)
> / m_pcsParam->fpsSetPoint;
> +m_achievedFps = X265_MIN(m_achievedFps, 4 *
> m_pcsParam->fpsSetPoint);
> +error = (m_pcsParam->fpsSetPoint - m_achievedFps) /
> m_pcsParam->fpsSetPoint;
>
>  if (m_pcsParam->integralReset > 0)
>  {
> @@ -308,7 +309,7 @@
>  {
>  double lowerBound = (m_pcsParam->fpsSetPoint *
> SATURATION_RANGE_MIN) / 100.0;   /* Lower bound, 3% of set-point */
>  double upperBound = (m_pcsParam->fpsSetPoint *
> SATURATION_RANGE_MAX) / 100.0;   /* Upper bound, 10% of set-point */
> -double fpsDiff = (uhdkitEnc->m_achievedFps -
> m_pcsParam->fpsSetPoint);
> +double fpsDiff =(m_achievedFps -
> m_pcsParam->fpsSetPoint);
>  resetErrorAccumulater = (fpsDiff >= lowerBound && fpsDiff
> <= upperBound) || m_bScenecut; /* Steady state, or scenecut */
>  }
>
> @@ -388,7 +389,7 @@
>  m_outFrameCountOfLastAccumulatorReset = uhdkitEnc->m_
> encodedFrameCount;
>  m_lastTimeStamp = m_curTimeStamp;
>  if (uhdkitEnc->m_reconfigParam->logLevel == UHDKIT_LOG_INFO)
> -
> uhdkit_pcs_printStatus(>m_reconfigParam[controllerIndex],
> uhdkitEnc->m_achievedFps);
> +
> uhdkit_pcs_printStatus(>m_reconfigParam[controllerIndex],
> m_achievedFps);
>  }
>  return true;
>  }
> @@ -398,6 +399,11 @@
>  m_bScenecut = pic->frameData.bScenecut;
>  }
>
> +void pcs::uhdkit_pcs_update_fps(int64_t startTime, int64_t endTime, int
&
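
A minimal sketch of the error term computed in the quoted hunk, assuming plain doubles for the achieved frame rate and the set-point. The function name normalizedFpsError is hypothetical; the clamp to four times the set-point and the division by the set-point follow the quoted lines.

```cpp
#include <algorithm>

// Relative frame-rate error: positive when the encoder runs slower than the
// set-point, negative when it runs faster.
inline double normalizedFpsError(double achievedFps, double fpsSetPoint)
{
    // Don't let outrageously high measurements skew the controller:
    // clamp to 4x the set-point before computing the error.
    double clamped = std::min(achievedFps, 4.0 * fpsSetPoint);
    return (fpsSetPoint - clamped) / fpsSetPoint;
}
```

Because of the clamp, the error can never drop below -3.0, however fast a single measurement window appears to be.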

Re: [x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)

2016-09-23 Thread Praveen Tiwari
Hi Min,
 Can you please verify with VC12? I double-checked this and I am
getting different output with this patch. The 8-bit encoded file size is the
same but the binary differs (compared using Beyond Compare); for 10-bit and
12-bit, both the size and the binary differ. I applied your patch, built once
(as for an 8-bit build), collected the outputs for all depths (8, 10 and 12),
and compared them with three separate builds of x265, i.e. 8-bit, 10-bit and
12-bit.

Regards,
Praveen


On Fri, Sep 23, 2016 at 2:47 AM, chen <chenm...@163.com> wrote:

> Hi Praveen,
>
> I tested your command line on my VS2008 build.
> I built the three-bit-depth version and compared it with the
> single-bit-depth version, but the outputs still match for both 10 and 12 bit.
>
> Regards,
> Min
>
> At 2016-09-22 14:39:50,"Praveen Tiwari" <prav...@multicorewareinc.com>
> wrote:
>
> Hi Min,
>
>  After this patch the outputs change; tested with the following command
> line for 10-bit and 12-bit outputs.
>
> --input=NebutaFestival_2560x1600_60_10bit_crop.yuv --input-res=2560x1600
> --fps=60  --numa-pools="NULL" --output-depth=12 --hash=1 -o  NFOut12.hevc
>
>
>
>
> Regards,
> Praveen
>
> On Thu, Sep 15, 2016 at 1:55 AM, chen <chenm...@163.com> wrote:
>
>> From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001
>> From: Min Chen <min.c...@multicorewareinc.com>
>> Date: Wed, 14 Sep 2016 15:23:38 -0500
>> Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL
>> (Workaround)
>>
>> ---
>>  source/CMakeLists.txt |   40 +++-
>>  1 files changed, 39 insertions(+), 1 deletions(-)
>>
>> diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
>> index dd19d28..c2c2f7f 100644
>> --- a/source/CMakeLists.txt
>> +++ b/source/CMakeLists.txt
>> @@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
>>  configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
>> "${PROJECT_BINARY_DIR}/x265_config.h")
>>
>> +
>>  SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake"
>> "${CMAKE_MODULE_PATH}")
>>
>>  # System architecture detection
>> @@ -396,6 +397,39 @@ if(WIN32)
>>  endif(WINXP_SUPPORT)
>>  endif()
>>
>> +
>> +if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?setParamAspectRatio@x265
>> @@YAXPEAUx265_param@@HH@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?getParamAspectRatio@x265
>> @@YAXPEAUx265_param@@AEAH1@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log_file@x265
>> @@YAXPEBUx265_param@@PEBDH1ZZ\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log@x265
>> @@YAXPEBUx265_param@@PEBDH1ZZ\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
>> "?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
>> "?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
>> "?x265_api_query@x265_10bit@@YAPEBUx265_api@@HHPEAH@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
>> "?x265_api_query@x265_12bit@@YAPEBUx265_api@@HHPEAH@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265
>> @@YA_JXZ\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
>> "?x265_picturePlaneSize@x265@@YAI@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265
>> @@YANN@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265
>> @@YANN@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_report_simd@x265
>> @@YAXPEAUx265_param@@@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_fopen@x265
>> @@YAPEAU_iobuf@@PEBD0@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_malloc@x265
>> @@YAPEAX_K@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265
>> @@YAXPEAX@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_atoi@x265
>> @@YAHPEBDAEA_N@Z\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?start@Thread@x265@
>> @QEAA_NXZ\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?stop@Thread@x265@
>> @QEAAXXZ\n")
>> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??0Thre

Re: [x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)

2016-09-22 Thread Praveen Tiwari
Hi Min,

 After this patch the outputs change; tested with the following command
line for 10-bit and 12-bit outputs.

--input=NebutaFestival_2560x1600_60_10bit_crop.yuv --input-res=2560x1600
--fps=60  --numa-pools="NULL" --output-depth=12 --hash=1 -o  NFOut12.hevc




Regards,
Praveen

On Thu, Sep 15, 2016 at 1:55 AM, chen  wrote:

> From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001
> From: Min Chen 
> Date: Wed, 14 Sep 2016 15:23:38 -0500
> Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL
> (Workaround)
>
> ---
>  source/CMakeLists.txt |   40 +++-
>  1 files changed, 39 insertions(+), 1 deletions(-)
>
> diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
> index dd19d28..c2c2f7f 100644
> --- a/source/CMakeLists.txt
> +++ b/source/CMakeLists.txt
> @@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
>  configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
> "${PROJECT_BINARY_DIR}/x265_config.h")
>
> +
>  SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake"
> "${CMAKE_MODULE_PATH}")
>
>  # System architecture detection
> @@ -396,6 +397,39 @@ if(WIN32)
>  endif(WINXP_SUPPORT)
>  endif()
>
> +
> +if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?setParamAspectRatio@x265
> @@YAXPEAUx265_param@@HH@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?getParamAspectRatio@x265
> @@YAXPEAUx265_param@@AEAH1@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log_file@x265@@
> YAXPEBUx265_param@@PEBDH1ZZ\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log@x265@@
> YAXPEBUx265_param@@PEBDH1ZZ\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
> "?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
> "?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_api_query@x265_10bit
> @@YAPEBUx265_api@@HHPEAH@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_api_query@x265_12bit
> @@YAPEBUx265_api@@HHPEAH@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265
> @@YA_JXZ\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def
> "?x265_picturePlaneSize@x265@@YAI@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265
> @@YANN@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265
> @@YANN@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_report_simd@x265@@
> YAXPEAUx265_param@@@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_fopen@x265@@YAPEAU_
> iobuf@@PEBD0@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_malloc@x265
> @@YAPEAX_K@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265
> @@YAXPEAX@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_atoi@x265
> @@YAHPEBDAEA_N@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?start@Thread@x265@
> @QEAA_NXZ\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?stop@Thread@x265@
> @QEAAXXZ\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??0Thread@x265@@QEAA@XZ
> \n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??1Thread@x265@@UEAA@XZ
> \n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?g_maxCUDepth@x265
> @@3IA\n")
> +if(WINXP_SUPPORT)
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_init@x265@@
> YAHPEAUConditionVariable@1@@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_wait@x265@@
> YAHPEAUConditionVariable@1@PEAU_RTL_CRITICAL_SECTION@@K@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_destroy@x265@@
> YAXPEAUConditionVariable@1@@Z\n")
> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_broadcast@x265
> @@YAXPEAUConditionVariable@1@@Z\n")
> +endif()
> +endif()
> +
>  include(version) # determine X265_VERSION and X265_LATEST_TAG
>  include_directories(. common encoder "${PROJECT_BINARY_DIR}")
>
> @@ -608,7 +642,11 @@ if(ENABLE_CLI)
>  if(WIN32 OR NOT ENABLE_SHARED OR INTEL_CXX)
>  # The CLI cannot link to the shared library on Windows, it
>  # requires internal APIs not exported from the DLL
> -target_link_libraries(cli x265-static ${PLATFORM_LIBS})
> +if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
> +target_link_libraries(cli x265-shared ${PLATFORM_LIBS})
> +else()
> +target_link_libraries(cli x265-static ${PLATFORM_LIBS})
> +endif()
>  else()
>  target_link_libraries(cli x265-shared ${PLATFORM_LIBS})
>  endif()
> --
> 1.7.9.msysgit.0
>
>
> ___
> x265-devel mailing list
> x265-devel@videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>

Re: [x265] [PATCH] threadpool.cpp: fix default pool param behaviour, if NULL or “” (default) x265 will use all available threads on each NUMA node

2016-09-08 Thread Praveen Tiwari
Please ignore this; this behaviour is not required for Linux systems.
Thanks.

Regards,
Praveen

On Wed, Sep 7, 2016 at 5:19 PM, <prav...@multicorewareinc.com> wrote:

> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1473246754 -19800
> #  Wed Sep 07 16:42:34 2016 +0530
> # Node ID 9587a394ba58a2c3a579db5fb3f7531daf49559b
> # Parent  df559450949bd085b0fc5e01332aa8458af2fa43
> threadpool.cpp: fix default pool param behaviour, if NULL or “” (default)
> x265 will use all available threads on each NUMA node
>
> diff -r df559450949b -r 9587a394ba58 source/common/threadpool.cpp
> --- a/source/common/threadpool.cpp  Wed Aug 10 13:26:18 2016 +0530
> +++ b/source/common/threadpool.cpp  Wed Sep 07 16:42:34 2016 +0530
> @@ -330,8 +330,8 @@
>  {
>  for (int j = i; j < numNumaNodes; j++)
>  {
> -threadsPerPool[numNumaNodes] += cpusPerNode[j];
> -nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << j);
> +threadsPerPool[j] += cpusPerNode[j];
> +nodeMaskPerPool[j] |= ((uint64_t)1 << j);
>  }
>  break;
>  }
> @@ -366,8 +366,8 @@
>  {
>  for (int i = 0; i < numNumaNodes; i++)
>  {
> -threadsPerPool[numNumaNodes]  += cpusPerNode[i];
> -nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << i);
> +threadsPerPool[i]  += cpusPerNode[i];
> +nodeMaskPerPool[i] |= ((uint64_t)1 << i);
>  }
>  }
>
>


Re: [x265] [PATCH] threadpool: fix warning: ‘int popCount(uint64_t)’ defined but not used [-Wunused-function]

2016-05-30 Thread Praveen Tiwari
As I remember, some NUMA functionality requires at least Windows 7; it is not
supported on earlier versions of Windows.

Regards,
Praveen

On Mon, May 30, 2016 at 6:43 PM, Mateusz <mateu...@poczta.onet.pl> wrote:

> There is a serious bug in the threadpool code that prevents it from working
> on Windows XP/Vista.
> VS 2015 error when compiling for 32-bit Windows XP:
> (ClCompile target) ->
>   I:\vs\x265\source\common\threadpool.cpp(590): error C3861:
> 'GetNumaNodeProcessorMaskEx': identifier not found [I:\vs\x265\ma\
> 8-b\common\common.vcxproj]
>
> Did you see patch https://patches.videolan.org/patch/13495/ (it fixes
> also this warning)?
>
>
> On 2016-05-30 at 14:45, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari <prav...@multicorewareinc.com>
> > # Date 1464585837 -19800
> > #  Mon May 30 10:53:57 2016 +0530
> > # Node ID b8dbe8d7c09e7fc0b7cce236569fc5df2eb70b1e
> > # Parent  aeade2e8d8688ebffb8455b8948d89d6a72e2c38
> > threadpool: fix warning: ‘int popCount(uint64_t)’ defined but not used
> [-Wunused-function]
> >  static int popCount(uint64_t x)
> >
> > diff -r aeade2e8d868 -r b8dbe8d7c09e source/common/threadpool.cpp
> > --- a/source/common/threadpool.cppThu May 26 16:45:09 2016 +0530
> > +++ b/source/common/threadpool.cppMon May 30 10:53:57 2016 +0530
> > @@ -68,6 +68,7 @@
> >  # define strcasecmp _stricmp
> >  #endif
> >
> > +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
> >  const uint64_t m1 = 0x5555555555555555; //binary: 0101...
> >  const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
> >  const uint64_t m3 = 0x0f0f0f0f0f0f0f0f; //binary:  4 zeros,  4 ones ...
> > @@ -80,6 +81,7 @@
> >  x = (x + (x >> 4)) & m3;
> >  return (x * h01) >> 56;
> >  }
> > +#endif
> >
> >  namespace X265_NS {
> >  // x265 private namespace
> >
> >
> >
> > ___
> > x265-devel mailing list
> > x265-devel@videolan.org
> > https://mailman.videolan.org/listinfo/x265-devel
> >
>
>
> ___
> x265-devel mailing list
> x265-devel@videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>


Re: [x265] [PATCH 1 of 7] threadpool.cpp: get correct CPU count for multisocket machines -> windows system fix

2016-05-23 Thread Praveen Tiwari
Hi,
I am combining these patches into a single patch along with some
updates, so please ignore them. On top of that, I will update
Mateusz's patch (CLI: new logic for '--pools ' option) to avoid
merge conflicts. Thanks.

Regards,
Praveen

On Fri, May 20, 2016 at 4:31 PM, <prav...@multicorewareinc.com> wrote:

> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1463655478 -19800
> #  Thu May 19 16:27:58 2016 +0530
> # Node ID 9a6ab28b736e1167ac26977d7da8ab2d23cc296f
> # Parent  aca781339b4c8dae94ff7da73f18cd4439757e87
> threadpool.cpp: get correct CPU count for multisocket machines -> windows
> system fix
>
> diff -r aca781339b4c -r 9a6ab28b736e source/common/threadpool.cpp
> --- a/source/common/threadpool.cpp  Tue May 10 15:33:17 2016 +0530
> +++ b/source/common/threadpool.cpp  Thu May 19 16:27:58 2016 +0530
> @@ -64,6 +64,19 @@
>  # define strcasecmp _stricmp
>  #endif
>
> +const uint64_t m1 = 0x5555555555555555; //binary: 0101...
> +const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
> +const uint64_t m3 = 0x0f0f0f0f0f0f0f0f; //binary:  4 zeros,  4 ones ...
> +const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of
> 0,1,2,3...
> +
> +int popCount(uint64_t x)
> +{
> +x -= (x >> 1) & m1;
> +x = (x & m2) + ((x >> 2) & m2);
> +x = (x + (x >> 4)) & m3;
> +return (x * h01) >> 56;
> +}
> +
>  namespace X265_NS {
>  // x265 private namespace
>
> @@ -525,9 +538,17 @@
>  int ThreadPool::getCpuCount()
>  {
>  #if _WIN32
> -SYSTEM_INFO sysinfo;
> -GetSystemInfo(&sysinfo);
> -return sysinfo.dwNumberOfProcessors;
> +enum { MAX_NODE_NUM = 127 };
> +int cpus = 0;
> +int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);
> +PGROUP_AFFINITY groupAffinityPointer = new GROUP_AFFINITY;
> +for (int i = 0; i < numNumaNodes; i++)
> +{
> +GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer);
> +cpus += popCount(groupAffinityPointer->Mask);
> +}
> +delete groupAffinityPointer;
> +return cpus;
>  #elif __unix__ && X265_ARCH_ARM
>  /* Return the number of processors configured by OS. Because, most
> embedded linux distributions
>   * uses only one processor as the scheduler doesn't have enough work
> to utilize all processors */
>
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
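
For readers following the popCount helper in the patch above, here is a self-contained sketch of the same bit-parallel technique. The masks are the standard values for this method (alternating bits, alternating bit pairs, alternating nibbles, and a byte-wise multiplier) and are an assumption here rather than a copy of the patch. The name popCount64 is hypothetical.

```cpp
#include <cstdint>

// Count the set bits of a 64-bit word without a loop or a popcnt instruction.
inline int popCount64(uint64_t x)
{
    const uint64_t m1  = 0x5555555555555555ULL; // binary: 0101...
    const uint64_t m2  = 0x3333333333333333ULL; // binary: 00110011...
    const uint64_t m4  = 0x0f0f0f0f0f0f0f0fULL; // binary: 4 zeros, 4 ones...
    const uint64_t h01 = 0x0101010101010101ULL; // sum of 256^0, 256^1, 256^2...

    x -= (x >> 1) & m1;             // each 2-bit field holds its own bit count
    x = (x & m2) + ((x >> 2) & m2); // each 4-bit field holds its own bit count
    x = (x + (x >> 4)) & m4;        // each byte holds its own bit count
    return (int)((x * h01) >> 56);  // top byte accumulates the sum of all bytes
}
```

In the patch the result is accumulated per NUMA node from the affinity mask, so the input is always a plain 64-bit mask.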


Re: [x265] [PATCH] ThreadPool.cpp: fix getCpuCount function for windows systems

2016-05-20 Thread Praveen Tiwari
Please ignore this; I am sending an updated patch. Thanks.

Regards,
Praveen

On Tue, May 17, 2016 at 7:17 PM, <prav...@multicorewareinc.com> wrote:

> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1463492830 -19800
> #  Tue May 17 19:17:10 2016 +0530
> # Node ID cf3c2e0dce0997a499ae1d50fda6891cae83e685
> # Parent  372fc5b12ed6003f8784702956ccf7203ea68a2e
> ThreadPool.cpp: fix getCpuCount function for windows systems
>
> diff -r 372fc5b12ed6 -r cf3c2e0dce09 source/common/threadpool.cpp
> --- a/source/common/threadpool.cpp  Tue May 17 19:06:36 2016 +0530
> +++ b/source/common/threadpool.cpp  Tue May 17 19:17:10 2016 +0530
> @@ -545,9 +545,17 @@
>  int ThreadPool::getCpuCount()
>  {
>  #if _WIN32
> -SYSTEM_INFO sysinfo;
> -GetSystemInfo(&sysinfo);
> -return sysinfo.dwNumberOfProcessors;
> +enum { MAX_NODE_NUM = 127 };
> +int cpus = 0;
> +int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);
> +PGROUP_AFFINITY groupAffinityPointer = new GROUP_AFFINITY;
> +for (int i = 0; i < numNumaNodes; i++)
> +{
> +GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer);
> +cpus += (int)bitCount(groupAffinityPointer->Mask);
> +}
> +delete groupAffinityPointer;
> +return cpus;
>  #elif __unix__ && X265_ARCH_ARM
>  /* Return the number of processors configured by OS. Because, most
> embedded linux distributions
>   * uses only one processor as the scheduler doesn't have enough work
> to utilize all processors */
>


Re: [x265] [PATCH] ThreadPool.cpp: fix core count for windows machines

2016-05-20 Thread Praveen Tiwari
Please ignore this; I am sending an updated patch. Thanks.

Regards,
Praveen

On Tue, May 17, 2016 at 8:01 PM, Pradeep Ramachandran <
prad...@multicorewareinc.com> wrote:

>
> On Tue, May 17, 2016 at 7:07 PM, <prav...@multicorewareinc.com> wrote:
>
>> # HG changeset patch
>> # User Praveen Tiwari <prav...@multicorewareinc.com>
>> # Date 1463492196 -19800
>> #  Tue May 17 19:06:36 2016 +0530
>> # Node ID 372fc5b12ed6003f8784702956ccf7203ea68a2e
>> # Parent  e5b5bdc3c154f908706fb75e006f9abf9b3de96f
>> ThreadPool.cpp: fix core count for windows machines
>>
>> diff -r e5b5bdc3c154 -r 372fc5b12ed6 source/common/threadpool.cpp
>> --- a/source/common/threadpool.cpp  Sat May 14 07:29:46 2016 +0530
>> +++ b/source/common/threadpool.cpp  Tue May 17 19:06:36 2016 +0530
>> @@ -27,6 +27,7 @@
>>  #include "threading.h"
>>
>>  #include 
>> +#include 
>>
>>  #if X86_64
>>
>> @@ -64,6 +65,18 @@
>>  # define strcasecmp _stricmp
>>  #endif
>>
>> +uint64_t bitCount(uint64_t value)
>> +{
>> +uint64_t count = 0;
>> +while (value > 0) // until all bits are zero
>> +{
>> +if ((value & 1) == 1) // check lower bit
>> +count++;
>> +value >>= 1;  // shift bits, removing lower bit
>> +}
>> +return count;
>> +}
>> +
>>  namespace X265_NS {
>>  // x265 private namespace
>>
>> @@ -238,7 +251,6 @@
>>  memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool));
>>
>>  int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);
>> -int cpuCount = getCpuCount();
>>  bool bNumaSupport = false;
>>
>>  #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
>> @@ -248,20 +260,28 @@
>>  #endif
>>
>>
>> +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
>> +PGROUP_AFFINITY groupAffinityPointer = new GROUP_AFFINITY;
>> +for (int i = 0; i < numNumaNodes; i++)
>> +{
>> +GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer);
>> +cpusPerNode[i] = (int)bitCount(groupAffinityPointer->Mask);
>> +}
>> +delete groupAffinityPointer;
>> +#elif HAVE_LIBNUMA
>> +int cpuCount = getCpuCount();
>>
>
> Can we move to the cleaner implementation of not relying on CPU counts for
> non-windows platforms also?
>
>
>>  for (int i = 0; i < cpuCount; i++)
>>  {
>> -#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
>> -UCHAR node;
>> -if (GetNumaProcessorNode((UCHAR)i, &node))
>> -cpusPerNode[X265_MIN(node, (UCHAR)MAX_NODE_NUM)]++;
>> -else
>> -#elif HAVE_LIBNUMA
>>  if (bNumaSupport >= 0)
>>  cpusPerNode[X265_MIN(numa_node_of_cpu(i), MAX_NODE_NUM)]++;
>> -else
>> +}
>> +#else
>> +int cpuCount = getCpuCount();
>> +for (int i = 0; i < cpuCount; i++)
>> +{
>> +cpusPerNode[0]++;
>> +}
>>
>
> How about cpusPerNode[0] = getCpuCount() here? The for loop is unnecessary.
>
>
>>  #endif
>> -cpusPerNode[0]++;
>> -}
>>
>>  if (bNumaSupport && p->logLevel >= X265_LOG_DEBUG)
>>  for (int i = 0; i < numNumaNodes; i++)
>> ___
>> x265-devel mailing list
>> x265-devel@videolan.org
>> https://mailman.videolan.org/listinfo/x265-devel
>>
>
>
> ___
> x265-devel mailing list
> x265-devel@videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>
>
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
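
A minimal sketch of the counting approach in the patch above, with the Windows call left out so that it builds anywhere. It assumes the per-node affinity masks have already been fetched into an array; countCpus and nodeMasks are hypothetical names, and bitCount is rewritten here to return int.

```cpp
#include <cstddef>
#include <cstdint>

// Count set bits by testing the lowest bit and shifting it out.
inline int bitCount(uint64_t value)
{
    int count = 0;
    while (value > 0) // until all bits are zero
    {
        count += (int)(value & 1); // add the lowest bit
        value >>= 1;               // shift bits, removing the lowest bit
    }
    return count;
}

// Total logical processors across all nodes, one affinity mask per node.
inline int countCpus(const uint64_t* nodeMasks, size_t numNodes)
{
    int cpus = 0;
    for (size_t i = 0; i < numNodes; i++)
        cpus += bitCount(nodeMasks[i]);
    return cpus;
}
```

For the non-NUMA fallback discussed in the review, the same total can be stored directly in cpusPerNode[0] instead of incrementing it in a loop.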


Re: [x265] [PATCH] motion.cpp: optimize 'X265_DIA_SEARCH' by eliminating costly branch instructions

2016-03-08 Thread praveen tiwari
Yes, this is meant to eliminate the if...else, so it performs a conditional
assignment while keeping the code correct. I will try to update the macro
definition. Thanks.

-Original Message-
From: "chen" <chenm...@163.com>
Sent: ‎09-‎03-‎2016 05:52
To: "Development for x265" <x265-devel@videolan.org>
Subject: Re: [x265] [PATCH] motion.cpp: optimize 'X265_DIA_SEARCH'
by eliminating costly branch instructions

I suggest you modify the macro instead.
Also, this patch depends on a side effect of a conditional statement, which is
bad code style.

At 2016-03-08 22:48:49,prav...@multicorewareinc.com wrote:
># HG changeset patch
># User Praveen Tiwari <prav...@multicorewareinc.com>
># Date 1457448163 -19800
>#  Tue Mar 08 20:12:43 2016 +0530
># Node ID 519441d72cf723dc3b279a91a6080f329729cb49
># Parent  0e1b6472c05e3a53538d8e064e502d8a7508eb6e
>motion.cpp: optimize 'X265_DIA_SEARCH' by eliminating costly branch 
>instructions
>
>diff -r 0e1b6472c05e -r 519441d72cf7 source/encoder/motion.cpp
>--- a/source/encoder/motion.cppTue Mar 08 19:08:57 2016 +0530
>+++ b/source/encoder/motion.cppTue Mar 08 20:12:43 2016 +0530
>@@ -659,10 +659,10 @@
> do
> {
> COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs);
>-COPY1_IF_LT(bcost, (costs[0] << 4) + 1);
>-COPY1_IF_LT(bcost, (costs[1] << 4) + 3);
>-COPY1_IF_LT(bcost, (costs[2] << 4) + 4);
>-COPY1_IF_LT(bcost, (costs[3] << 4) + 12);
>+(((costs[0] << 4) + 1) < bcost) && (bcost = ((costs[0] << 4) + 
>1));  // if ((y) < (x)) (x) = (y);
>+(((costs[1] << 4) + 3) < bcost) && (bcost = ((costs[1] << 4) + 
>3));
>+(((costs[2] << 4) + 4) < bcost) && (bcost = ((costs[2] << 4) + 
>4));
>+(((costs[3] << 4) + 12) < bcost) && (bcost = ((costs[3] << 4) + 
>12));
> if (!(bcost & 15))
> break;
> bmv.x -= (bcost << 28) >> 30;
>___
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
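
For readers following this thread, the sketch below contrasts the macro form with an explicit conditional assignment, and shows how the low four bits of the packed cost select the diamond direction. It is only an illustration under assumptions: MV, copyIfLess, signed2 and diamondStep are hypothetical names, the COPY1_IF_LT definition is taken from the comment in the patch, and the y update and the final masking follow the usual shape of this search rather than anything quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Macro form, as described by the comment in the patch:
// if ((y) < (x)) (x) = (y);
#define COPY1_IF_LT(x, y) do { if ((y) < (x)) (x) = (y); } while (0)

// Explicit conditional assignment. Compilers can usually turn this into a
// conditional move, and it does not rely on the side effect of '&&'.
static inline void copyIfLess(uint32_t& x, uint32_t y)
{
    x = (y < x) ? y : x;
}

// Sign-extend a 2-bit field: 0 -> 0, 1 -> 1, 2 -> -2, 3 -> -1.
static inline int signed2(uint32_t f)
{
    f &= 3;
    return (f & 2) ? static_cast<int>(f) - 4 : static_cast<int>(f);
}

struct MV { int x; int y; };  // hypothetical stand-in for the encoder's MV type

// One diamond step. bcost holds (cost << 4); each candidate adds a 4-bit tag
// whose upper two bits hold the amount subtracted from x and whose lower two
// bits hold the amount subtracted from y.
static MV diamondStep(uint32_t& bcost, const uint32_t costs[4], MV bmv)
{
    copyIfLess(bcost, (costs[0] << 4) + 1);   // tag 0001: move ( 0, -1)
    copyIfLess(bcost, (costs[1] << 4) + 3);   // tag 0011: move ( 0, +1)
    copyIfLess(bcost, (costs[2] << 4) + 4);   // tag 0100: move (-1,  0)
    copyIfLess(bcost, (costs[3] << 4) + 12);  // tag 1100: move (+1,  0)
    bmv.x -= signed2(bcost >> 2);  // portable stand-in for (bcost << 28) >> 30
    bmv.y -= signed2(bcost);       // assumed y counterpart of the line above
    bcost &= ~15u;                 // drop the tag, keep (cost << 4)
    return bmv;
}
```

With candidate costs {50, 40, 60, 70} and a starting best cost of 45, only the second candidate wins, so the step moves the vector from (0, 0) to (0, 1) and leaves bcost at 40 << 4. If no candidate wins, the tag bits stay zero and the vector is unchanged.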


Re: [x265] [PATCH] param: cleanup, print reconfigured param option along with its old and configured value

2016-03-07 Thread Praveen Tiwari
Please ignore this patch; it needs to be updated. Thanks.

Regards,
Praveen

On Tue, Mar 8, 2016 at 10:57 AM, <prav...@multicorewareinc.com> wrote:

> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1457356750 -19800
> #  Mon Mar 07 18:49:10 2016 +0530
> # Node ID 6f7dbb1c901cb5b5b88cc20c3213906465021338
> # Parent  88aebc166fa8e16f91d5f0acce77690003be9d91
> param: cleanup, print reconfigured param option along with its old and
> configured value
>
> diff -r 88aebc166fa8 -r 6f7dbb1c901c source/common/param.cpp
> --- a/source/common/param.cpp   Fri Mar 04 16:59:45 2016 +0530
> +++ b/source/common/param.cpp   Mon Mar 07 18:49:10 2016 +0530
> @@ -1373,36 +1373,31 @@
>  if (!param || !reconfiguredParam)
>  return;
>
> -x265_log(param,X265_LOG_INFO, "Reconfigured param options :\n");
> -
> -char buf[80] = { 0 };
>  char tmp[40];
> -#define TOOLCMP(COND1, COND2, STR, VAL)  if (COND1 != COND2) {
> sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); }
> -TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences,
> "ref=%d", reconfiguredParam->maxNumReferences);
> -TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize,
> "max-tu-size=%d", reconfiguredParam->maxTUSize);
> -TOOLCMP(param->searchRange, reconfiguredParam->searchRange,
> "merange=%d", reconfiguredParam->searchRange);
> -TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "subme=
> %d", reconfiguredParam->subpelRefine);
> -TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "rd=%d",
> reconfiguredParam->rdLevel);
> -TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "psy-rd=%.2lf",
> reconfiguredParam->psyRd);
> -TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "rdoq=%d",
> reconfiguredParam->rdoqLevel);
> -TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "psy-rdoq=%.2lf",
> reconfiguredParam->psyRdoq);
> -TOOLCMP(param->noiseReductionIntra,
> reconfiguredParam->noiseReductionIntra, "nr-intra=%d",
> reconfiguredParam->noiseReductionIntra);
> -TOOLCMP(param->noiseReductionInter,
> reconfiguredParam->noiseReductionInter, "nr-inter=%d",
> reconfiguredParam->noiseReductionInter);
> -TOOLCMP(param->bEnableTSkipFast, reconfiguredParam->bEnableTSkipFast,
> "tskip-fast=%d", reconfiguredParam->bEnableTSkipFast);
> -TOOLCMP(param->bEnableSignHiding,
> reconfiguredParam->bEnableSignHiding, "signhide=%d",
> reconfiguredParam->bEnableSignHiding);
> -TOOLCMP(param->bEnableFastIntra, reconfiguredParam->bEnableFastIntra,
> "fast-intra=%d", reconfiguredParam->bEnableFastIntra);
> -if (param->bEnableLoopFilter && (param->deblockingFilterBetaOffset !=
> reconfiguredParam->deblockingFilterBetaOffset
> +#define TOOLCMP(COND1, COND2, STR, OLD_VAL, NEW_VAL)  if (COND1 != COND2)
> { sprintf(tmp, STR, OLD_VAL, NEW_VAL);}
> +TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences,
> "[x265] Reconfigure: ref=%d to %d", param->maxNumReferences,
> reconfiguredParam->maxNumReferences);
> +TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, "[x265]
> Reconfigure: max-tu-size=%d to %d", param->maxTUSize,
> reconfiguredParam->maxTUSize);
> +TOOLCMP(param->searchRange, reconfiguredParam->searchRange, "[x265]
> Reconfigure: merange=%d to %d", param->searchRange,
> reconfiguredParam->searchRange);
> +TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "[x265]
> Reconfigure: subme=%d to %d", param->subpelRefine,
> reconfiguredParam->subpelRefine);
> +TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "[x265]
> Reconfigure: rd=%d to %d", param->rdLevel, reconfiguredParam->rdLevel);
> +TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "[x265] Reconfigure:
> psy-rd=%.2lf to %.2lf", param->psyRd, reconfiguredParam->psyRd);
> +TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "[x265]
> Reconfigure: rdoq=%d to %d", param->rdoqLevel,
> reconfiguredParam->rdoqLevel);
> +TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "[x265]
> Reconfigure: psy-rdoq=%.2lf to %.2lf", param->psyRdoq,
> reconfiguredParam->psyRdoq);
> +TOOLCMP(param->noiseReductionIntra,
> reconfiguredParam->noiseReductionIntra, "[x265] Reconfigure: nr-intra=%d to
> %d", param->noiseReductionIntra, reconf
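
The idea in this patch, reporting each reconfigured option together with its old and new value, could be sketched roughly as below. This is not the x265 code: describeChange is a hypothetical helper, and the message format only imitates the strings in the patch. In the lines quoted above, the revised TOOLCMP formats into tmp but does not visibly pass the result to a logger; this sketch returns the string so the caller can decide where to print it.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns an empty string when nothing changed,
// otherwise a message of the form "Reconfigure: <name>=<old> to <new>".
static std::string describeChange(const char* name, int oldVal, int newVal)
{
    if (oldVal == newVal)
        return std::string();

    char tmp[64];
    // snprintf bounds the write, unlike the sprintf into a 40-byte buffer
    // used in the quoted macro.
    std::snprintf(tmp, sizeof(tmp), "Reconfigure: %s=%d to %d", name, oldVal, newVal);
    return std::string(tmp);
}
```

A caller would invoke this once per option and log only the non-empty results.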

[x265] Fwd: [PATCH] asm: avx2 code for weight_sp() 16bpp

2015-06-30 Thread Praveen Tiwari
-- Forwarded message --
From: aasaipr...@multicorewareinc.com
Date: Mon, Jun 29, 2015 at 4:51 PM
Subject: [x265] [PATCH] asm: avx2 code for weight_sp() 16bpp
To: x265-devel@videolan.org


# HG changeset patch
# User Aasaipriya Chandran aasaipr...@multicorewareinc.com
# Date 1435562395 -19800
#  Mon Jun 29 12:49:55 2015 +0530
# Node ID bebe4e496a432608cf0a9c495debd1970caa387e
# Parent  9feee64efa440c25f016d15ae982789e5393a77e
asm: avx2 code for weight_sp() 16bpp

 avx2: weight_sp  11.37x   4496.63   51139.20
 sse4: weight_sp   6.48x   8163.87   52870.36

diff -r 9feee64efa44 -r bebe4e496a43 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Fri Jun 26 15:29:51 2015
+0530
+++ b/source/common/x86/asm-primitives.cpp  Mon Jun 29 12:49:55 2015
+0530
@@ -1517,6 +1517,7 @@
 p.scale1D_128to64 = PFX(scale1D_128to64_avx2);
 p.scale2D_64to32 = PFX(scale2D_64to32_avx2);
 p.weight_pp = PFX(weight_pp_avx2);
+p.weight_sp = PFX(weight_sp_avx2);
 p.sign = PFX(calSign_avx2);

 p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_avx2);
diff -r 9feee64efa44 -r bebe4e496a43 source/common/x86/pixel-util8.asm
--- a/source/common/x86/pixel-util8.asm Fri Jun 26 15:29:51 2015 +0530
+++ b/source/common/x86/pixel-util8.asm Mon Jun 29 12:49:55 2015 +0530
@@ -1674,8 +1674,128 @@
 dec r5d
 jnz .loopH
 RET
-
-%if ARCH_X86_64
+%endif
+
+
+%if HIGH_BIT_DEPTH
+INIT_YMM avx2
+cglobal weight_sp, 6,7,9
+mova  m1, [pw_1023]
+mova  m2, [pw_1]
+mov   r6d, r7m


r7 is the 8th register (r0-r7), so this should be cglobal weight_sp, 6, 8, 9,
and the code should be ARCH_X86_64 only.



+shl   r6d, 16
+or    r6d, r6m
+vpbroadcastd  m3, r6d  ; m3 = [round w0]
+movd  xm4, r8m ; m4 = [shift]
+vpbroadcastd  m5, r9m  ; m5 = [offset]
+
+; correct row stride
+add   r3d, r3d
+add   r2d, r2d
+mov   r6d, r4d
+and   r6d, ~(mmsize / SIZEOF_PIXEL - 1)
+sub   r3d, r6d
+sub   r3d, r6d
+sub   r2d, r6d
+sub   r2d, r6d
+
+; generate partial width mask (MUST BE IN YMM0)
+mov   r6d, r4d
+and   r6d, (mmsize / SIZEOF_PIXEL - 1)
+movd  xm0, r6d
+pshuflw   m0, m0, 0
+punpcklqdq    m0, m0
+vinserti128   m0, m0, xm0, 1
+pcmpgtw   m0, [pw_0_15]
+
+.loopH:
+mov   r6d, r4d
+
+.loopW:
+movu  m6, [r0]
+paddw m6, [pw_2000]
+
+punpcklwd m7, m6, m2
+pmaddwd   m7, m3   ;(round w0)
+psrad m7, xm4  ;(shift)
+paddd m7, m5   ;(offset)
+
+punpckhwd m6, m2
+pmaddwd   m6, m3
+psrad m6, xm4
+paddd m6, m5
+
+packusdw  m7, m6
+pminuw    m7, m1
+
+sub   r6d, (mmsize / SIZEOF_PIXEL)
+jl    .width14
+movu  [r1], m7
+lea   r0, [r0 + mmsize]
+lea   r1, [r1 + mmsize]
+je    .nextH
+jmp   .loopW
+
+.width14:
+add   r6d, 16
+cmp   r6d, 14
+jl    .width12
+movu  [r1], xm7
+vextracti128  xm8, m7, 1
+movq  [r1 + 16], xm8
+pextrd    [r1 + 24], xm8, 2
+je    .nextH
+
+.width12:
+cmp   r6d, 12
+jl    .width10
+movu  [r1], xm7
+vextracti128  xm8, m7, 1
+movq  [r1 + 16], xm8
+je    .nextH
+
+.width10:
+cmp   r6d, 10
+jl    .width8
+movu  [r1], xm7
+vextracti128  xm8, m7, 1
+movd  [r1 + 16], xm8
+je    .nextH
+
+.width8:
+cmp   r6d, 8
+jl    .width6
+movu  [r1], xm7
+je    .nextH
+
+.width6
+cmp   r6d, 6
+jl    .width4
+movq  [r1], xm7
+pextrd    [r1 + 8], xm7, 2
+je    .nextH
+
+.width4:
+cmp   r6d, 4
+jl
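
As a reading aid for the AVX2 loop above, here is a scalar sketch of what each output pixel appears to compute, reconstructed from the instruction sequence (add pw_2000, multiply-add with the packed [round w0] pair, shift, add offset, saturate, clamp to pw_1023). The function name weightSpScalar and the parameter order are assumptions based on the register usage in the patch, not the project's reference implementation, and strides here are in elements rather than bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar model of the 16bpp weight_sp kernel as read from the asm above.
// Assumed argument order: src, dst, strides, width, height, w0, round,
// shift, offset.
static void weightSpScalar(const int16_t* src, uint16_t* dst,
                           intptr_t srcStride, intptr_t dstStride,
                           int width, int height,
                           int w0, int round, int shift, int offset)
{
    const int internalOffset = 0x2000;  // the pw_2000 constant added first
    const int pixelMax = 1023;          // the pw_1023 clamp (10-bit samples)

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            // pmaddwd on (src, 1) x (w0, round) gives src * w0 + round.
            // The shift is arithmetic, like psrad, on common compilers.
            int v = ((w0 * (src[x] + internalOffset) + round) >> shift) + offset;
            // packusdw saturates below at 0, pminuw caps at pixelMax.
            dst[x] = static_cast<uint16_t>(std::min(std::max(v, 0), pixelMax));
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

A scalar model like this is mainly useful for checking the partial-width tail paths (.width14 down to .width4) against known-good output.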

Re: [x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp

2015-06-29 Thread Praveen Tiwari
You may want to revisit the 8bpp code as well.

Regards,
Praveen

On Mon, Jun 29, 2015 at 11:24 AM, Rajesh Paulraj 
raj...@multicorewareinc.com wrote:

 We don't need to push this patch. I will improve sse version for the same
 size. We may not need avx2 code for this.(will make sure after rewriting
 sse2 code)

 On Mon, Jun 29, 2015 at 10:21 AM, Deepthi Nandakumar 
 deep...@multicorewareinc.com wrote:

 This does not build for HBD disabled

 On Fri, Jun 26, 2015 at 5:40 PM, Rajesh Paulraj 
 raj...@multicorewareinc.com wrote:

 yes. It looks like we need to optimize sse2 code. I will work on this.

 On Fri, Jun 26, 2015 at 5:31 PM, Praveen Tiwari 
 prav...@multicorewareinc.com wrote:




 -- Forwarded message --
 From: raj...@multicorewareinc.com
 Date: Fri, Jun 26, 2015 at 3:14 PM
 Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
 To: x265-devel@videolan.org


 # HG changeset patch
 # User Rajesh Paulraj raj...@multicorewareinc.com
 # Date 1435311076 -19800
 #  Fri Jun 26 15:01:16 2015 +0530
 # Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f
 # Parent  d64227e54233d1646c55bcb4b0b831e5340009ed
 asm: pixelavg_pp[8xN] avx2 code for 10bpp

 avx2:
 avg_pp[  8x4]  4.39x    145.09    636.75
 avg_pp[  8x8]  5.33x    215.27   1146.55
 avg_pp[ 8x16]  6.50x    336.88   2190.68
 avg_pp[ 8x32]  7.71x    579.86   4470.84

 sse2:
 avg_pp[  8x4]  2.31x    287.63    663.94
 avg_pp[  8x8]  3.26x    370.21   1205.26
 avg_pp[ 8x16]  3.99x    581.63   2323.25
 avg_pp[ 8x32]  4.78x    995.79   4755.58


 Basically, our pixel_avg_8xN macro is just SSE code (a simple syntax
 conversion for avx2 that does not use the 256-bit capability), so
 fundamentally there should be no major improvement in speed. But the
 improvements, 287.63c -> 145.09c, 370.21c -> 215.27c etc., are quite good.
 Does that mean the SSE2 code is not well optimized? Can you revisit the SSE
 code using this algorithm?



 diff -r d64227e54233 -r 956401f1a679
 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Jun 25 16:25:51
 2015 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Fri Jun 26 15:01:16
 2015 +0530
 @@ -1362,6 +1362,10 @@
  p.cu[BLOCK_32x32].intra_pred[33]=
 PFX(intra_pred_ang32_33_avx2);
  p.cu[BLOCK_32x32].intra_pred[34]=
 PFX(intra_pred_ang32_2_avx2);

 +p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2);
 +p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2);
 +p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2);
 +p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2);
  p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2);
  p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
  p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
 diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm
 --- a/source/common/x86/mc-a.asm  Thu Jun 25 16:25:51 2015 +0530
 +++ b/source/common/x86/mc-a.asm  Fri Jun 26 15:01:16 2015 +0530
 @@ -4490,6 +4490,88 @@
  RET
  %endif

 +%macro  pixel_avg_W8 0
 +movu    xm0, [r2]
 +movu    xm1, [r4]
 +pavgw   xm0, xm1
 +movu    [r0], xm0
 +movu    xm2, [r2 + r3]
 +movu    xm3, [r4 + r5]
 +pavgw   xm2, xm3
 +movu    [r0 + r1], xm2
 +
 +movu    xm0, [r2 + r3 * 2]
 +movu    xm1, [r4 + r5 * 2]
 +pavgw   xm0, xm1
 +movu    [r0 + r1 * 2], xm0
 +movu    xm2, [r2 + r6]
 +movu    xm3, [r4 + r7]
 +pavgw   xm2, xm3
 +movu    [r0 + r8], xm2
 +
 +lea r0, [r0 + 4 * r1]
 +lea r2, [r2 + 4 * r3]
 +lea r4, [r4 + 4 * r5]
 +%endmacro
 +

 +;---
 +;void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0,
 intptr_t sstride0, const pixel* src1, intptr_t sstride1, int)

 +;---
 +%if ARCH_X86_64
 +INIT_YMM avx2
 +cglobal pixel_avg_8x4, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3]
 +lea r7, [r5 * 3]
 +lea r8, [r1 * 3]
 +pixel_avg_W8
 +RET
 +
 +cglobal pixel_avg_8x8, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3]
 +lea r7, [r5 * 3]
 +lea r8, [r1 * 3]
 +mov r9d, 2
 +.loop
 +pixel_avg_W8
 +dec r9d
 +jnz .loop
 +RET
 +
 +cglobal pixel_avg_8x16, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3]
 +lea r7, [r5 * 3]
 +lea r8, [r1 * 3]
 +mov r9d, 4
 +.loop
 +pixel_avg_W8
 +dec r9d
 +jnz .loop
 +RET
 +
 +cglobal pixel_avg_8x32, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3

[x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp

2015-06-26 Thread Praveen Tiwari
-- Forwarded message --
From: raj...@multicorewareinc.com
Date: Fri, Jun 26, 2015 at 3:14 PM
Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
To: x265-devel@videolan.org


# HG changeset patch
# User Rajesh Paulraj raj...@multicorewareinc.com
# Date 1435311076 -19800
#  Fri Jun 26 15:01:16 2015 +0530
# Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f
# Parent  d64227e54233d1646c55bcb4b0b831e5340009ed
asm: pixelavg_pp[8xN] avx2 code for 10bpp

avx2:
avg_pp[  8x4]  4.39x    145.09    636.75
avg_pp[  8x8]  5.33x    215.27   1146.55
avg_pp[ 8x16]  6.50x    336.88   2190.68
avg_pp[ 8x32]  7.71x    579.86   4470.84

sse2:
avg_pp[  8x4]  2.31x    287.63    663.94
avg_pp[  8x8]  3.26x    370.21   1205.26
avg_pp[ 8x16]  3.99x    581.63   2323.25
avg_pp[ 8x32]  4.78x    995.79   4755.58

diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Thu Jun 25 16:25:51 2015
+0530
+++ b/source/common/x86/asm-primitives.cpp  Fri Jun 26 15:01:16 2015
+0530
@@ -1362,6 +1362,10 @@
 p.cu[BLOCK_32x32].intra_pred[33]=
PFX(intra_pred_ang32_33_avx2);
 p.cu[BLOCK_32x32].intra_pred[34]= PFX(intra_pred_ang32_2_avx2);

+p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2);
+p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2);
+p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2);
+p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2);
 p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2);
 p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
 p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm
--- a/source/common/x86/mc-a.asm  Thu Jun 25 16:25:51 2015 +0530
+++ b/source/common/x86/mc-a.asm  Fri Jun 26 15:01:16 2015 +0530
@@ -4490,6 +4490,88 @@
 RET
 %endif

+%macro  pixel_avg_W8 0
+movu    xm0, [r2]
+movu    xm1, [r4]
+pavgw   xm0, xm1
+movu    [r0], xm0
+movu    xm2, [r2 + r3]
+movu    xm3, [r4 + r5]
+pavgw   xm2, xm3
+movu    [r0 + r1], xm2
+
Your macro is not using the AVX2 capabilities. Did you check the performance
of processing two rows combined? It would cut your pavgw and movu
instructions in half. You can use vinserti128 to combine two rows at a time.

+movu    xm0, [r2 + r3 * 2]
+movu    xm1, [r4 + r5 * 2]
+pavgw   xm0, xm1
+movu    [r0 + r1 * 2], xm0
+movu    xm2, [r2 + r6]
+movu    xm3, [r4 + r7]
+pavgw   xm2, xm3
+movu    [r0 + r8], xm2
+
+lea r0, [r0 + 4 * r1]
+lea r2, [r2 + 4 * r3]
+lea r4, [r4 + 4 * r5]
+%endmacro
+
+;---
+;void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t
sstride0, const pixel* src1, intptr_t sstride1, int)
+;---
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_avg_8x4, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+pixel_avg_W8
+RET
+
+cglobal pixel_avg_8x8, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+mov r9d, 2
+.loop
+pixel_avg_W8
+dec r9d
+jnz .loop
+RET
+
+cglobal pixel_avg_8x16, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+mov r9d, 4
+.loop
+pixel_avg_W8
+dec r9d
+jnz .loop
+RET
+
+cglobal pixel_avg_8x32, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+mov r9d, 8
+.loop
+pixel_avg_W8
+dec r9d
+jnz .loop
+RET
+%endif
+
 %macro  pixel_avg_H4 0
 movu    m0, [r2]
 movu    m1, [r4]
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
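
For reference while reading the review comments in this thread, the operation that the pixel_avg_W8 macro vectorizes can be written in scalar form as below. pavgw computes a rounded unsigned average, (a + b + 1) >> 1, per 16-bit lane. The function name pixelAvgScalar and the explicit width and height parameters are assumptions for illustration (the asm fixes the block size per entry point), and strides here are in pixels, whereas the asm doubles them to byte strides.

```cpp
#include <cassert>
#include <cstdint>

typedef uint16_t pixel;  // 10bpp build: samples are stored in 16 bits

// Scalar model of pixelavg_pp: rounded average of two source blocks.
static void pixelAvgScalar(pixel* dst, intptr_t dstride,
                           const pixel* src0, intptr_t sstride0,
                           const pixel* src1, intptr_t sstride1,
                           int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = static_cast<pixel>((src0[x] + src1[x] + 1) >> 1);  // same rounding as pavgw

        dst += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}
```

An 8-pixel row of 16-bit samples is exactly 128 bits, which is why one xmm load covers a full row in the 8xN kernels and why combining two rows in a ymm register needs an extra lane insert on load and an extract on store.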


Re: [x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp

2015-06-26 Thread Praveen Tiwari
Ahh, the width is just 8 * 16 = 128 bits, so processing two rows at a time
would also need vextracti128 when storing, which goes to port 5, a bottleneck
port. pavgw is much cheaper than that. You may try combining the 16xN sizes.

Regards,
Praveen

On Fri, Jun 26, 2015 at 3:40 PM, Rajesh Paulraj raj...@multicorewareinc.com
 wrote:

 I tried using vinserti128, but that performed worse than this version,
 so I kept this one.

 On Fri, Jun 26, 2015 at 3:37 PM, Praveen Tiwari 
 prav...@multicorewareinc.com wrote:




 -- Forwarded message --
 From: raj...@multicorewareinc.com
 Date: Fri, Jun 26, 2015 at 3:14 PM
 Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
 To: x265-devel@videolan.org


 # HG changeset patch
 # User Rajesh Paulraj raj...@multicorewareinc.com
 # Date 1435311076 -19800
 #  Fri Jun 26 15:01:16 2015 +0530
 # Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f
 # Parent  d64227e54233d1646c55bcb4b0b831e5340009ed
 asm: pixelavg_pp[8xN] avx2 code for 10bpp

 avx2:
 avg_pp[  8x4]  4.39x    145.09    636.75
 avg_pp[  8x8]  5.33x    215.27   1146.55
 avg_pp[ 8x16]  6.50x    336.88   2190.68
 avg_pp[ 8x32]  7.71x    579.86   4470.84

 sse2:
 avg_pp[  8x4]  2.31x    287.63    663.94
 avg_pp[  8x8]  3.26x    370.21   1205.26
 avg_pp[ 8x16]  3.99x    581.63   2323.25
 avg_pp[ 8x32]  4.78x    995.79   4755.58

 diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Jun 25 16:25:51 2015
 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Fri Jun 26 15:01:16 2015
 +0530
 @@ -1362,6 +1362,10 @@
  p.cu[BLOCK_32x32].intra_pred[33]=
 PFX(intra_pred_ang32_33_avx2);
  p.cu[BLOCK_32x32].intra_pred[34]=
 PFX(intra_pred_ang32_2_avx2);

 +p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2);
 +p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2);
 +p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2);
 +p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2);
  p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2);
  p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
  p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
 diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm
 --- a/source/common/x86/mc-a.asm  Thu Jun 25 16:25:51 2015 +0530
 +++ b/source/common/x86/mc-a.asm  Fri Jun 26 15:01:16 2015 +0530
 @@ -4490,6 +4490,88 @@
  RET
  %endif

 +%macro  pixel_avg_W8 0
 +movu    xm0, [r2]
 +movu    xm1, [r4]
 +pavgw   xm0, xm1
 +movu    [r0], xm0
 +movu    xm2, [r2 + r3]
 +movu    xm3, [r4 + r5]
 +pavgw   xm2, xm3
 +movu    [r0 + r1], xm2
 +
  Your macro is not using the AVX2 capabilities. Did you check the
 performance of processing two rows combined? It would cut your pavgw and
 movu instructions in half. You can use vinserti128 to combine two rows at a
 time.

 +movu    xm0, [r2 + r3 * 2]
 +movu    xm1, [r4 + r5 * 2]
 +pavgw   xm0, xm1
 +movu    [r0 + r1 * 2], xm0
 +movu    xm2, [r2 + r6]
 +movu    xm3, [r4 + r7]
 +pavgw   xm2, xm3
 +movu    [r0 + r8], xm2
 +
 +lea r0, [r0 + 4 * r1]
 +lea r2, [r2 + 4 * r3]
 +lea r4, [r4 + 4 * r5]
 +%endmacro
 +

 +;---
 +;void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0,
 intptr_t sstride0, const pixel* src1, intptr_t sstride1, int)

 +;---
 +%if ARCH_X86_64
 +INIT_YMM avx2
 +cglobal pixel_avg_8x4, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3]
 +lea r7, [r5 * 3]
 +lea r8, [r1 * 3]
 +pixel_avg_W8
 +RET
 +
 +cglobal pixel_avg_8x8, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3]
 +lea r7, [r5 * 3]
 +lea r8, [r1 * 3]
 +mov r9d, 2
 +.loop
 +pixel_avg_W8
 +dec r9d
 +jnz .loop
 +RET
 +
 +cglobal pixel_avg_8x16, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3]
 +lea r7, [r5 * 3]
 +lea r8, [r1 * 3]
 +mov r9d, 4
 +.loop
 +pixel_avg_W8
 +dec r9d
 +jnz .loop
 +RET
 +
 +cglobal pixel_avg_8x32, 6,10,4
 +add r1d, r1d
 +add r3d, r3d
 +add r5d, r5d
 +lea r6, [r3 * 3]
 +lea r7, [r5 * 3]
 +lea r8, [r1 * 3]
 +mov r9d, 8
 +.loop
 +pixel_avg_W8
 +dec r9d
 +jnz .loop
 +RET
 +%endif
 +
  %macro  pixel_avg_H4 0
  movu    m0, [r2]
  movu    m1, [r4]
 ___
 x265-devel mailing list
 x265-devel@videolan.org

[x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp

2015-06-26 Thread Praveen Tiwari
-- Forwarded message --
From: raj...@multicorewareinc.com
Date: Fri, Jun 26, 2015 at 3:14 PM
Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
To: x265-devel@videolan.org


# HG changeset patch
# User Rajesh Paulraj raj...@multicorewareinc.com
# Date 1435311076 -19800
#  Fri Jun 26 15:01:16 2015 +0530
# Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f
# Parent  d64227e54233d1646c55bcb4b0b831e5340009ed
asm: pixelavg_pp[8xN] avx2 code for 10bpp

avx2:
avg_pp[  8x4]  4.39x    145.09    636.75
avg_pp[  8x8]  5.33x    215.27   1146.55
avg_pp[ 8x16]  6.50x    336.88   2190.68
avg_pp[ 8x32]  7.71x    579.86   4470.84

sse2:
avg_pp[  8x4]  2.31x    287.63    663.94
avg_pp[  8x8]  3.26x    370.21   1205.26
avg_pp[ 8x16]  3.99x    581.63   2323.25
avg_pp[ 8x32]  4.78x    995.79   4755.58


Basically, our pixel_avg_8xN macro is just SSE code (a simple syntax
conversion for avx2 that does not use the 256-bit capability), so
fundamentally there should be no major improvement in speed. But the
improvements, 287.63c -> 145.09c, 370.21c -> 215.27c etc., are quite good.
Does that mean the SSE2 code is not well optimized? Can you revisit the SSE
code using this algorithm?


diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Thu Jun 25 16:25:51 2015
+0530
+++ b/source/common/x86/asm-primitives.cpp  Fri Jun 26 15:01:16 2015
+0530
@@ -1362,6 +1362,10 @@
 p.cu[BLOCK_32x32].intra_pred[33]=
PFX(intra_pred_ang32_33_avx2);
 p.cu[BLOCK_32x32].intra_pred[34]= PFX(intra_pred_ang32_2_avx2);

+p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2);
+p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2);
+p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2);
+p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2);
 p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2);
 p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
 p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm
--- a/source/common/x86/mc-a.asm  Thu Jun 25 16:25:51 2015 +0530
+++ b/source/common/x86/mc-a.asm  Fri Jun 26 15:01:16 2015 +0530
@@ -4490,6 +4490,88 @@
 RET
 %endif

+%macro  pixel_avg_W8 0
+movu    xm0, [r2]
+movu    xm1, [r4]
+pavgw   xm0, xm1
+movu    [r0], xm0
+movu    xm2, [r2 + r3]
+movu    xm3, [r4 + r5]
+pavgw   xm2, xm3
+movu    [r0 + r1], xm2
+
+movu    xm0, [r2 + r3 * 2]
+movu    xm1, [r4 + r5 * 2]
+pavgw   xm0, xm1
+movu    [r0 + r1 * 2], xm0
+movu    xm2, [r2 + r6]
+movu    xm3, [r4 + r7]
+pavgw   xm2, xm3
+movu    [r0 + r8], xm2
+
+lea r0, [r0 + 4 * r1]
+lea r2, [r2 + 4 * r3]
+lea r4, [r4 + 4 * r5]
+%endmacro
+
+;---
+;void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t
sstride0, const pixel* src1, intptr_t sstride1, int)
+;---
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_avg_8x4, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+pixel_avg_W8
+RET
+
+cglobal pixel_avg_8x8, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+mov r9d, 2
+.loop
+pixel_avg_W8
+dec r9d
+jnz .loop
+RET
+
+cglobal pixel_avg_8x16, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+mov r9d, 4
+.loop
+pixel_avg_W8
+dec r9d
+jnz .loop
+RET
+
+cglobal pixel_avg_8x32, 6,10,4
+add r1d, r1d
+add r3d, r3d
+add r5d, r5d
+lea r6, [r3 * 3]
+lea r7, [r5 * 3]
+lea r8, [r1 * 3]
+mov r9d, 8
+.loop
+pixel_avg_W8
+dec r9d
+jnz .loop
+RET
+%endif
+
 %macro  pixel_avg_H4 0
 movu    m0, [r2]
 movu    m1, [r4]
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH 1 of 3] asm: intra_pred_ang32_33 improved by ~35% over SSE4

2015-03-26 Thread Praveen Tiwari
Please ignore the duplicate (second) patch; it was sent by mistake.

Regards,
Praveen

On Fri, Mar 27, 2015 at 10:41 AM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari prav...@multicorewareinc.com
 # Date 1427356204 -19800
 #  Thu Mar 26 13:20:04 2015 +0530
 # Branch stable
 # Node ID 24bdb3e594556ca6e12ee9dae58100a6bd115d2a
 # Parent  3d0f23cb0e58585e490362587022e67cfded143a
 asm: intra_pred_ang32_33 improved by ~35% over SSE4

 AVX2:
 intra_ang_32x32[33] 11.11x   2618.69 29084.27

 SSE4:
 intra_ang_32x32[33]  7.59x   4055.42 30792.64

 diff -r 3d0f23cb0e58 -r 24bdb3e59455 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Mar 26 15:09:51 2015
 -0500
 +++ b/source/common/x86/asm-primitives.cpp  Thu Mar 26 13:20:04 2015
 +0530
 @@ -1642,6 +1642,7 @@
  p.cu[BLOCK_32x32].intra_pred[30] = x265_intra_pred_ang32_30_avx2;
  p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2;
  p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2;
 +p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2;

  // copy_sp primitives
  p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
 diff -r 3d0f23cb0e58 -r 24bdb3e59455 source/common/x86/intrapred.h
 --- a/source/common/x86/intrapred.h Thu Mar 26 15:09:51 2015 -0500
 +++ b/source/common/x86/intrapred.h Thu Mar 26 13:20:04 2015 +0530
 @@ -212,6 +212,7 @@
  void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
 +void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
  void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
  void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
 diff -r 3d0f23cb0e58 -r 24bdb3e59455 source/common/x86/intrapred8.asm
 --- a/source/common/x86/intrapred8.asm  Thu Mar 26 15:09:51 2015 -0500
 +++ b/source/common/x86/intrapred8.asm  Thu Mar 26 13:20:04 2015 +0530
 @@ -376,6 +376,37 @@
 db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21,
 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
 db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0,
 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0

 +
 +ALIGN 32
 +c_ang32_mode_33:   db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,
 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
 +   db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12,
 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
 +   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18,
 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
 +   db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8,
 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
 +   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2,
 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
 +   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10,
 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
 +   db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 +   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22,
 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
 +   db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4,
 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
 +   db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8,
 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
 +   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14,
 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
 +   db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20,
 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
 +   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,
 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
 +   db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0,
 32, 0, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
 +   db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12,
 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
 +   db 18, 14, 18, 14, 18, 14

Re: [x265] [PATCH 2 of 3] asm: intra_pred_ang32_25 improved by ~53% over SSE4

2015-03-26 Thread Praveen Tiwari
Please ignore the duplicate (second) patch; it was sent by mistake.

Regards,
Praveen

On Fri, Mar 27, 2015 at 10:41 AM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari prav...@multicorewareinc.com
 # Date 142736 -19800
 #  Thu Mar 26 14:23:20 2015 +0530
 # Branch stable
 # Node ID 39c139322fde1f8c62545fd8bbed9cc8198e540c
 # Parent  24bdb3e594556ca6e12ee9dae58100a6bd115d2a
 asm: intra_pred_ang32_25 improved by ~53% over SSE4

 AVX2:
 intra_ang_32x32[25] 23.11x   1293.83 29904.12

 SSE4:
 intra_ang_32x32[25] 10.31x   2759.33 28451.26

 diff -r 24bdb3e59455 -r 39c139322fde source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Mar 26 13:20:04 2015
 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Thu Mar 26 14:23:20 2015
 +0530
 @@ -1643,6 +1643,7 @@
  p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2;
  p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2;
  p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2;
 +p.cu[BLOCK_32x32].intra_pred[25] = x265_intra_pred_ang32_25_avx2;

  // copy_sp primitives
  p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
 diff -r 24bdb3e59455 -r 39c139322fde source/common/x86/intrapred.h
 --- a/source/common/x86/intrapred.h Thu Mar 26 13:20:04 2015 +0530
 +++ b/source/common/x86/intrapred.h Thu Mar 26 14:23:20 2015 +0530
 @@ -213,6 +213,7 @@
  void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
 +void x265_intra_pred_ang32_25_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
  void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
  void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
 diff -r 24bdb3e59455 -r 39c139322fde source/common/x86/intrapred8.asm
 --- a/source/common/x86/intrapred8.asm  Thu Mar 26 13:20:04 2015 +0530
 +++ b/source/common/x86/intrapred8.asm  Thu Mar 26 14:23:20 2015 +0530
 @@ -407,6 +407,26 @@
 db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0,
 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0


 +
 +ALIGN 32
 +c_ang32_mode_25:   db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2,
 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
 +   db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,
 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
 +   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10,
 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
 +   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14,
 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 +   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18,
 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
 +   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22,
 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
 +   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,
 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
 +   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2,
 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 +   db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2,
 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
 +   db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,
 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
 +   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10,
 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
 +   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14,
 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 +   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18,
 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
 +   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22,
 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
 +   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,
 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
 +   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2,
 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 +
 +
  ALIGN 32
  ;; (blkSize - 1 - x)
  pw_planar4_0: dw 3,  2,  1

Re: [x265] [PATCH] asm: intra_pred_ang16_25

2015-03-12 Thread Praveen Tiwari
Please ignore this patch; I need to add performance data to the commit message.


Regards,
Praveen

On Thu, Mar 12, 2015 at 6:50 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari prav...@multicorewareinc.com
 # Date 1426165765 -19800
 # Node ID e4204ceeb011a009455cde620c346729d80ac822
 # Parent  d012e125bdb1299ba29b9c0680931e148981a42e
 asm: intra_pred_ang16_25

 diff -r d012e125bdb1 -r e4204ceeb011 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Mar 12 18:40:23 2015
 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Thu Mar 12 18:39:25 2015
 +0530
 @@ -1504,6 +1504,7 @@
  p.cu[BLOCK_8x8].intra_pred[12] = x265_intra_pred_ang8_12_avx2;
  p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
  p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
 +p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2;

  // copy_sp primitives
  p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
 diff -r d012e125bdb1 -r e4204ceeb011 source/common/x86/intrapred.h
 --- a/source/common/x86/intrapred.h Thu Mar 12 18:40:23 2015 +0530
 +++ b/source/common/x86/intrapred.h Thu Mar 12 18:39:25 2015 +0530
 @@ -182,6 +182,7 @@
  void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
 +void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const
 pixel* srcPix, int dirMode, int bFilter);
  void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
  void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
  void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel
 *filtPix, int bLuma);
 diff -r d012e125bdb1 -r e4204ceeb011 source/common/x86/intrapred8.asm
 --- a/source/common/x86/intrapred8.asm  Thu Mar 12 18:40:23 2015 +0530
 +++ b/source/common/x86/intrapred8.asm  Thu Mar 12 18:39:25 2015 +0530
 @@ -113,6 +113,17 @@
db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7,
 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29,
 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24

 +ALIGN 32
 +c_ang16_mode_25:  db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30,
 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
 +  db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26,
 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
 +  db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22,
 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12,
 20
 +  db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18,
 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
 16
 +  db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14,
 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20,
 12
 +  db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10,
 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
 +  db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,
 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
 +  db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2,
 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 +
 +ALIGN 32
  ;; (blkSize - 1 - x)
  pw_planar4_0: dw 3,  2,  1,  0,  3,  2,  1,  0
  pw_planar4_1: dw 3,  3,  3,  3,  3,  3,  3,  3
 @@ -10368,6 +10379,47 @@
  movhps  [r0 + r3], xm2
  RET

 +%macro INTRA_PRED_ANG16_MC0 3
 +pmaddubsw m3, m1, [r4 + %3 * mmsize]
 +pmulhrsw  m3, m0
 +pmaddubsw m4, m2, [r4 + %3 * mmsize]
 +pmulhrsw  m4, m0
 +packuswb  m3, m4
 +movu  [%1], xm3
 +vextracti128  xm4, m3, 1
 +movu  [%2], xm4
 +%endmacro
 +
 +%macro INTRA_PRED_ANG16_25 1
 +INTRA_PRED_ANG16_MC0 r0, r0 + r1, %1
 +INTRA_PRED_ANG16_MC0 r0 + 2 * r1, r0 + r3, (%1 + 1)
 +%endmacro
 +
 +INIT_YMM avx2
 +cglobal intra_pred_ang16_25, 3, 5, 5
 +mova  m0, [pw_1024]
 +
 +vbroadcasti128  m1, [r2]
 +pshufb  m1, [intra_pred_shuff_0_8]
 +vbroadcasti128  m2, [r2 + 8]
 +pshufb  m2, [intra_pred_shuff_0_8]
 +
 +lea   r3, [3 * r1]
 +lea   r4, [c_ang16_mode_25]
 +
 +INTRA_PRED_ANG16_25 0
 +
 +lea   r0, [r0 + 4 * r1]
 +INTRA_PRED_ANG16_25 2
 +
 +lea   r0, [r0 + 4 * r1]
 +INTRA_PRED_ANG16_25 4
 +
 +lea   r0, [r0 + 4 * r1

Re: [x265] [PATCH] asm-avx2: intra_pred, align const

2015-03-11 Thread Praveen Tiwari
Updated this patch on tip.


Thanks,
Praveen

On Tue, Mar 10, 2015 at 10:53 AM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari prav...@multicorewareinc.com
 # Date 1425964751 -19800
 # Node ID f97dfb483647d573cbcab9a4f007ac2aa89c9066
 # Parent  726fe4088f31710af174c18b1e26fdc759efb300
 asm-avx2: intra_pred, align const

 diff -r 726fe4088f31 -r f97dfb483647 source/common/x86/intrapred8.asm
 --- a/source/common/x86/intrapred8.asm  Mon Mar 09 19:21:25 2015 -0500
 +++ b/source/common/x86/intrapred8.asm  Tue Mar 10 10:49:11 2015 +0530
 @@ -26,6 +26,8 @@

  SECTION_RODATA 32

 +intra_pred_shuff_0_8:    times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
 6, 7, 7, 8
 +
  pb_0_8        times 8 db  0,  8
  pb_unpackbw1  times 2 db  1,  8,  2,  8,  3,  8,  4,  8
  pb_swap8: times 2 db  7,  6,  5,  4,  3,  2,  1,  0
 @@ -83,7 +85,6 @@
  c_ang8_7_20:  db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7,
 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
  c_ang8_1_14:  db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1,
 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
  c_ang8_27_8:  db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27,
 5, 27, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
 -c_ang8_src1_9_1_9:    db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8,
 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
  c_ang8_src2_10_2_10:  db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,
 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
  c_ang8_src3_11_3_11:  db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,
 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10

 @@ -9968,7 +9969,7 @@
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 17]

 -pshufb  m1, m0, [c_ang8_src1_9_1_9]
 +pshufb  m1, m0, [intra_pred_shuff_0_8]
  pshufb  m2, m0, [c_ang8_src2_10_2_10]
  pshufb  m4, m0, [c_ang8_src3_11_3_11]
  pshufb  m0, [c_ang8_src3_11_4_12]
 @@ -10013,7 +10014,7 @@
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 1]

 -pshufb  m1, m0, [c_ang8_src1_9_1_9]
 +pshufb  m1, m0, [intra_pred_shuff_0_8]
  pshufb  m2, m0, [c_ang8_src2_10_2_10]
  pshufb  m4, m0, [c_ang8_src3_11_3_11]
  pshufb  m0, [c_ang8_src3_11_4_12]
 @@ -10045,12 +10046,11 @@


  INIT_YMM avx2
 -cglobal intra_pred_ang8_9, 3, 5, 6
 +cglobal intra_pred_ang8_9, 3, 5, 5
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 17]
 -movu  m5, [c_ang8_src1_9_1_9]
 -
 -pshufb  m0, m5
 +
 +pshufb  m0, [intra_pred_shuff_0_8]

  lea   r4, [c_ang8_mode_27]
  pmaddubsw m1, m0, [r4]
 @@ -10089,12 +10089,11 @@
  RET

  INIT_YMM avx2
 -cglobal intra_pred_ang8_27, 3, 5, 6
 +cglobal intra_pred_ang8_27, 3, 5, 5
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 1]
 -movu  m5, [c_ang8_src1_9_1_9]
 -
 -pshufb  m0, m5
 +
 +pshufb  m0, [intra_pred_shuff_0_8]

  lea   r4, [c_ang8_mode_27]
  pmaddubsw m1, m0, [r4]
 @@ -10123,12 +10122,11 @@
  RET

  INIT_YMM avx2
 -cglobal intra_pred_ang8_25, 3, 5, 6
 +cglobal intra_pred_ang8_25, 3, 5, 5
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2]
 -mova  m5, [c_ang8_src1_9_1_9]
 -
 -pshufb  m0, m5
 +
 +pshufb  m0, [intra_pred_shuff_0_8]

  lea   r4, [c_ang8_mode_25]
  pmaddubsw m1, m0, [r4]
 @@ -10162,7 +10160,7 @@
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 17]

 -pshufb  m1, m0, [c_ang8_src1_9_1_9]
 +pshufb  m1, m0, [intra_pred_shuff_0_8]
  pshufb  m2, m0, [c_ang8_src1_9_2_10]
  pshufb  m4, m0, [c_ang8_src2_10_2_10]
  pshufb  m0, [c_ang8_src2_10_3_11]
 @@ -10207,7 +10205,7 @@
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 1]

 -pshufb  m1, m0, [c_ang8_src1_9_1_9]
 +pshufb  m1, m0, [intra_pred_shuff_0_8]
  pshufb  m2, m0, [c_ang8_src1_9_2_10]
  pshufb  m4, m0, [c_ang8_src2_10_2_10]
  pshufb  m0, [c_ang8_src2_10_3_11]
 @@ -10242,7 +10240,7 @@
  cglobal intra_pred_ang8_8, 3, 4, 6
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 17]
 -movu  m5, [c_ang8_src1_9_1_9]
 +mova  m5, [intra_pred_shuff_0_8]

  pshufb  m1, m0, m5
  pshufb  m2, m0, m5
 @@ -10288,7 +10286,7 @@
  cglobal intra_pred_ang8_28, 3, 4, 6
  mova  m3, [pw_1024]
  vbroadcasti128  m0, [r2 + 1]
 -movu  m5, [c_ang8_src1_9_1_9]
 +mova  m5, [intra_pred_shuff_0_8]

  pshufb  m1, m0, m5

[x265] Fwd: [PATCH] asm: avx2 code for sad[32x32] for 8bpp

2015-03-11 Thread Praveen Tiwari
-- Forwarded message --
From: sumala...@multicorewareinc.com
Date: Wed, Mar 11, 2015 at 2:24 PM
Subject: [x265] [PATCH] asm: avx2 code for sad[32x32] for 8bpp
To: x265-devel@videolan.org


# HG changeset patch
# User Sumalatha Polureddy sumala...@multicorewareinc.com
# Date 1426064050 -19800
# Node ID 01bfd365bf5f5317874b5c0315736ca76176f3df
# Parent  800f8ecd1e7393756f4bb58e536497162dc32150
asm: avx2 code for sad[32x32] for 8bpp

SSE3
sad[32x32]  230.81x  745.76  172131.92

AVX2
sad[32x32]  330.38x  496.68  164091.02


Are you comparing debug-mode performance numbers? 230.81x looks suspicious.

For the existing code I get:

SSE3
sad[32x32]  31.96x   770.39  24623.33

on an i7-4770K CPU. Please check the issue.
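
A quick way to sanity-check such figures, assuming the three columns are speedup, optimized cycles and C-reference cycles (the numbers above are consistent with that reading). This is only a sketch, and the helper names are made up for illustration:

```cpp
#include <cmath>

// Hypothetical helper: ratio of C-reference cycles to optimized cycles,
// which is how the first column appears to be derived (an assumption).
double speedup(double optimizedCycles, double referenceCycles)
{
    return referenceCycles / optimizedCycles;
}

// True when a reported speedup agrees with the two cycle counts
// to within the given tolerance.
bool consistent(double reported, double optimizedCycles,
                double referenceCycles, double tol = 0.01)
{
    return std::fabs(speedup(optimizedCycles, referenceCycles) - reported) < tol;
}
```

With the figures quoted here, the SSE3 cycle counts of the two runs are close (745.76 vs 770.39) while the reference counts differ by roughly 7x (172131.92 vs 24623.33), so the gap in the reported speedup comes from the C reference rather than from the asm.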


diff -r 800f8ecd1e73 -r 01bfd365bf5f source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Tue Mar 10 10:49:11 2015
+0530
+++ b/source/common/x86/asm-primitives.cpp  Wed Mar 11 14:24:10 2015
+0530
@@ -1442,6 +1442,8 @@
 p.pu[LUMA_8x16].satd  = x265_pixel_satd_8x16_avx2;
 p.pu[LUMA_8x8].satd   = x265_pixel_satd_8x8_avx2;

+p.pu[LUMA_32x32].sad = x265_pixel_sad_32x32_avx2;
+
 p.pu[LUMA_8x4].sad_x3 = x265_pixel_sad_x3_8x4_avx2;
 p.pu[LUMA_8x8].sad_x3 = x265_pixel_sad_x3_8x8_avx2;
 p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_avx2;
diff -r 800f8ecd1e73 -r 01bfd365bf5f source/common/x86/sad-a.asm
--- a/source/common/x86/sad-a.asm   Tue Mar 10 10:49:11 2015 +0530
+++ b/source/common/x86/sad-a.asm   Wed Mar 11 14:24:10 2015 +0530
@@ -3897,5 +3897,31 @@
 movq    [r6 + 8], xm1
 RET

+INIT_YMM avx2
+cglobal pixel_sad_32x32, 4,4,5
+xorps   m0, m0
+%assign x 0
+%rep 16
+movu   m1, [r0]   ; row 0 of pix0
+movu   m2, [r2]   ; row 0 of pix1
+movu   m3, [r0 + r1]  ; row 1 of pix0
+movu   m4, [r2 + r3]  ; row 1 of pix1
+
+psadbw m1, m2
+psadbw m3, m4
+paddd  m0, m1
+paddd  m0, m3
+%assign x x+1
+  %if x < 16
+lea r2, [r2 + 2 * r3]
+lea r0, [r0 + 2 * r1]
+  %endif
+%endrep
+vextracti128   xm1, m0, 1
+paddd  xm0, xm1
+pshufd xm1, xm0, 2
+paddd  xm0,xm1
+movd   eax, xm0
+RET

 %endif
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Fwd: [PATCH] asm-avx2: intra_pred_ang8_11

2015-03-11 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 2:33 AM
Subject: Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_11
To: Development for x265 x265-devel@videolan.org


It's right now, just one small problem:
[trans8_shuf] is used only two times; buffering it in a register gives the same
speed with more code size.

   Do you mean that instead of
mova    m0, [trans8_shuf]
vpermd  m1, m0, m1
vpermd  m4, m0, m4

we should use this?
vpermd  m1, [trans8_shuf], m1
vpermd  m4, [trans8_shuf], m4

Won't the assembler then load the constant twice internally rather than
just once? Can we depend on the assembler for this optimization? Besides, the
syntax of 'vpermd' does not allow this.
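
On the syntax point: the AVX2 form is VPERMD ymm1, ymm2, ymm3/m256, where ymm2 holds the indices and only the last (data) operand may be a memory reference, so the index table has to be in a register. A scalar C++ model of the instruction, as a sketch only; the type alias and the index values used in the note below are illustrative and not necessarily the contents of trans8_shuf:

```cpp
#include <array>
#include <cstdint>

using Vec8 = std::array<std::uint32_t, 8>;

// Scalar model of VPERMD dst, idx, src: each dword of the result is picked
// from src using the low three bits of the matching dword in idx.
Vec8 vpermd(const Vec8& idx, const Vec8& src)
{
    Vec8 dst{};
    for (int i = 0; i < 8; i++)
        dst[i] = src[idx[i] & 7];
    return dst;
}
```

For example, indices {0, 4, 1, 5, 2, 6, 3, 7} interleave the low and high halves of the source dwords.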

At 2015-03-10 13:58:50,prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari prav...@multicorewareinc.com
# Date 1425967049 -19800
# Node ID 810995b991eba3f7dcd9014db3b58a6b07723be3
# Parent  f97dfb483647d573cbcab9a4f007ac2aa89c9066
asm-avx2: intra_pred_ang8_11

diff -r f97dfb483647 -r 810995b991eb source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Tue Mar 10 10:49:11 2015 +0530
+++ b/source/common/x86/asm-primitives.cpp Tue Mar 10 11:27:29 2015 +0530
@@ -1496,6 +1496,7 @@
 p.cu[BLOCK_8x8].intra_pred[9] = x265_intra_pred_ang8_9_avx2;
 p.cu[BLOCK_8x8].intra_pred[27] = x265_intra_pred_ang8_27_avx2;
 p.cu[BLOCK_8x8].intra_pred[25] = x265_intra_pred_ang8_25_avx2;
+p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;

 // copy_sp primitives
 p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
diff -r f97dfb483647 -r 810995b991eb source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h Tue Mar 10 10:49:11 2015 +0530
+++ b/source/common/x86/intrapred.h Tue Mar 10 11:27:29 2015 +0530
@@ -179,6 +179,7 @@
 void x265_intra_pred_ang8_9_avx2(pixel* dst, intptr_t dstStride, const pixel* 
 srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_27_avx2(pixel* dst, intptr_t dstStride, const 
 pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_25_avx2(pixel* dst, intptr_t dstStride, const 
 pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const 
pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, 
 int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, 
 int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel 
 *filtPix, int bLuma);
diff -r f97dfb483647 -r 810995b991eb source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm Tue Mar 10 10:49:11 2015 +0530
+++ b/source/common/x86/intrapred8.asm Tue Mar 10 11:27:29 2015 +0530
@@ -10317,3 +10317,47 @@
 movhps  [r0 + 2 * r1], xm4
 movhps  [r0 + r3], xm2
 RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_11, 3, 5, 5
+mova  m3, [pw_1024]
+movu  xm1, [r2 + 16]
+pinsrb  xm1, [r2], 0
+pshufb  xm1, [intra_pred_shuff_0_8]
+vinserti128   m0, m1, xm1, 1
+
+lea   r4, [c_ang8_mode_25]
+pmaddubsw m1, m0, [r4]
+pmulhrsw  m1, m3
+pmaddubsw m2, m0, [r4 + mmsize]
+pmulhrsw  m2, m3
+pmaddubsw m4, m0, [r4 + 2 * mmsize]
+pmulhrsw  m4, m3
+pmaddubsw m0, [r4 + 3 * mmsize]
+pmulhrsw  m0, m3
+packuswb  m1, m2
+packuswb  m4, m0
+
+vperm2i128  m2, m1, m4, 0010b
+vperm2i128  m1, m1, m4, 00110001b
+punpcklbw m4, m2, m1
+punpckhbw m2, m1
+punpcklwd m1, m4, m2
+punpckhwd m4, m2
+mova  m0, [trans8_shuf]
+vpermd  m1, m0, m1
+vpermd  m4, m0, m4
+
+lea   r3, [3 * r1]
+movq  [r0], xm1
+movhps  [r0 + r1], xm1
+vextracti128  xm2, m1, 1
+movq  [r0 + 2 * r1], xm2
+movhps  [r0 + r3], xm2
+lea   r0, [r0 + 4 * r1]
+movq  [r0], xm4
+movhps  [r0 + r1], xm4
+vextracti128  xm2, m4, 1
+movq  [r0 + 2 * r1], xm2
+movhps  [r0 + r3], xm2
+RET


[x265] Fwd: [PATCH] asm: intra_pred_ang16_34

2015-03-10 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 6:32 AM
Subject: Re: [x265] [PATCH] asm: intra_pred_ang16_34
To: Development for x265 x265-devel@videolan.org


Same speed as the old version.

This AVX2 version of the asm code eliminates the following instructions, at the
cost of one vextracti128 instruction, compared to the SSSE3 version. It may not
have a visible impact in TestBench, but it seems worth pushing.
add     r2, 34
cmp     r3m, byte 34
cmove   r2, r4
movu    m1, [r2 + 16]


Regards,
Praveen


[x265] Fwd: [PATCH] asm: intra_pred_ang16_2

2015-03-10 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 6:32 AM
Subject: Re: [x265] [PATCH] asm: intra_pred_ang16_2
To: Development for x265 x265-devel@videolan.org


Same speed as the old version.

This AVX2 version of the asm code eliminates the following instructions, at the
cost of one vextracti128 instruction, compared to the SSSE3 version. It may not
have a visible impact in TestBench, but it seems worth pushing.
add     r2, 34
cmp     r3m, byte 34
cmove   r2, r4
movu    m1, [r2 + 16]

Regards,
Praveen


[x265] Fwd: [PATCH] asm: intra_pred_ang8_24 8bpp, improved 206.33c - 177.70c over SSE version

2015-03-10 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 6:09 AM
Subject: Re: [x265] [PATCH] asm: intra_pred_ang8_24 8bpp, improved 206.33c
- 177.70c over SSE version
To: Development for x265 x265-devel@videolan.org


+c_ang8_mode_24:   db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 
27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, \
+ 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 
17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, \

We'd better start a new 'db' on every line.

[Praveen] You will have to explain to me how that is better. What difference
does it make? Does it help to achieve more performance, or is it just
a matter of coding style?
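
For readers following the table itself: the rows of c_ang8_mode_24 look like (32 - frac, frac) weight pairs with frac = ((y + 1) * angle) & 31 and angle = -5, the HEVC angle for mode 24. A small C++ sketch under that assumption; the struct and function names are made up for illustration and are not x265 code:

```cpp
#include <cstdint>

// One weight pair as stored in the table: w0 applies to the first reference
// sample of a pair, w1 to the second.
struct WeightPair
{
    std::uint8_t w0;
    std::uint8_t w1;
};

// Weights for row y (0-based) of an angular mode with the given angle,
// using the fraction rule frac = ((y + 1) * angle) & 31.
WeightPair rowWeights(int angle, int y)
{
    int frac = ((y + 1) * angle) & 31;
    return { static_cast<std::uint8_t>(32 - frac), static_cast<std::uint8_t>(frac) };
}

// Weighted average of two neighbouring reference samples with rounding.
// pmaddubsw followed by pmulhrsw against 1024 yields the same
// (sum + 16) >> 5 result for these value ranges.
int blend(WeightPair w, int a, int b)
{
    return (w.w0 * a + w.w1 * b + 16) >> 5;
}
```

With angle = -5 the first rows come out as (5, 27), (10, 22), (15, 17), (20, 12), matching the lines quoted above, so the layout question (one long continued 'db' versus one 'db' per row) does not change the data.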


Regards,

Praveen


Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_25, (42.92x)

2015-03-09 Thread Praveen Tiwari
Updated the code with more optimization.

Regards,
Praveen



On Sat, Mar 7, 2015 at 3:31 AM, chen chenm...@163.com wrote:

 right


 At 2015-03-06 14:16:23,prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari prav...@multicorewareinc.com
 # Date 1425622433 -19800
 # Node ID b48efcbe1b196593d572dbbd4dd220f215f97321
 # Parent  fe9c058f216d4315ea995b09384aab2b1a28d1ec
 asm-avx2: intra_pred_ang8_25, (42.92x)
 
 intra_ang_8x8[25]   42.92x   210.61  9039.28
 




Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_11, (51.84x)

2015-03-09 Thread Praveen Tiwari
Updated the patch with more optimization.


Regards,
Praveen

On Sat, Mar 7, 2015 at 3:40 AM, chen chenm...@163.com wrote:

 right


 At 2015-03-06 15:50:38,prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari prav...@multicorewareinc.com
 # Date 1425628229 -19800
 # Node ID 25b01a20389e8e4297e004d500871263ca349d15
 # Parent  b48efcbe1b196593d572dbbd4dd220f215f97321
 asm-avx2: intra_pred_ang8_11, (51.84x)
 
 intra_ang_8x8[11]   51.84x   295.15  15301.57
 




Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_24, (40.05x)

2015-03-09 Thread Praveen Tiwari
Updated the patch as per suggestions.

Regards,
Praveen

On Sat, Mar 7, 2015 at 3:57 AM, chen chenm...@163.com wrote:




 At 2015-03-06 17:24:05,prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari prav...@multicorewareinc.com
 # Date 1425633836 -19800
 # Node ID 2da3a6431f94e1dce3c6bc739e7c457f90b12369
 # Parent  25b01a20389e8e4297e004d500871263ca349d15
 asm-avx2: intra_pred_ang8_24, (40.05x)
 
 intra_ang_8x8[24]   40.05x   244.28  9782.73
 
 diff -r 25b01a20389e -r 2da3a6431f94 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp   Fri Mar 06 13:20:29 2015 +0530
 +++ b/source/common/x86/asm-primitives.cpp   Fri Mar 06 14:53:56 2015 +0530
 @@ -1514,6 +1514,7 @@
  p.cu[BLOCK_8x8].intra_pred[27] = x265_intra_pred_ang8_27_avx2;
  p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
  p.cu[BLOCK_8x8].intra_pred[25] = x265_intra_pred_ang8_25_avx2;
 +p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
 
  // copy_sp primitives
  p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
 diff -r 25b01a20389e -r 2da3a6431f94 source/common/x86/intrapred.h
 --- a/source/common/x86/intrapred.h  Fri Mar 06 13:20:29 2015 +0530
 +++ b/source/common/x86/intrapred.h  Fri Mar 06 14:53:56 2015 +0530
 @@ -177,6 +177,7 @@
  void x265_intra_pred_ang8_27_avx2(pixel* dst, intptr_t dstStride, const 
  pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const 
  pixel* srcPix, int dirMode, int bFilter);
  void x265_intra_pred_ang8_25_avx2(pixel* dst, intptr_t dstStride, const 
  pixel* srcPix, int dirMode, int bFilter);
 +void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const 
 pixel* srcPix, int dirMode, int bFilter);
  void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel 
  *filtPix, int bLuma);
  void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel 
  *filtPix, int bLuma);
  void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel 
  *filtPix, int bLuma);
 diff -r 25b01a20389e -r 2da3a6431f94 source/common/x86/intrapred8.asm
 --- a/source/common/x86/intrapred8.asm   Fri Mar 06 13:20:29 2015 +0530
 +++ b/source/common/x86/intrapred8.asm   Fri Mar 06 14:53:56 2015 +0530
 @@ -105,6 +105,11 @@
   10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 
  10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 
  20, \
   14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 
  14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
  16
 
 +c_ang8_mode_24:   db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 
 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, \
 + 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 
 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 
 12, \
 + 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 
 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, \
 + 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 
 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
 +
  ;; (blkSize - 1 - x)
  pw_planar4_0: dw 3,  2,  1,  0,  3,  2,  1,  0
  pw_planar4_1: dw 3,  3,  3,  3,  3,  3,  3,  3
 @@ -33145,3 +33150,41 @@
  movhps  [r0 + 2 * r1], xm4
  movhps  [r0 + r3], xm2
  RET
 +
 +INIT_YMM avx2
 +cglobal intra_pred_ang8_24, 3, 5, 6
 +mova  m3, [pw_1024]
 +vbroadcasti128  m0, [r2]
 +movu  m5, [c_ang8_src1_9_1_9]
 unaligned?


 +
 +pshufb  m0, m5
 +
 +lea   r4, [c_ang8_mode_24]
 +pmaddubsw m1, m0, [r4]
 +pmulhrsw  m1, m3
 +pmaddubsw m2, m0, [r4 + mmsize]
 +pmulhrsw  m2, m3
 +pmaddubsw m4, m0, [r4 + 2 * mmsize]
 +pmulhrsw  m4, m3
 +pslldq  xm0, 2
 +pinsrb  xm0, [r2 + 16 + 6], 0
 +pinsrb  xm0, [r2 + 0], 1
 +vinserti128   m0, m0, xm0, 1
 +pmaddubsw m0, [r4 + 3 * mmsize]
 +pmulhrsw  m0, m3
 +packuswb  m1, m2
 +packuswb  m4, m0
 +
 +lea   r3, [3 * r1]
 +movq  [r0], xm1
 +vextracti128  xm2, m1, 1
 +movq  [r0 + r1], xm2
 +movhps  [r0 + 2 * r1], xm1
 +movhps  [r0 + r3], xm2
 +lea   r0, [r0 + 4 * r1]
 +movq  [r0], xm4
 +vextracti128  xm2, m4, 1
 +movq  [r0 + r1], xm2
 +movhps  [r0 + 2 * r1], xm4
 +movhps  [r0 + r3], xm2
 +RET

[x265] Fwd: Fwd: [PATCH Review Only] asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code

2015-02-26 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Thu, Feb 26, 2015 at 3:15 PM
Subject: Re: [x265] Fwd: [PATCH Review Only] asm-avx2: intra_pred_ang8_33,
improved 265.79c - 185.43c over sse4 asm code
To: Development for x265 x265-devel@videolan.org



At 2015-02-26 14:24:54,Praveen Tiwari prav...@multicorewareinc.com
wrote:


-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Feb 25, 2015 at 7:38 PM
Subject: Re: [x265] [PATCH Review Only] asm-avx2: intra_pred_ang8_33,
improved 265.79c - 185.43c over sse4 asm code
To: Development for x265 x265-devel@videolan.org






At 2015-02-25 16:52:00,prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari prav...@multicorewareinc.com
# Date 1424854196 -19800
# Node ID 177fe9372668b4824c291e967349664766688179
# Parent  02bac78bde961d60d180e59b5260fad93b98d9b4
asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code

intra_ang_8x8[33]   10.56x   185.43  1957.47

diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Wed Feb 25 13:46:58 2015 +0530
+++ b/source/common/x86/asm-primitives.cpp Wed Feb 25 14:19:56 2015 +0530
@@ -1813,6 +1813,7 @@

 // intra_pred functions
 p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2;
+p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2;
 }
 }
 #endif // if HIGH_BIT_DEPTH
diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h Wed Feb 25 13:46:58 2015 +0530
+++ b/source/common/x86/intrapred.h Wed Feb 25 14:19:56 2015 +0530
@@ -158,6 +158,7 @@

 #undef DECL_ANG
 void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* 
 srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const 
pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, 
 int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, 
 int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel 
 *filtPix, int bLuma);
diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm Wed Feb 25 13:46:58 2015 +0530
+++ b/source/common/x86/intrapred8.asm Wed Feb 25 14:19:56 2015 +0530
@@ -32087,3 +32087,39 @@
 movq  [r0 + 2 * r1], xm2
 movhps  [r0 + r3], xm2
 RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_33, 3,4,5
+movu  m3, [pw_1024]
Why is the constant load unaligned?

[Praveen] It seems there is an alignment issue here; mova crashes on an AVX2 machine.
[MC] It is a global constant. We may use ALIGN 32 before pw_1024 to avoid the
crash and get more performance.

[Praveen] Why does it need special care? Why don't the other constants need ALIGN 32?
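
As background to the question: an aligned 256-bit load (what mova becomes under INIT_YMM) faults unless the address is a multiple of 32, so any constant read that way has to sit on a 32-byte boundary. A minimal C++ sketch of that condition; kPw1024 and isAligned32 are illustrative names, and the array is only a stand-in for the asm constant:

```cpp
#include <cstdint>

// True when p sits on a 32-byte boundary, which an aligned 256-bit load needs.
bool isAligned32(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 31) == 0;
}

// Stand-in for pw_1024: sixteen 16-bit words of 1024, forced onto a
// 32-byte boundary with alignas (the C++ counterpart of ALIGN 32).
alignas(32) const std::int16_t kPw1024[16] = {
    1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,
    1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024
};
```

A constant that is only 16-byte aligned passes this check or fails it depending on where it happens to land, which would explain a crash that shows up for one constant and not for others.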

+vbroadcasti128  m0, [r2 + 1]
It is Exception Type 6; please check and confirm that it is compatible with
an unaligned address.

[Praveen] Sadly, most documents, including the Intel® Architecture Instruction
Set Extensions Programming Reference, don't discuss alignment for this
instruction, but I verified with the encoder and it seems to work fine
with an unaligned address too.
[MC] OK, if you assign an unaligned address (manually, in debug mode)
and it works fine, we may ignore it.


+
+pshufb  m1, m0, [c_ang8_src1_9_2_10]
+pshufb  m2, m0, [c_ang8_src3_11_4_12]
+pshufb  m4, m0, [c_ang8_src5_13_5_13]
+pshufb  m4, m0, [c_ang8_src5_13_5_13]
Why is this duplicated?

[Praveen] Yes, this is duplicate code; it has been fixed locally.


+pshufb  m0, [c_ang8_src6_14_7_15]
+
+pmaddubsw m1, [c_ang8_26_20]
+pmulhrsw  m1, m3
+pmaddubsw m2, [c_ang8_14_8]
+pmulhrsw  m2, m3
+pmaddubsw m4, [c_ang8_2_28]
+pmulhrsw  m4, m3
+pmaddubsw m0, [c_ang8_22_16]
+pmulhrsw  m0, m3
+packuswb  m1, m2
+packuswb  m4, m0
+
+lea   r3, [3 * r1]
+movq  [r0], xm1
+vextracti128  xm2, m1, 1
+movq  [r0 + r1], xm2
+movhps  [r0 + 2 * r1], xm1
+movhps  [r0 + r3], xm2
+lea   r0, [r0 + 4 * r1]
+movq  [r0], xm4
+vextracti128  xm2, m4, 1
+movq  [r0 + r1], xm2
+movhps  [r0 + 2 * r1], xm4
+movhps  [r0 + r3], xm2
+RET

[x265] Fwd: [PATCH Review Only] asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code

2015-02-25 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Feb 25, 2015 at 7:38 PM
Subject: Re: [x265] [PATCH Review Only] asm-avx2: intra_pred_ang8_33,
improved 265.79c - 185.43c over sse4 asm code
To: Development for x265 x265-devel@videolan.org






At 2015-02-25 16:52:00,prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari prav...@multicorewareinc.com
# Date 1424854196 -19800
# Node ID 177fe9372668b4824c291e967349664766688179
# Parent  02bac78bde961d60d180e59b5260fad93b98d9b4
asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code

intra_ang_8x8[33]   10.56x   185.43  1957.47

diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Wed Feb 25 13:46:58 2015 +0530
+++ b/source/common/x86/asm-primitives.cpp Wed Feb 25 14:19:56 2015 +0530
@@ -1813,6 +1813,7 @@

 // intra_pred functions
 p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2;
+p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2;
 }
 }
 #endif // if HIGH_BIT_DEPTH
diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h Wed Feb 25 13:46:58 2015 +0530
+++ b/source/common/x86/intrapred.h Wed Feb 25 14:19:56 2015 +0530
@@ -158,6 +158,7 @@

 #undef DECL_ANG
 void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* 
 srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const 
pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, 
 int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, 
 int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel 
 *filtPix, int bLuma);
diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm Wed Feb 25 13:46:58 2015 +0530
+++ b/source/common/x86/intrapred8.asm Wed Feb 25 14:19:56 2015 +0530
@@ -32087,3 +32087,39 @@
 movq  [r0 + 2 * r1], xm2
 movhps  [r0 + r3], xm2
 RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_33, 3,4,5
+movu  m3, [pw_1024]
Why is the constant load unaligned?

[Praveen] It seems there is an alignment issue here; mova crashes on an AVX2 machine.

+vbroadcasti128  m0, [r2 + 1]
It is Exception Type 6; please check and confirm that it is compatible with
an unaligned address.

[Praveen] Sadly, most documents, including the Intel® Architecture Instruction
Set Extensions Programming Reference, don't discuss alignment for this
instruction, but I verified with the encoder and it seems to work fine
with an unaligned address too.
+
+pshufbm1, m0, [c_ang8_src1_9_2_10]
+pshufbm2, m0, [c_ang8_src3_11_4_12]
+pshufbm4, m0, [c_ang8_src5_13_5_13]
+pshufbm4, m0, [c_ang8_src5_13_5_13]
Why is this line duplicated?

[Praveen] Yes, this is duplicate code; it has been fixed locally.


+pshufbm0, [c_ang8_src6_14_7_15]
+
+pmaddubsw m1, [c_ang8_26_20]
+pmulhrsw  m1, m3
+pmaddubsw m2, [c_ang8_14_8]
+pmulhrsw  m2, m3
+pmaddubsw m4, [c_ang8_2_28]
+pmulhrsw  m4, m3
+pmaddubsw m0, [c_ang8_22_16]
+pmulhrsw  m0, m3
+packuswb  m1, m2
+packuswb  m4, m0
+
+lea   r3, [3 * r1]
+movq  [r0], xm1
+vextracti128  xm2, m1, 1
+movq  [r0 + r1], xm2
+movhps[r0 + 2 * r1], xm1
+movhps[r0 + r3], xm2
+lea   r0, [r0 + 4 * r1]
+movq  [r0], xm4
+vextracti128  xm2, m4, 1
+movq  [r0 + r1], xm2
+movhps[r0 + 2 * r1], xm4
+movhps[r0 + r3], xm2
+RET
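
A reading aid for the routine above, not part of the patch: pairing pmaddubsw with pmulhrsw against pw_1024 is a common way to get a rounded shift by 5. The scalar sketch below models one 16-bit lane. The function names are hypothetical, and the assumption that each coefficient pair is (32 - f, f) is inferred from the c_ang8_* table names rather than stated in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one pmulhrsw lane: ((a * b) + 0x4000) >> 15.
int16_t pmulhrswLane(int16_t a, int16_t b)
{
    int32_t product = static_cast<int32_t>(a) * b;
    return static_cast<int16_t>((product + 0x4000) >> 15);
}

// Two-tap interpolation of one pixel, assuming the weights are (32 - fraction, fraction).
int interpolate(int left, int right, int fraction)
{
    // pmaddubsw step: weighted sum of two neighbouring reference pixels.
    int16_t sum = static_cast<int16_t>(left * (32 - fraction) + right * fraction);
    // pmulhrsw with 1024 step: equivalent to (sum + 16) >> 5 for these sums.
    return pmulhrswLane(sum, 1024);
}
```

For non-negative sums up to 255 * 32, pmulhrswLane(sum, 1024) matches (sum + 16) >> 5, which is why a single multiply can stand in for an add plus a shift.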
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel




[x265] Fwd: [PATCH] blockcopy_pp_12x32: SSE2 asm code optimization

2015-02-06 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Thu, Feb 5, 2015 at 5:55 PM
Subject: Re: [x265] [PATCH] blockcopy_pp_12x32: SSE2 asm code optimization
To: Development for x265 x265-devel@videolan.org


 this code is right,
but could you try using general register moves (rN, rNd) in x64 mode?

I applied your idea of using general registers as the buffer in x64 for 4x8
(easy to test with), but surprisingly, using SIMD registers is faster. Here
are the code and performance numbers:
copy_pp[  4x8]  2.67x    139.98    374.18    [using general register move (rN, rNd)]
copy_pp[  4x8]  3.34x    109.60    366.35    [SIMD registers as buffer]

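Code [using general register move (rN, rNd)]:

;-----------------------------------------------------------------------------
; void blockcopy_pp_4x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
;-----------------------------------------------------------------------------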
 INIT_XMM sse2
 cglobal blockcopy_pp_4x8, 4, 10, 0

 lea r4,[3 * r1]
 lea r5,[3 * r3]

 mov r6d, [r2]
 mov r7d, [r2 + r3]
 mov r8d, [r2 + 2 * r3]
 mov r9d, [r2 + r5]

 mov [r0],  r6d
 mov [r0 + r1], r7d
 mov [r0 + 2 * r1], r8d
 mov [r0 + r4], r9d

 lea  r2, [r2 + 4 * r3]
 mov r6d, [r2]
 mov r7d, [r2 + r3]
 mov r8d, [r2 + 2 * r3]
 mov r9d, [r2 + r5]

 lea  r0,[r0 + 4 * r1]
 mov [r0],  r6d
 mov [r0 + r1], r7d
 mov [r0 + 2 * r1], r8d
 mov [r0 + r4], r9d
RET

Code [SIMD registers as buffer]:

 INIT_XMM sse2
cglobal blockcopy_pp_4x8, 4, 6, 4

lea r4,[3 * r1]
lea r5,[3 * r3]

movd m0, [r2]
movd m1, [r2 + r3]
movd m2, [r2 + 2 * r3]
movd m3, [r2 + r5]

movd [r0],  m0
movd [r0 + r1], m1
movd [r0 + 2 * r1], m2
movd [r0 + r4], m3

lea  r2, [r2 + 4 * r3]
movd m0, [r2]
movd m1, [r2 + r3]
movd m2, [r2 + 2 * r3]
movd m3, [r2 + r5]

lea  r0,[r0 + 4 * r1]
movd [r0],  m0
movd [r0 + r1], m1
movd [r0 + 2 * r1], m2
movd [r0 + r4], m3
RET
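
For comparison, here is a minimal scalar sketch of what both listings above compute. It is an illustration under assumptions (8-bit pixels, a hypothetical function name), not the project's C reference. Each row of a 4x8 block is a single 4-byte move, which the two variants perform with either a 32-bit general register or the low dword of an XMM register.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef uint8_t pixel; // assumption: 8-bit build, one byte per pixel

// Copy a width x height block row by row; strides are in pixels.
void blockcopy_pp_ref(pixel* dst, intptr_t dstStride,
                      const pixel* src, intptr_t srcStride,
                      int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        // For width == 4 this is exactly one 4-byte load and one 4-byte store.
        std::memcpy(dst, src, width * sizeof(pixel));
        dst += dstStride;
        src += srcStride;
    }
}
```

Calling blockcopy_pp_ref(dst, dstStride, src, srcStride, 4, 8) touches eight rows of four bytes each and leaves every byte outside the block unchanged.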
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] blockfill_s_8x8 sse2 asm code optimization

2015-02-02 Thread Praveen Tiwari
Sent updated patch. Thanks.

Regards,
Praveen

On Mon, Feb 2, 2015 at 4:39 PM, chen chenm...@163.com wrote:






 At 2015-02-02 16:55:16,prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari
 # Date 1422867249 -19800
 # Branch stable
 # Node ID 2618352a21d5917ee8c1f79bcc159e858dd19daa
 # Parent  e2c958ff874e2bf8992ba22605e993530e8a2d8c
 blockfill_s_8x8 sse2 asm code optimization
 
 improved, 100.04c - 90.05c
 
 diff -r e2c958ff874e -r 2618352a21d5 source/common/x86/blockcopy8.asm
 --- a/source/common/x86/blockcopy8.asm   Sat Jan 31 13:48:34 2015 -0600
 +++ b/source/common/x86/blockcopy8.asm   Mon Feb 02 14:24:09 2015 +0530
 @@ -1748,9 +1748,10 @@
  ; void blockfill_s_8x8(int16_t* dst, intptr_t dstride, int16_t val)
  ;-
  INIT_XMM sse2
 -cglobal blockfill_s_8x8, 3, 3, 1, dst, dstStride, val
 +cglobal blockfill_s_8x8, 3, 4, 1, dst, dstStride, val
 
  add     r1, r1
 +lea     r3, [3 * r1]
 
  movd    m0, r2d
  pshuflw m0, m0, 0
 @@ -1760,17 +1761,13 @@
  movu    [r0 + r1], m0
  movu    [r0 + 2 * r1], m0
 
 -lea     r0, [r0 + 2 * r1]
 +movu    [r0 + r3], m0
 +movu    [r0 + 4 * r1], m0
 +
 +lea     r0, [r0 + 4 * r1]
 Swap the LEA and the movu above it; you will get fewer bytes of binary code.


  movu    [r0 + r1], m0
  movu    [r0 + 2 * r1], m0
 -
 -lea     r0, [r0 + 2 * r1]
 -movu    [r0 + r1], m0
 -movu    [r0 + 2 * r1], m0
 -
 -lea     r0, [r0 + 2 * r1]
 -movu    [r0 + r1], m0
 -
 +movu    [r0 + r3], m0
  RET
 
  ;-


___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] add testbench for psyCost_ss and asm for psyCost_ss_4x4: improve 1989c-515c

2015-01-09 Thread Praveen Tiwari
If it fails only for 64x64, then it is almost certainly a range issue when
we finally accumulate the sum of all the SAD calculations. It is more
apparent with 64x64 because more values are accumulated there. An algorithm
issue would have shown up in the other partitions as well.
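
To make the range argument concrete, here is a small sketch of the worst-case arithmetic. It only illustrates why larger partitions stress an accumulator more; it does not diagnose the actual failure. The helper names and the bound of (1 << bitDepth) - 1 on each sample magnitude are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Largest possible sum of absolute values over a size x size block when each
// sample magnitude is bounded by (1 << bitDepth) - 1.
int64_t worstCaseAbsSum(int size, int bitDepth)
{
    int64_t maxSample = (int64_t(1) << bitDepth) - 1;
    return int64_t(size) * size * maxSample;
}

// True if that worst case still fits in an unsigned accumulator of the given width.
bool fitsInAccumulator(int size, int bitDepth, int accumulatorBits)
{
    int64_t limit = (int64_t(1) << accumulatorBits) - 1;
    return worstCaseAbsSum(size, bitDepth) <= limit;
}
```

With 10-bit input, a 4x4 block sums to at most 16368, while a 64x64 block can reach 4190208, so an intermediate that is comfortable for small partitions can overflow on the largest one.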

Regards,
Praveen

On Fri, Jan 9, 2015 at 4:05 PM, Steve Borho st...@borho.org wrote:

 On 01/09, Divya Manivannan wrote:
  # HG changeset patch
  # User Divya Manivannan di...@multicorewareinc.com
  # Date 1420790181 -19800
  #  Fri Jan 09 13:26:21 2015 +0530
  # Node ID 0f4b677cea64254d0b8f77ccc84c785bf832698d
  # Parent  c99e1a309bd1690be9a0a407050d97d95ccab05a
  add testbench for psyCost_ss and asm for psyCost_ss_4x4: improve
 1989c-515c

 I get an error with a 10bit build:

 steve@zeppelin ./test/TestBench
 Using random seed 54AFAEC9 16bpp
 Testing primitives: SSE2
 Testing primitives: SSE3
 Testing primitives: SSSE3
 Testing primitives: SSE4

 psy_cost_ss[64x64] failed!

  diff -r c99e1a309bd1 -r 0f4b677cea64 source/common/x86/asm-primitives.cpp
  --- a/source/common/x86/asm-primitives.cpp    Fri Jan 09 13:09:39 2015 +0530
  +++ b/source/common/x86/asm-primitives.cpp    Fri Jan 09 13:26:21 2015 +0530
  @@ -1430,6 +1430,7 @@
   p.psy_cost_pp[BLOCK_32x32] = x265_psyCost_pp_32x32_sse4;
   p.psy_cost_pp[BLOCK_64x64] = x265_psyCost_pp_64x64_sse4;
   #endif
  +p.psy_cost_ss[BLOCK_4x4] = x265_psyCost_ss_4x4_sse4;
   }
   if (cpuMask & X265_CPU_XOP)
   {
  @@ -1716,6 +1717,7 @@
   p.psy_cost_pp[BLOCK_32x32] = x265_psyCost_pp_32x32_sse4;
   p.psy_cost_pp[BLOCK_64x64] = x265_psyCost_pp_64x64_sse4;
   #endif
  +p.psy_cost_ss[BLOCK_4x4] = x265_psyCost_ss_4x4_sse4;
   }
   if (cpuMask & X265_CPU_AVX)
   {
  diff -r c99e1a309bd1 -r 0f4b677cea64 source/common/x86/pixel-a.asm
  --- a/source/common/x86/pixel-a.asm   Fri Jan 09 13:09:39 2015 +0530
  +++ b/source/common/x86/pixel-a.asm   Fri Jan 09 13:26:21 2015 +0530
  @@ -7569,3 +7569,157 @@
   RET
   %endif ; HIGH_BIT_DEPTH
   %endif
  +
 
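  +;-----------------------------------------------------------------------------
  +;int psyCost_ss(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride)
  +;-----------------------------------------------------------------------------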
  +INIT_XMM sse4
  +cglobal psyCost_ss_4x4, 4, 5, 8
  +
  +add r1, r1
  +lea r4, [3 * r1]
  +movddup m0, [r0]
  +movddup m1, [r0 + r1]
  +movddup m2, [r0 + r1 * 2]
  +movddup m3, [r0 + r4]
  +
  +pabsw   m4, m0
  +pabsw   m5, m1
  +paddw   m5, m4
  +pabsw   m4, m2
  +paddw   m5, m4
  +pabsw   m4, m3
  +paddw   m5, m4
  +pmaddwd m5, [pw_1]
  +psrldq  m4, m5, 4
  +paddd   m5, m4
  +psrld   m6, m5, 2
  +
  +movam4, [hmul_8w]
  +pmaddwd m0, m4
  +pmaddwd m1, m4
  +pmaddwd m2, m4
  +pmaddwd m3, m4
  +
  +psrldq  m4, m0, 4
  +psubd   m5, m0, m4
  +paddd   m0, m4
  +shufps  m0, m5, 10001000b
  +
  +psrldq  m4, m1, 4
  +psubd   m5, m1, m4
  +paddd   m1, m4
  +shufps  m1, m5, 10001000b
  +
  +psrldq  m4, m2, 4
  +psubd   m5, m2, m4
  +paddd   m2, m4
  +shufps  m2, m5, 10001000b
  +
  +psrldq  m4, m3, 4
  +psubd   m5, m3, m4
  +paddd   m3, m4
  +shufps  m3, m5, 10001000b
  +
  +movam4, m0
  +paddd   m0, m1
  +psubd   m1, m4
  +movam4, m2
  +paddd   m2, m3
  +psubd   m3, m4
  +movam4, m0
  +paddd   m0, m2
  +psubd   m2, m4
  +movam4, m1
  +paddd   m1, m3
  +psubd   m3, m4
  +
  +pabsd   m0, m0
  +pabsd   m2, m2
  +pabsd   m1, m1
  +pabsd   m3, m3
  +paddd   m0, m2
  +paddd   m1, m3
  +paddd   m0, m1
  +movhlps m1, m0
  +paddd   m0, m1
  +psrldq  m1, m0, 4
  +paddd   m0, m1
  +psrld   m0, 1
  +psubd   m7, m0, m6
  +
  +add r3, r3
  +lea r4, [3 * r3]
  +movddup m0, [r2]
  +movddup m1, [r2 + r3]
  +movddup m2, [r2 + r3 * 2]
  +movddup m3, [r2 + r4]
  +
  +pabsw   m4, m0
  +pabsw   m5, m1
  +paddw   m5, m4
  +pabsw   m4, m2
  +paddw   m5, m4
  +pabsw   m4, m3
  +paddw   m5, m4
  +pmaddwd m5, [pw_1]
  +psrldq  m4, m5, 4
  +   

Re: [x265] [PATCH] asm: luma_vpp[16x32, 16x64] in avx2: improve 3875c-2488c, 7499c-4915c

2014-11-20 Thread Praveen Tiwari
A table named tab_LumaCoeffVer_32 is already defined in this file;
redefining it here will cause a build error. Please verify and update the patch.

On Thu, Nov 20, 2014 at 2:49 PM, Divya Manivannan 
di...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Divya Manivannan di...@multicorewareinc.com
 # Date 1416475133 -19800
 #  Thu Nov 20 14:48:53 2014 +0530
 # Node ID 49c99a85531358e1b0624edd8082b6945d4e187e
 # Parent  3649fabf90d348c51d7e155989d1bf629ec27f6e
 asm: luma_vpp[16x32, 16x64] in avx2: improve 3875c-2488c, 7499c-4915c

 diff -r 3649fabf90d3 -r 49c99a855313 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Nov 20 14:27:53 2014 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Thu Nov 20 14:48:53 2014 +0530
 @@ -1798,6 +1798,8 @@
  p.transpose[BLOCK_16x16] = x265_transpose16_avx2;
  p.transpose[BLOCK_32x32] = x265_transpose32_avx2;
  p.transpose[BLOCK_64x64] = x265_transpose64_avx2;
 +p.luma_vpp[LUMA_16x32] = x265_interp_8tap_vert_pp_16x32_avx2;
 +p.luma_vpp[LUMA_16x64] = x265_interp_8tap_vert_pp_16x64_avx2;
  #endif
  p.luma_hpp[LUMA_4x4] = x265_interp_8tap_horiz_pp_4x4_avx2;
  p.luma_vpp[LUMA_4x4] = x265_interp_8tap_vert_pp_4x4_avx2;
 diff -r 3649fabf90d3 -r 49c99a855313 source/common/x86/ipfilter8.asm
 --- a/source/common/x86/ipfilter8.asm   Thu Nov 20 14:27:53 2014 +0530
 +++ b/source/common/x86/ipfilter8.asm   Thu Nov 20 14:48:53 2014 +0530
 @@ -122,6 +122,27 @@
times 8 db 58, -10
times 8 db 4, -1

 +ALIGN 32
 +tab_LumaCoeffVer_32: times 16 db 0, 0
 + times 16 db 0, 64
 + times 16 db 0, 0
 + times 16 db 0, 0
 +
 + times 16 db -1, 4
 + times 16 db -10, 58
 + times 16 db 17, -5
 + times 16 db 1, 0
 +
 + times 16 db -1, 4
 + times 16 db -11, 40
 + times 16 db 40, -11
 + times 16 db 4, -1
 +
 + times 16 db 0, 1
 + times 16 db -5, 17
 + times 16 db 58, -10
 + times 16 db 4, -1
 +
  tab_c_64_n64:   times 8 db 64, -64

  const interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15
 @@ -3755,6 +3776,312 @@

  
 ;-
  FILTER_VER_LUMA_12xN 12, 16, ps

 +%macro FILTER_VER_LUMA_AVX2_16xN 2
 +INIT_YMM avx2
 +%if ARCH_X86_64 == 1
 +cglobal interp_8tap_vert_pp_%1x%2, 4, 7, 15
 +mov r4d, r4m
 +shl r4d, 7
 +
 +%ifdef PIC
 +lea r5, [tab_LumaCoeffVer_32]
 +add r5, r4
 +%else
 +lea r5, [tab_LumaCoeffVer_32 + r4]
 +%endif
 +
 +lea r4, [r1 * 3]
 +sub r0, r4
 +lea r6, [r1 * 4]
 +movam14, [pw_512]
 +mov word [rsp], %2 / 16
 +
 +.loop:
 +movuxm0, [r0]   ; m0 = row 0
 +movuxm1, [r0 + r1]  ; m1 = row 1
 +punpckhbw   xm2, xm0, xm1
 +punpcklbw   xm0, xm1
 +vinserti128 m0, m0, xm2, 1
 +pmaddubsw   m0, [r5]
 +movuxm2, [r0 + r1 * 2]  ; m2 = row 2
 +punpckhbw   xm3, xm1, xm2
 +punpcklbw   xm1, xm2
 +vinserti128 m1, m1, xm3, 1
 +pmaddubsw   m1, [r5]
 +movuxm3, [r0 + r4]  ; m3 = row 3
 +punpckhbw   xm4, xm2, xm3
 +punpcklbw   xm2, xm3
 +vinserti128 m2, m2, xm4, 1
 +pmaddubsw   m4, m2, [r5 + 1 * mmsize]
 +paddw   m0, m4
 +pmaddubsw   m2, [r5]
 +lea r0, [r0 + r1 * 4]
 +movuxm4, [r0]   ; m4 = row 4
 +punpckhbw   xm5, xm3, xm4
 +punpcklbw   xm3, xm4
 +vinserti128 m3, m3, xm5, 1
 +pmaddubsw   m5, m3, [r5 + 1 * mmsize]
 +paddw   m1, m5
 +pmaddubsw   m3, [r5]
 +movuxm5, [r0 + r1]  ; m5 = row 5
 +punpckhbw   xm6, xm4, xm5
 +punpcklbw   xm4, xm5
 +vinserti128 m4, m4, xm6, 1
 +pmaddubsw   m6, m4, [r5 + 2 * mmsize]
 +paddw   m0, m6
 +pmaddubsw   m6, m4, [r5 + 1 * mmsize]
 +paddw   m2, m6
 +pmaddubsw   m4, [r5]
 +movuxm6, [r0 + r1 * 2]  ; m6 = row 6
 +punpckhbw   xm7, xm5, xm6
 +punpcklbw   xm5, xm6
 +vinserti128 m5, m5, xm7, 1
 +pmaddubsw   m7, m5, [r5 + 2 * mmsize]
 +paddw   m1, m7
 +pmaddubsw   m7, m5, [r5 + 1 * mmsize]
 +paddw   m3, m7
 +pmaddubsw   m5, [r5]
 +movuxm7, [r0 + r4]  ; m7 = row 7
 +punpckhbw   xm8, xm6, xm7
 +punpcklbw 

[x265] Fwd: [PATCH] refactorizaton of the transform/quant path

2014-11-19 Thread Praveen Tiwari
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Tue, Nov 18, 2014 at 11:31 PM
Subject: Re: [x265] [PATCH] refactorizaton of the transform/quant path
To: Development for x265 x265-devel@videolan.org


On 11/18, prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari
 # Date 1416299427 -19800
 # Node ID 706fa4af912bc1610478de8f09a651ae3e58624c
 # Parent  2f0062f0791b822fa932712a56e6b0a14e976d91
 refactorizaton of the transform/quant path.
 This patch scales down the DCT/IDCT coefficients from int32_t to int16_t,
 as they can be accommodated in int16_t without introducing any encode error.
 This allows us to clean up many intermediate DCT/IDCT buffers and to improve
 encode efficiency for different CLI options, including noise reduction, by
 reducing data movement operations and fitting more coefficients into a
 single register for SIMD operations. This patch includes all necessary
 changes for the transform/quant path, including unit test code.
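
The "more coefficients per register" point can be sketched as follows. This is an illustration only; the helper names are hypothetical and do not come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Number of coefficients of type T that fit in one SIMD register.
template<typename T>
size_t lanesPerRegister(size_t registerBits)
{
    return registerBits / (8 * sizeof(T));
}

// Narrow a 32-bit coefficient to 16 bits; returns false if it would not fit.
bool narrowCoeff(int32_t in, int16_t& out)
{
    if (in < std::numeric_limits<int16_t>::min() || in > std::numeric_limits<int16_t>::max())
        return false;
    out = static_cast<int16_t>(in);
    return true;
}
```

A 128-bit register holds four int32_t coefficients but eight int16_t ones, so each load, store and arithmetic instruction covers twice as many coefficients, provided every value passes a range check like narrowCoeff.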

snip

  for (int pass = 0; pass < 2; pass++)
 @@ -1564,7 +1418,7 @@
   * still somewhat rare on end-user PCs we still compile and link
these SSE3
   * intrinsic SIMD functions */
  #if !HIGH_BIT_DEPTH
 -p.idct[IDCT_8x8] = idct8;
 +//p.idct[IDCT_8x8] = idct8;
  p.idct[IDCT_16x16] = idct16;
  p.idct[IDCT_32x32] = idct32;
  #endif

Getting the intrinsic idct8 re-enabled or coded in assembly should be a
priority.

[MC] We don't have any SSE assembly for IDCT_16x16 and IDCT_32x32, only
AVX2 assembly, which is why the intrinsic versions are enabled. (We have
AVX2 assembly for these two functions, but since AVX2 is still somewhat rare
on end-user PCs, we still compile and link these SSE3 intrinsic SIMD
functions.) Further, I will clean up the (disabled) idct8 intrinsic code:
we have SSE and AVX2 assembly for it, so I think it is no longer useful.

--
Steve Borho
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Fwd: [PATCH] refactorizaton of the transform/quant path

2014-11-19 Thread Praveen Tiwari
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Tue, Nov 18, 2014 at 11:35 PM
Subject: Re: [x265] [PATCH] refactorizaton of the transform/quant path
To: Development for x265 x265-devel@videolan.org


On 11/18, prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari
 # Date 1416299427 -19800
 # Node ID 706fa4af912bc1610478de8f09a651ae3e58624c
 # Parent  2f0062f0791b822fa932712a56e6b0a14e976d91
 refactorizaton of the transform/quant path.
 This patch scales down the DCT/IDCT coefficients from int32_t to int16_t,
 as they can be accommodated in int16_t without introducing any encode error.
 This allows us to clean up many intermediate DCT/IDCT buffers and to improve
 encode efficiency for different CLI options, including noise reduction, by
 reducing data movement operations and fitting more coefficients into a
 single register for SIMD operations. This patch includes all necessary
 changes for the transform/quant path, including unit test code.

Testbench failure with this patch applied:

$ ./test/TestBench
Using random seed 546B89D8 8bpp
Testing primitives: SSE2
Testing primitives: SSE3
Testing primitives: SSSE3
Testing primitives: SSE4
denoiseDct: Failed!

Mac OS X x86_64 8bpp

I'm going to hold this patch until you can send a new patch to resolve
this issue.

[MC] Can we disable this single assembly routine and push the patches, so
that this and other patches don't have to wait? Once we are done with this
issue, we can re-enable the denoise asm code.

--
Steve Borho
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] disable denoiseDct asm code until fixed for Mac OS

2014-11-19 Thread Praveen Tiwari
My code does not modify any filter function, so this is surprising. I
remember that a few weeks back there was a typo in the filter AVX2 code; I
think it is the same issue.

On Wed, Nov 19, 2014 at 11:37 PM, Steve Borho st...@borho.org wrote:

 On 11/19, prav...@multicorewareinc.com wrote:
  # HG changeset patch
  # User Praveen Tiwari
  # Date 1416402744 -19800
  # Node ID 0ef14321fb144362b609d51f2d7c58f7db757ceb
  # Parent  706fa4af912bc1610478de8f09a651ae3e58624c
  disable denoiseDct asm code until fixed for Mac OS

 with denoise disabled, it finds the next failing primitive:

 $ ./test/TestBench
 Using random seed 546CDBE7 8bpp
 Testing primitives: SSE2
 Testing primitives: SSE3
 Testing primitives: SSSE3
 Testing primitives: SSE4
 Testing primitives: AVX
 Testing primitives: AVX2

 x265: asm primitive has failed. Go and fix that Right Now!
 luma_hpp[  4x4]⏎

 --
 Steve Borho

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH 3 of 3] asm: AVX2 version luma_vpp[4x4], improve 391c - 302c

2014-11-03 Thread Praveen Tiwari
This crashes on vc11-x86-8bpp in Release mode. Min, can you check your code?


Regards,
Praveen

On Fri, Oct 31, 2014 at 4:16 AM, Min Chen chenm...@163.com wrote:

 # HG changeset patch
 # User Min Chen chenm...@163.com
 # Date 1414709200 25200
 # Node ID 5d0b20f6e4de0b59b8c3306793c7267e01b9a41b
 # Parent  529ff7eca135838dc50c227d52db97725a79f0db
 asm: AVX2 version luma_vpp[4x4], improve 391c - 302c

 diff -r 529ff7eca135 -r 5d0b20f6e4de source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Oct 30 15:46:23 2014 -0700
 +++ b/source/common/x86/asm-primitives.cpp  Thu Oct 30 15:46:40 2014 -0700
 @@ -1799,6 +1799,7 @@
  p.transpose[BLOCK_64x64] = x265_transpose64_avx2;
  #endif
  p.luma_hpp[LUMA_4x4] = x265_interp_8tap_horiz_pp_4x4_avx2;
 +p.luma_vpp[LUMA_4x4] = x265_interp_8tap_vert_pp_4x4_avx2;
  }
  #endif // if HIGH_BIT_DEPTH
  }
 diff -r 529ff7eca135 -r 5d0b20f6e4de source/common/x86/ipfilter8.asm
 --- a/source/common/x86/ipfilter8.asm   Thu Oct 30 15:46:23 2014 -0700
 +++ b/source/common/x86/ipfilter8.asm   Thu Oct 30 15:46:40 2014 -0700
 @@ -3420,6 +3420,88 @@
  RET
  %endmacro

 +
 +INIT_YMM avx2
 +cglobal interp_8tap_vert_pp_4x4, 4,6,8
 +mov r4d, r4m
 +lea r5, [r1 * 3]
 +sub r0, r5
 +
 +; TODO: VPGATHERDD
 +movd        xm1, [r0]               ; m1 = row0
 +movd        xm2, [r0 + r1]          ; m2 = row1
 +punpcklbw   xm1, xm2                ; m1 = [13 03 12 02 11 01 10 00]
 +
 +movd        xm3, [r0 + r1 * 2]      ; m3 = row2
 +punpcklbw   xm2, xm3                ; m2 = [23 13 22 12 21 11 20 10]
 +movd        xm4, [r0 + r5]
 +punpcklbw   xm3, xm4                ; m3 = [33 23 32 22 31 21 30 20]
 +punpcklwd   xm1, xm3                ; m1 = [33 23 13 03 32 22 12 02 31 21 11 01 30 20 10 00]
 +
 +lea         r0, [r0 + r1 * 4]
 +movd        xm5, [r0]               ; m5 = row4
 +punpcklbw   xm4, xm5                ; m4 = [43 33 42 32 41 31 40 30]
 +punpcklwd   xm2, xm4                ; m2 = [43 33 23 13 42 32 22 12 41 31 21 11 40 30 20 10]
 +vinserti128 m1, m1, xm2, 1          ; m1 = [43 33 23 13 42 32 22 12 41 31 21 11 40 30 20 10] - [33 23 13 03 32 22 12 02 31 21 11 01 30 20 10 00]
 +movd        xm2, [r0 + r1]          ; m2 = row5
 +punpcklbw   xm5, xm2                ; m5 = [53 43 52 42 51 41 50 40]
 +punpcklwd   xm3, xm5                ; m3 = [53 43 33 23 52 42 32 22 51 41 31 21 50 40 30 20]
 +movd        xm6, [r0 + r1 * 2]      ; m6 = row6
 +punpcklbw   xm2, xm6                ; m2 = [63 53 62 52 61 51 60 50]
 +punpcklwd   xm4, xm2                ; m4 = [63 53 43 33 62 52 42 32 61 51 41 31 60 50 40 30]
 +vinserti128 m3, m3, xm4, 1          ; m3 = [63 53 43 33 62 52 42 32 61 51 41 31 60 50 40 30] - [53 43 33 23 52 42 32 22 51 41 31 21 50 40 30 20]
 +movd        xm4, [r0 + r5]          ; m4 = row7
 +punpcklbw   xm6, xm4                ; m6 = [73 63 72 62 71 61 70 60]
 +punpcklwd   xm5, xm6                ; m5 = [73 63 53 43 72 62 52 42 71 61 51 41 70 60 50 40]
 +
 +lea         r0, [r0 + r1 * 4]
 +movd        xm7, [r0]               ; m7 = row8
 +punpcklbw   xm4, xm7                ; m4 = [83 73 82 72 81 71 80 70]
 +punpcklwd   xm2, xm4                ; m2 = [83 73 63 53 82 72 62 52 81 71 61 51 80 70 60 50]
 +vinserti128 m5, m5, xm2, 1          ; m5 = [83 73 63 53 82 72 62 52 81 71 61 51 80 70 60 50] - [73 63 53 43 72 62 52 42 71 61 51 41 70 60 50 40]
 +movd        xm2, [r0 + r1]          ; m2 = row9
 +punpcklbw   xm7, xm2                ; m7 = [93 83 92 82 91 81 90 80]
 +punpcklwd   xm6, xm7                ; m6 = [93 83 73 63 92 82 72 62 91 81 71 61 90 80 70 60]
 +movd        xm7, [r0 + r1 * 2]      ; m7 = rowA
 +punpcklbw   xm2, xm7                ; m2 = [A3 93 A2 92 A1 91 A0 90]
 +punpcklwd   xm4, xm2                ; m4 = [A3 93 83 73 A2 92 82 72 A1 91 81 71 A0 90 80 70]
 +vinserti128 m6, m6, xm4, 1          ; m6 = [A3 93 83 73 A2 92 82 72 A1 91 81 71 A0 90 80 70] - [93 83 73 63 92 82 72 62 91 81 71 61 90 80 70 60]
 +
 +; load filter coeff
 +%ifdef PIC
 +lea r5, [tab_LumaCoeff]
 +vpbroadcastdm0, [r5 + r4 * 8 + 0]
 +vpbroadcastdm2, [r5 + r4 * 8 + 4]
 +%else
 +vpbroadcastqm0, [tab_LumaCoeff + r4 * 8 + 0]
 +vpbroadcastdm2, [tab_LumaCoeff + r4 * 8 + 4]
 +%endif
 +
 +pmaddubsw   m1, m0
 +pmaddubsw   m3, m0
 +pmaddubsw   m5, m2
 +pmaddubsw   m6, m2
 +

[x265] Fwd: [PATCH] weight_pp avx2 asm code, improved from 8608.65 cycles to 5138.09 cycles over sse version of asm code

2014-10-16 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Oct 17, 2014 at 3:11 AM
Subject: Re: [x265] [PATCH] weight_pp avx2 asm code, improved from 8608.65
cycles to 5138.09 cycles over sse version of asm code
To: Development for x265 x265-devel@videolan.org





At 2014-10-16 17:20:13,prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1413451199 -19800
# Node ID 858be8d7d7176ab6c6d01cf92d00c8478fe99b34
# Parent  79702581ec824a2a375aebe228d69c3930aeea96
weight_pp avx2 asm code, improved from 8608.65 cycles to 5138.09 cycles over 
sse version of asm code

diff -r 79702581ec82 -r 858be8d7d717 source/common/x86/pixel-util8.asm
--- a/source/common/x86/pixel-util8.asm Wed Oct 15 17:49:35 2014 -0500
+++ b/source/common/x86/pixel-util8.asm Thu Oct 16 14:49:59 2014 +0530
@@ -1375,6 +1375,60 @@

 RET

+INIT_YMM avx2
+cglobal weight_pp, 6, 7, 6
+
+mov  r6d, r6m
+shl  r6d, 6   ; m0 = [w0<<6]
+movd xm0, r6d
+
+movd xm1, r7m ; m1 = [round]
+punpcklwdxm0, xm1
+pshufd   xm0, xm0, 0
+vinserti128  m0, m0, xm0, 1   ; assuming both (w0<<6) and round use at most 16 bits each, m0 = [w0<<6 round]

vpbroadcastd is better

Yes, exactly. I tried to replace (pshufd xm0, xm0, 0) + (vinserti128
m0, m0, xm0, 1) with vpbroadcastd m0, xm0 (following the documented syntax,
__m256i _mm256_broadcastd_epi32(__m128i a)), but it throws a build error:
invalid combination of opcode and operands.

Also, we only use weight_pp in four places, and all of them pass the same
stride in r2 and r3, so we can simplify the interface and free up more
registers here. You can combine W0 and Round in a general register to
improve performance.



+
+movd xm1, r8m
+vpbroadcastd m2, r9m
+mova m5, [pw_1]
+sub  r2d, r4d
+sub  r3d, r4d
+
+.loopH:
+mov r6d, r4d
+shr r6d, 4

Why do the shr on every iteration?

+.loopW:
+movuxm4, [r0]
+pmovzxbwm4, xm4

pmovzxbw doesn't need an aligned address

+punpcklwd   m3, m4, m5
+pmaddwd m3, m0
+psrad   m3, xm1
+paddd   m3, m2
+
+punpckhwd   m4, m5
+pmaddwd m4, m0
+psrad   m4, xm1
+paddd   m4, m2
+
+packssdwm3, m4
+vextracti128 xm4, m3, 1
+packuswbm3, m4

How about vpermq+packuswb(xm3)?

+movu[r1], xm3
+
+add r0, 16
+add r1, 16
+
+dec r6d
+jnz .loopW
+
+lea r0, [r0 + r2]
+lea r1, [r1 + r3]
+
+dec r5d
+jnz .loopH
+
+RET
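
The review suggestion to combine W0 and Round in one general register can be modelled in scalar code. The sketch below follows the instruction sequence quoted above (punpcklwd with pw_1, pmaddwd, psrad, paddd, pack); the function names are hypothetical, and it assumes, as the asm comment does, that both (w0 << 6) and round fit in a signed 16-bit word.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Pack (w0 << 6) into the low word and round into the high word of one
// 32-bit value, ready to be moved to a vector register and broadcast.
uint32_t packWeightRound(int w0, int round)
{
    uint32_t lo = static_cast<uint16_t>(w0 << 6);
    uint32_t hi = static_cast<uint16_t>(round);
    return (hi << 16) | lo;
}

// Scalar model of one pmaddwd dword lane: lo(a) * lo(b) + hi(a) * hi(b).
int32_t pmaddwdLane(uint32_t a, uint32_t b)
{
    int16_t a0 = static_cast<int16_t>(a & 0xFFFF);
    int16_t a1 = static_cast<int16_t>(a >> 16);
    int16_t b0 = static_cast<int16_t>(b & 0xFFFF);
    int16_t b1 = static_cast<int16_t>(b >> 16);
    return int32_t(a0) * b0 + int32_t(a1) * b1;
}

// One output pixel: ((pixel * (w0 << 6) + round) >> shift) + offset, clamped to 8 bits.
uint8_t weightPixel(uint8_t pixel, int w0, int round, int shift, int offset)
{
    uint32_t pixelAndOne = (1u << 16) | pixel;                  // punpcklwd with pw_1
    int value = pmaddwdLane(pixelAndOne, packWeightRound(w0, round));
    value = (value >> shift) + offset;                          // psrad, then paddd
    return static_cast<uint8_t>(std::min(255, std::max(0, value))); // pack with saturation
}
```

Because the pixel is interleaved with the constant 1, a single multiply-add yields pixel * (w0 << 6) + round, so the weight and the rounding term can travel together in one packed dword.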


___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] noiseReduction: make noiseReduction deterministic for a given number of frameEncoders

2014-10-14 Thread Praveen Tiwari
It seems we missed something here. I tested this patch at my end: outputs
are deterministic with --pmode but still non-deterministic without the
--pmode option. Steve/Deepthi, please verify at your end before pushing it.
I used the following command lines:

 y4mInputs\park_joy_1280x720p50.y4m --tune=ssim --psnr --asm=false --nr=1000 --hash 1 --input-depth 8 --preset ultrafast -o outputFiles\park_joy-c2_nr.out [non-deterministic]

 y4mInputs\park_joy_1280x720p50.y4m --tune=ssim --psnr --asm=false --nr=1000 --hash 1 --input-depth 8 --preset ultrafast --pmode -o outputFiles\park_joy-c1_nr.out [deterministic]



Regards,
Praveen


On Tue, Oct 14, 2014 at 4:54 PM, deep...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Deepthi Nandakumar deep...@multicorewareinc.com
 # Date 1413278604 -19800
 #  Tue Oct 14 14:53:24 2014 +0530
 # Node ID c6e786dbbfaa39822799d17e6c32d49c6141a7fb
 # Parent  38b5733cc629dd16db770e6a93b4f994e13336f3
 noiseReduction: make noiseReduction deterministic for a given number of
 frameEncoders.

 diff -r 38b5733cc629 -r c6e786dbbfaa source/common/frame.cpp
 --- a/source/common/frame.cpp   Tue Oct 14 14:35:30 2014 +0530
 +++ b/source/common/frame.cpp   Tue Oct 14 14:53:24 2014 +0530
 @@ -43,6 +43,7 @@
  m_picSym = NULL;
  m_reconRowCount.set(0);
  m_countRefEncoders = 0;
 +m_frameEncoderID = 0;
  memset(m_lowres, 0, sizeof(m_lowres));
  m_next = NULL;
  m_prev = NULL;
 diff -r 38b5733cc629 -r c6e786dbbfaa source/common/frame.h
 --- a/source/common/frame.h Tue Oct 14 14:35:30 2014 +0530
 +++ b/source/common/frame.h Tue Oct 14 14:53:24 2014 +0530
 @@ -50,6 +50,7 @@
  TComPicSym*   m_picSym;
  TComPicYuv*   m_reconPicYuv;
  int   m_POC;
 +int   m_frameEncoderID; // To identify the ID of the
 frameEncoder processing this frame

  //** Frame Parallelism - notification between FrameEncoders of
 available motion reference rows **
  ThreadSafeInteger m_reconRowCount;  // count of CTU rows
 completely reconstructed and extended for motion reference
 diff -r 38b5733cc629 -r c6e786dbbfaa source/common/quant.cpp
 --- a/source/common/quant.cpp   Tue Oct 14 14:35:30 2014 +0530
 +++ b/source/common/quant.cpp   Tue Oct 14 14:53:24 2014 +0530
 @@ -156,6 +156,7 @@
  m_resiDctCoeff = NULL;
  m_fencDctCoeff = NULL;
  m_fencShortBuf = NULL;
 +m_nr   = NULL;
  }

  bool Quant::init(bool useRDOQ, double psyScale, const ScalingList
 scalingList, Entropy entropy)
 diff -r 38b5733cc629 -r c6e786dbbfaa source/encoder/analysis.cpp
 --- a/source/encoder/analysis.cpp   Tue Oct 14 14:35:30 2014 +0530
 +++ b/source/encoder/analysis.cpp   Tue Oct 14 14:53:24 2014 +0530
 @@ -292,7 +292,11 @@
  if (!jobId || m_param-rdLevel  4)
  {
  slave->m_quant.setQPforQuant(cu);
 -slave->m_quant.m_nr = m_quant.m_nr;
 +if(m_param->noiseReduction)
 +{
 +int frameEncoderID = cu->m_pic->m_frameEncoderID;
 +slave->m_quant.m_nr = m_tld[threadId].m_nr[frameEncoderID];
 +}
  slave->m_rdContexts[depth].cur.load(m_rdContexts[depth].cur);
  }
  }
 diff -r 38b5733cc629 -r c6e786dbbfaa source/encoder/analysis.h
 --- a/source/encoder/analysis.h Tue Oct 14 14:35:30 2014 +0530
 +++ b/source/encoder/analysis.h Tue Oct 14 14:53:24 2014 +0530
 @@ -172,7 +172,9 @@
  struct ThreadLocalData
  {
  Analysis analysis;
 -
 +NoiseReduction *m_nr;
 +
 +ThreadLocalData() { m_nr = NULL; }
  ~ThreadLocalData() { analysis.destroy(); }
  };

 diff -r 38b5733cc629 -r c6e786dbbfaa source/encoder/encoder.cpp
 --- a/source/encoder/encoder.cpp    Tue Oct 14 14:35:30 2014 +0530
 +++ b/source/encoder/encoder.cpp    Tue Oct 14 14:53:24 2014 +0530
 @@ -74,6 +74,7 @@
  m_csvfpt = NULL;
  m_param = NULL;
  m_threadPool = 0;
 +m_numThreadLocalData = 0;
  }

  void Encoder::create()
 @@ -162,15 +163,17 @@

  /* Allocate thread local data, one for each thread pool worker and
   * if --no-wpp, one for each frame encoder */
 -int numLocalData = poolThreadCount;
 +m_numThreadLocalData = poolThreadCount;
  if (!m_param->bEnableWavefront)
 -numLocalData += m_param->frameNumThreads;
 -m_threadLocalData = new ThreadLocalData[numLocalData];
 -for (int i = 0; i < numLocalData; i++)
 +m_numThreadLocalData += m_param->frameNumThreads;
 +m_threadLocalData = new ThreadLocalData[m_numThreadLocalData];
 +for (int i = 0; i < m_numThreadLocalData; i++)
  {
  m_threadLocalData[i].analysis.setThreadPool(m_threadPool);
  m_threadLocalData[i].analysis.initSearch(m_param, m_scalingList);
  m_threadLocalData[i].analysis.create(g_maxCUDepth + 1, g_maxCUSize, m_threadLocalData);
 +if(m_param->noiseReduction)
 +m_threadLocalData[i].m_nr = new NoiseReduction[m_param->frameNumThreads];
  }

  if 

[x265] Fwd: [PATCH] denoiseDct: unit test code

2014-09-16 Thread Praveen Tiwari
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Mon, Sep 15, 2014 at 4:28 PM
Subject: Re: [x265] [PATCH] denoiseDct: unit test code
To: Development for x265 x265-devel@videolan.org


On 09/15, prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari
 # Date 1410775657 -19800
 # Node ID 36f5477f54ba8047f9abc1b42c5b56c6d223dc5a
 # Parent  184e56afa951815f4e295b4fcce094ee03361a2e
 denoiseDct: unit test code

a few nits and questions

 diff -r 184e56afa951 -r 36f5477f54ba source/test/mbdstharness.cpp
 --- a/source/test/mbdstharness.cppFri Sep 12 12:02:46 2014 +0530
 +++ b/source/test/mbdstharness.cppMon Sep 15 15:37:37 2014 +0530
 @@ -66,14 +66,17 @@
  short_test_buff[0][i]= (rand() & PIXEL_MAX) - (rand() & PIXEL_MAX);
  int_test_buff[0][i]  = rand() % PIXEL_MAX;
  int_idct_test_buff[0][i] = (rand() % (SHORT_MAX - SHORT_MIN)) - SHORT_MAX;
 +int_denoise_test_buff1[0][i] = int_denoise_test_buff2[0][i] = (rand() & UNSIGNED_SHORT_MAX) - (rand() & UNSIGNED_SHORT_MAX);

  short_test_buff[1][i]= -PIXEL_MAX;
  int_test_buff[1][i]  = -PIXEL_MAX;
  int_idct_test_buff[1][i] = SHORT_MIN;
 +int_denoise_test_buff1[1][i] = int_denoise_test_buff2[1][i] = -UNSIGNED_SHORT_MAX;

  short_test_buff[2][i]= PIXEL_MAX;
  int_test_buff[2][i]  = PIXEL_MAX;
  int_idct_test_buff[2][i] = SHORT_MAX;
 +int_denoise_test_buff1[2][i] = int_denoise_test_buff2[1][i] = UNSIGNED_SHORT_MAX;

  mbuf1[i] = rand() & PIXEL_MAX;
  mbufdct[i] = (rand() & PIXEL_MAX) - (rand() & PIXEL_MAX);
 @@ -313,6 +316,46 @@
  return true;
  }

 +bool MBDstHarness::check_denoise_dct_primitive(denoiseDct_t ref, denoiseDct_t opt)
 +{
 +int j = 0;
 +
 +for (int i = 0; i < 4; i++)
 +{
 +int log2TrSize = i + 2;
 +int num = 1 << (log2TrSize * 2);

This second loop confuses me. What's the point of it?

 +for (int n = 0; n <= num; n++)
 +{
 +memset(mubuf1, 0, num * sizeof(uint32_t));
 +memset(mubuf2, 0, num * sizeof(uint32_t));
 +memset(mushortbuf1, 0,  num * sizeof(uint16_t));
 +
 +for (int k = 0; k < n; j++)
 +{
 +mushortbuf1[k] = rand() % UNSIGNED_SHORT_MAX;
 +}

we don't use braces for single-line expressions

 +int index = rand() % TEST_CASES;
 +int cmp_size = sizeof(int) * num;
 +
 +ref(int_denoise_test_buff1[index] + j, mubuf1, mushortbuf1,
num);
 +checked(opt, int_denoise_test_buff2[index] + j, mubuf2,
mushortbuf1, num);
 +
 +if (memcmp(int_denoise_test_buff1[index] + j,
int_denoise_test_buff2[index] + j, cmp_size))
 +return false;

white-space

 +if (memcmp(mubuf1, mubuf2, cmp_size))
 +return false;
 +
 +reportfail();
 +j += INCR;

is this bounds safe? TEST_BUF_SIZE is allocated for a max of ITERS
iterations (128). It seems like num can be 32*32.
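
The bounds question can be worked through with a little arithmetic. The sketch below is an illustration under assumptions: it takes the loop structure from the quoted patch, treats the step size (INCR in the patch) and the buffer size as hypothetical parameters because their values are not given in the thread, and assumes j only advances by that step after each call.

```cpp
#include <cassert>
#include <cstddef>

// Number of ref/opt calls made by the quoted loops: for each of the four
// transform sizes (4x4 to 32x32) the inner loop runs n = 0..num inclusive.
size_t totalIterations()
{
    size_t total = 0;
    for (int i = 0; i < 4; i++)
    {
        int log2TrSize = i + 2;
        size_t num = size_t(1) << (log2TrSize * 2);
        total += num + 1;
    }
    return total;
}

// Highest element index touched if j starts at 0, advances by incr after
// every call, and the last call processes 32 * 32 coefficients from offset j.
size_t highestIndexTouched(size_t incr)
{
    size_t lastOffset = (totalIterations() - 1) * incr;
    return lastOffset + 32 * 32 - 1;
}

// The access pattern is safe only if that index lies inside the buffer.
bool isBoundsSafe(size_t incr, size_t bufferSize)
{
    return highestIndexTouched(incr) < bufferSize;
}
```

The loops make 1364 calls in total, far more than 128, so a buffer sized for 128 steps would be overrun unless j is reset or wrapped.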

 +}
 +}
 +
 +return true;
 +}
 +
  bool MBDstHarness::testCorrectness(const EncoderPrimitives ref, const
EncoderPrimitives opt)
  {
  for (int i = 0; i < NUM_DCTS; i++)
 @@ -393,6 +436,15 @@
  }
  }

 +if (opt.denoiseDct)
 +{
 +if (!check_denoise_dct_primitive(ref.denoiseDct, opt.denoiseDct))
 +{
 +printf("denoiseDct: Failed!\n");
 +return false;
 +}
 +}
 +
  return true;
  }

 @@ -448,4 +500,10 @@
  REPORT_SPEEDUP(opt.count_nonzero, ref.count_nonzero, mbuf1,
i * i)
  }
  }
 +
 +if (opt.denoiseDct)
 +{
 +printf("denoiseDct\t\t");
 +REPORT_SPEEDUP(opt.denoiseDct, ref.denoiseDct,
int_denoise_test_buff1[0], mubuf1, mushortbuf1, 32 * 32);
 +}
  }
 diff -r 184e56afa951 -r 36f5477f54ba source/test/mbdstharness.h
 --- a/source/test/mbdstharness.h  Fri Sep 12 12:02:46 2014 +0530
 +++ b/source/test/mbdstharness.h  Mon Sep 15 15:37:37 2014 +0530
 @@ -44,6 +44,10 @@
  int16_t mbufdct[TEST_BUF_SIZE];
  int mbufidct[TEST_BUF_SIZE];

 +ALIGN_VAR_32(uint32_t, mubuf1[MAX_TU_SIZE]);
 +ALIGN_VAR_32(uint32_t, mubuf2[MAX_TU_SIZE]);
 +ALIGN_VAR_32(uint16_t, mushortbuf1[MAX_TU_SIZE]);

does denoise need all new buffers? can it reuse existing buffers?
 I need unsigned buffers, so I preferred to add new ones rather than
reinterpreting the signed buffers as unsigned through type casts. I have
updated the rest of the points in my patch.

There's no need to declare them aligned here. The first array is
declared aligned and since all below it are also aligned in size every
array is implicitly aligned.
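
A minimal sketch of that argument, with a made-up struct standing in for the harness members (the names and the 1024-element sizes are assumptions chosen so that every array size is a multiple of 32 bytes):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the harness buffers. Only the first member
// carries an explicit alignment request.
struct Buffers
{
    alignas(32) int16_t first[1024];  // 2048 bytes, explicitly aligned
    int16_t  second[1024];            // 2048 bytes
    uint32_t third[1024];             // 4096 bytes
    uint16_t fourth[1024];            // 2048 bytes
};

// Each following member starts at a multiple of 32 bytes because every
// preceding member has a size that is a multiple of 32 bytes.
static_assert(offsetof(Buffers, second) % 32 == 0, "second is 32-byte aligned");
static_assert(offsetof(Buffers, third)  % 32 == 0, "third is 32-byte aligned");
static_assert(offsetof(Buffers, fourth) % 32 == 0, "fourth is 32-byte aligned");
```

This holds only while every earlier array keeps a size that is a multiple of the alignment, and only if the object itself is placed at an aligned address.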

  int16_t mshortbuf2[MAX_TU_SIZE];
  int16_t mshortbuf3[MAX_TU_SIZE];

 @@ -56,6 +60,9 @@
  int int_test_buff[TEST_CASES][TEST_BUF_SIZE];
  int int_idct_test_buff[TEST_CASES][TEST_BUF_SIZE];

 +int int_denoise_test_buff1[TEST_CASES

Re: [x265] [PATCH] copy_cnt: enable avx2 version of asm code

2014-09-11 Thread Praveen Tiwari
You can push 16x16 and 32x32 as well. They perform well but still need a
bit more improvement; I will send an improvement patch soon.

Regards,
Praveen Tiwari

On Thu, Sep 11, 2014 at 11:29 AM, Deepthi Nandakumar 
deep...@multicorewareinc.com wrote:

 Would be better to combine this asm enable with the corresponding asm
 patch itself. I have pushed copy_cnt8, and enabled only that for now.

 On Wed, Sep 10, 2014 at 3:28 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1410343073 -19800
 # Node ID 2cd4a13086740728559fde3a176953e9aa4c0782
 # Parent  7bc4db02ccc728f6e2ddedd036c96e3d37b90f22
 copy_cnt: enable avx2 version of asm code

 diff -r 7bc4db02ccc7 -r 2cd4a1308674 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Wed Sep 10 14:45:33 2014
 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Wed Sep 10 15:27:53 2014
 +0530
 @@ -1724,14 +1724,10 @@
  p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_avx2;
  p.ssd_s[BLOCK_32x32] = x265_pixel_ssd_s_32_avx2;

 -/* Need to update assembly code as per changed interface of the
 copy_cnt primitive, once
 - * code is updated, avx2 version will be enabled */
 -/*
  p.copy_cnt[BLOCK_4x4] = x265_copy_cnt_4_avx2;
  p.copy_cnt[BLOCK_8x8] = x265_copy_cnt_8_avx2;
  p.copy_cnt[BLOCK_16x16] = x265_copy_cnt_16_avx2;
  p.copy_cnt[BLOCK_32x32] = x265_copy_cnt_32_avx2;
 -*/

  p.cvt32to16_shl[BLOCK_4x4] = x265_cvt32to16_shl_4_avx2;
  p.cvt32to16_shl[BLOCK_8x8] = x265_cvt32to16_shl_8_avx2;
 ___
 x265-devel mailing list
 x265-devel@videolan.org
 https://mailman.videolan.org/listinfo/x265-devel





Re: [x265] [PATCH] removed copy_cnt_4 avx2 asm code: SSE version is eualy faster

2014-09-11 Thread Praveen Tiwari
Ignore it; I need to correct the commit message.


Regards,
Praveen Tiwari

On Thu, Sep 11, 2014 at 4:41 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1410433904 -19800
 # Node ID 5740ec22db67267bfca97fbba07ef9239802d2b0
 # Parent  012f315d3eda8044f5a49865e15ba2943fbab094
 removed copy_cnt_4 avx2 asm code: SSE version is eualy faster

 diff -r 012f315d3eda -r 5740ec22db67 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Wed Sep 10 17:27:20 2014
 +0200
 +++ b/source/common/x86/asm-primitives.cpp  Thu Sep 11 16:41:44 2014
 +0530
 @@ -1730,7 +1730,6 @@
  /* Need to update assembly code as per changed interface of the
 copy_cnt primitive, once
   * code is updated, avx2 version will be enabled */

 -// p.copy_cnt[BLOCK_4x4] = x265_copy_cnt_4_avx2;
  p.copy_cnt[BLOCK_8x8] = x265_copy_cnt_8_avx2;
  // p.copy_cnt[BLOCK_16x16] = x265_copy_cnt_16_avx2;
  // p.copy_cnt[BLOCK_32x32] = x265_copy_cnt_32_avx2;
 diff -r 012f315d3eda -r 5740ec22db67 source/common/x86/blockcopy8.asm
 --- a/source/common/x86/blockcopy8.asm  Wed Sep 10 17:27:20 2014 +0200
 +++ b/source/common/x86/blockcopy8.asm  Thu Sep 11 16:41:44 2014 +0530
 @@ -3987,35 +3987,6 @@
  %endif
  RET

 -
 -INIT_YMM avx2
 -cglobal copy_cnt_4, 3,3,3
 -add r2d, r2d
 -xorpd   xm2, xm2
 -
 -; row 0 & 1
 -movq    xm0, [r1]
 -movhps  xm0, [r1 + r2]
 -
 -; row 2 & 3
 -movq    xm1, [r1 + r2 * 2]
 -lea     r2, [r2 * 3]
 -movhps  xm1, [r1 + r2]
 -
 -vinserti128 m0, m0, xm1, 1
 -movu    [r0], m0
 -
 -vextractf128 xm1, m0, 1
 -packsswb xm0, xm1
 -pcmpeqb  xm0, xm2
 -
 -; get count
 -pmovmskb eax, xm0
 -not     ax
 -popcnt  ax, ax
 -RET
 -
 -

  
 ;--
  ; uint32_t copy_cnt(int16_t *dst, int16_t *src, intptr_t stride);

  
 ;--



[x265] Fwd: Fwd: [PATCH] copy_cnt_4: faster AVX2 code

2014-09-10 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Sep 10, 2014 at 12:14 PM
Subject: Re: [x265] Fwd: [PATCH] copy_cnt_4: faster AVX2 code
To: Development for x265 x265-devel@videolan.org




At 2014-09-10 09:34:31,Praveen Tiwari prav...@multicorewareinc.com
wrote:


-- Forwarded message --
From: chen chenm...@163.com
Date: Tue, Sep 9, 2014 at 10:17 AM
Subject: Re: [x265] [PATCH] copy_cnt_4: faster AVX2 code
To: Development for x265 x265-devel@videolan.org


 Most of the operations are SSE2, with just one movu, so why do we need an
 AVX2 version for 4x4?
What about vinserti128?

Do you want to use vinserti128 to combine two 128-bit halves into 256 bits?
Is that not more costly than two movu instructions?

I tested both the SSE and AVX2 code on a Haswell i5 machine. The AVX2 code
seems a bit faster, so I think we should keep both versions. Here are the
results of 3 runs:

SSE VERSION:
copy_cnt[4x4]  4.21x  110.16  463.86
copy_cnt[4x4]  4.18x  104.64  437.08
copy_cnt[4x4]  4.17x  110.23  460.02

AVX2 VERSION:
copy_cnt[4x4]  4.71x   99.23  467.63
copy_cnt[4x4]  4.39x  104.46  458.58
copy_cnt[4x4]  4.71x   99.27  467.91




[x265] Fwd: [PATCH] copy_cnt_4: faster AVX2 code

2014-09-09 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Tue, Sep 9, 2014 at 10:17 AM
Subject: Re: [x265] [PATCH] copy_cnt_4: faster AVX2 code
To: Development for x265 x265-devel@videolan.org


Most of the operations are SSE2, with just one movu, so why do we need an
AVX2 version for 4x4?
What about vinserti128?

At 2014-09-09 16:37:23,prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1410251834 -19800
# Node ID d011073f35258cb2f0ad95db6038c2d9fb840b27
# Parent  ebb84e9dbb0fa0e8c4c9304b2efd57f8ac3d0c05
copy_cnt_4: faster AVX2 code

diff -r ebb84e9dbb0f -r d011073f3525 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Tue Sep 09 11:36:58 2014 +0530
+++ b/source/common/x86/blockcopy8.asm Tue Sep 09 14:07:14 2014 +0530
@@ -3990,7 +3990,7 @@
 INIT_YMM avx2
 cglobal copy_cnt_4, 3,3,3
     add         r2d, r2d
-    xorpd       xm2, xm2
+    xorpd       m2,  m2
 
     ; row 0 & 1
     movq        xm0, [r1]
@@ -4004,11 +4004,9 @@
     vinserti128 m0, m0, xm1, 1
     movu        [r0], m0
 
-    vextractf128 xm1, m0, 1
-    packsswb    xm0, xm1
-    pcmpeqb     xm0, xm2
-
     ; get count
+    packsswb    xm0, xm1
+    pcmpeqb     xm0, xm2
     pmovmskb    eax, xm0
     not         ax
     popcnt      ax, ax


Re: [x265] [PATCH] count_nonzero primitive, downscaling quantCoeff from int32_t* to int16_t*

2014-08-12 Thread Praveen Tiwari
Thanks, just sent a fix for it.


Regards,
Praveen


On Tue, Aug 12, 2014 at 7:18 PM, chen chenm...@163.com wrote:

 -X265_CHECK((int)numSig == primitives.count_nonzero(coeff, 1 <<
 log2TrSize * 2), "numSig differ\n");
 +/* This section of code is to safely convert int32_t coefficients
 to int16_t, once the caller function is
 + * optimize to take coefficients as int16_t*, it will be cleanse.*/
 +int numCoeff = (1 << (log2TrSize * 2));
 +assert(numCoeff <= 1024);
 +ALIGN_VAR_16(int16_t, qCoeff[32 * 32]);
 +for (int i = 0; i < numCoeff; i++)
 +{
 +qCoeff[i] = (
 coeff[i]  0x);
 +}
 I suggest using clip here to avoid value problems (e.g. 0x1 becoming zero);
 the asm instruction also matches clip.
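
To illustrate the difference between the two ways of narrowing, here is a small sketch. The function names are made up for the example, and the sample value 0x10000 is an assumption (the value in the quoted text is incomplete); it is simply the smallest positive value whose low 16 bits are all zero.

```cpp
#include <cstdint>

// Narrow by masking: only the low 16 bits survive, so a large coefficient
// can silently turn into a small one or into zero.
inline int16_t narrowByMask(int32_t v)
{
    return static_cast<int16_t>(v & 0xFFFF);
}

// Narrow by clipping: out-of-range values saturate to the int16_t limits,
// so a nonzero coefficient stays nonzero.
inline int16_t narrowByClip(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return static_cast<int16_t>(v);
}
```

For 0x10000, masking yields 0 while clipping yields 32767, which matters for a nonzero count. Saturating pack instructions such as packssdw behave like the clip version.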




Re: [x265] x265: uncommon behavior by changing the 8-point DCT matrix

2014-06-10 Thread Praveen Tiwari
I think you are testing with asm code enabled. The assembly code has its own
tables and has nothing to do with the constant 'g_t8' at
source/Lib/TLibCommon/TComRom.cpp (that one is used only by the C code).
Check the dct8.asm file for the asm tables.


Regards,
Praveen Tiwari


On Wed, May 28, 2014 at 5:15 AM, Paulo André Oliveira 
oliveirapa...@globo.com wrote:

 Dear x265 development team,

 I am trying to conduct the following experiment: assess the change in the
 compressed video's quality by changing only the 8-point DCT matrix, which I
 suppose is the constant 'g_t8' at source/Lib/TLibCommon/TComRom.cpp

 However, the video's quality, which I am monitoring by the PSNR and SSIM
 metrics, keeps the same with any random matrix that I define in 'g_t8'. I
 am using the last version of x265 as of today.

 Sincerely,

 Paulo Oliveira



Re: [x265] Fwd: [PATCH] noise reduction feature, ported from x264

2014-05-12 Thread Praveen Tiwari
Yes, that is true; thanks for your suggestions. I scanned through a few papers
to find out where the following constant values (used to generate the weight
table) come from.

#define W(i) (i==0 ? FIX8(1.) :\
              i==1 ? FIX8(0.8859) :\
              i==2 ? FIX8(1.6000) :\
              i==3 ? FIX8(0.9415) :\
              i==4 ? FIX8(1.2651) :\
              i==5 ? FIX8(1.1910) :0)

It seems these values also depend on the DCT coefficients, so we need a new
weight table for x265. I found that they are generated through the formula:

Qstep ≈ Vi8 / (Si8 * 2^8 )  (for 8x8 block)

where rescaling matrix Vi8 is (32, 28, 51, 30, 40, 38) (qp = 4 from
following table)

QP   vm0  vm1  vm2  vm3  vm4  vm5
0    20   18   32   19   25   24
1    22   19   35   21   28   26
2    26   23   42   24   33   31
3    28   25   45   26   35   33
4    32   28   51   30   40   38
5    36   32   58   34   46   43
Si8 = 1/8 (0.125) (basically Si is also a matrix but it seems first element
is chosen for transform normalization)

So, if we apply the above formula:

W(0) = 32 / (0.125 * 256) = 1        ≈ 1.
W(1) = 28 / (0.125 * 256) = 0.875    ≈ 0.8859
W(2) = 51 / (0.125 * 256) = 1.59375  ≈ 1.6000
W(3) = 30 / (0.125 * 256) = 0.9375   ≈ 0.9415
W(4) = 40 / (0.125 * 256) = 1.25     ≈ 1.2651
W(5) = 38 / (0.125 * 256) = 1.1875   ≈ 1.1910
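
The arithmetic above can be reproduced with a few lines of code. This is only a sketch of the formula as quoted in this message (V / (S * 2^8) with S = 1/8); the function name is made up and nothing here is taken from x264 or x265 sources.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: apply W = V / (S * 2^8) to one row of the
// rescaling table, with S = 1/8 as assumed in the message.
inline std::array<double, 6> weightsFromV(const std::array<int, 6>& v)
{
    const double s = 0.125;     // Si8 = 1/8
    const double scale = 256.0; // 2^8
    std::array<double, 6> w{};
    for (std::size_t i = 0; i < v.size(); i++)
        w[i] = v[i] / (s * scale);
    return w;
}
```

Feeding in the qp = 4 row (32, 28, 51, 30, 40, 38) gives 1, 0.875, 1.59375, 0.9375, 1.25 and 1.1875, which are close to the W() constants but do not match them exactly.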

Is my analysis heading in the right direction? If it is, why is Vi8 chosen
for qp = 4 and not for any other qp?

Finally weight table is arranged as

W(0), W(3), W(4), W(3),  W(0), W(3), W(4), W(3),
W(3), W(1), W(5), W(1),  W(3), W(1), W(5), W(1),
W(4), W(5), W(2), W(5),  W(4), W(5), W(2), W(5),
W(3), W(1), W(5), W(1),  W(3), W(1), W(5), W(1),

W(0), W(3), W(4), W(3),  W(0), W(3), W(4), W(3),
W(3), W(1), W(5), W(1),  W(3), W(1), W(5), W(1),
W(4), W(5), W(2), W(5),  W(4), W(5), W(2), W(5),
W(3), W(1), W(5), W(1),  W(3), W(1), W(5), W(1)

What is the logic behind this arrangement?
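
Whatever the derivation turns out to be, the listing itself is periodic: the W index at (row, col) depends only on the two low bits of each coordinate. The sketch below just restates that observation in code; the names are made up, and it does not claim to explain why the classes are assigned this way.

```cpp
// W index for one 4x4 tile, read off the first four rows and columns of
// the listing above.
static const int kTile[4][4] = {
    {0, 3, 4, 3},
    {3, 1, 5, 1},
    {4, 5, 2, 5},
    {3, 1, 5, 1},
};

// Hypothetical helper: W index for position (row, col) of the 8x8 table.
inline int weightIndex8x8(int row, int col)
{
    return kTile[row & 3][col & 3];
}
```

The tile is symmetric, so swapping row and col gives the same weight.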


Regards,
Praveen Tiwari



On Sat, May 10, 2014 at 8:12 AM, Jason Garrett-Glaser ja...@x264.com wrote:

 That isn't correct at all; the weights depend on the transforms, which
 depend on the video format. You can't just build a 16x16 out of 8x8s
 or 4x4s; you need to match the way the format works.

 Jason


[x265] Fwd: [PATCH] noise reduction feature, ported from x264

2014-05-08 Thread Praveen Tiwari
-- Forwarded message --
From: Jason Garrett-Glaser ja...@x264.com
Date: Thu, May 8, 2014 at 5:08 PM
Subject: Re: [x265] [PATCH] noise reduction feature, ported from x264
To: Development for x265 x265-devel@videolan.org


This only seems to have 4x4 and 8x8 transform sizes; how does this
work given that H.265 has many other transform sizes?  What does it do
for other transform sizes?

4x4 and 8x8 transform sizes are used as basic blocks to generate the bigger
sizes (16x16, 32x32), as we have weight tables only for 4x4 and 8x8 (taken
from x264).

Jason


Re: [x265] [PATCH] all_angs_pred_32x32, asm code improvement

2014-02-27 Thread Praveen Tiwari
This is a new patch with the same changes applied to other modes. I gave it
the same commit message, which is perhaps why it seems confusing. Do I need
to send it as an attachment?


On Thu, Feb 27, 2014 at 4:28 PM, Deepthi Nandakumar 
deep...@multicorewareinc.com wrote:

 The earlier patch was pushed, Praveen. Can you send a new patch which just
 removes the unused statements?



Re: [x265] [PATCH] all_angs_pred_32x32, asm code improvement

2014-02-26 Thread Praveen Tiwari
Oh, that was left in by mistake. I had commented out the old code to test the
correctness of the new code. I will update the patch.


On Thu, Feb 27, 2014 at 3:33 AM, chen chenm...@163.com wrote:

 At 2014-02-26 20:28:52,prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1393417704 -19800
 # Node ID 7de2875c614058648475618d2b9faa5a9611225b
 # Parent  53c7e3e789435a3e7b51f1ad61e9425f59ea6cf7
 all_angs_pred_32x32, asm code improvement
 
 @@ -23679,8 +23563,9 @@
  pmaddubsw m3, m1, m6
  pmulhrsw  m3, m7
  pslldq    m4, 2
 -pinsrb    m4, [r4 + 8], 1
 -pinsrb    m4, [r4 + 7], 0
 +;pinsrb    m4, [r4 + 8], 1
 +;pinsrb    m4, [r4 + 7], 0
 +pinsrw    m4, [r4 + 7], 0
 please remove unused comment line
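
For readers wondering why one pinsrw can stand in for the two pinsrb instructions: the word insert reads the two consecutive bytes at r4 + 7 and r4 + 8 and writes them to byte lanes 0 and 1. The sketch below emulates this on a plain 16-byte array; the type and function names are made up for the illustration and no real SIMD is involved.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using Reg = std::array<uint8_t, 16>; // stand-in for one XMM register

// Emulates "pinsrb reg, [src], idx": replace a single byte lane.
inline void insertByte(Reg& reg, const uint8_t* src, int idx)
{
    reg[idx] = src[0];
}

// Emulates "pinsrw reg, [src], idx": replace one 16-bit lane with the two
// consecutive bytes at src (the byte at src lands in the lower byte lane).
inline void insertWord(Reg& reg, const uint8_t* src, int idx)
{
    std::memcpy(&reg[idx * 2], src, 2);
}

// Runs both instruction sequences from the diff on identical registers.
inline bool sameResult(const uint8_t* r4)
{
    Reg a{};
    Reg b{};
    for (int i = 0; i < 16; i++)
        a[i] = b[i] = static_cast<uint8_t>(0xA0 + i);

    insertByte(a, r4 + 8, 1); // old: pinsrb m4, [r4 + 8], 1
    insertByte(a, r4 + 7, 0); // old: pinsrb m4, [r4 + 7], 0
    insertWord(b, r4 + 7, 0); // new: pinsrw m4, [r4 + 7], 0
    return a == b;
}
```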



Re: [x265] [PATCH] all_angs_pred_4x4, mova replace with pxor

2013-12-04 Thread Praveen Tiwari
Min, I have sent the updated full patch.


Regards,
Praveen Tiwari


On Wed, Dec 4, 2013 at 8:58 PM, chen chenm...@163.com wrote:

 Can you send a full patch instead of a patch on top of a patch?

 At 2013-12-04 22:50:05,prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1386168592 -19800
 # Node ID 52d604b17f7b6c7dedee4a5defcb8f089221b02b
 # Parent  c31e28cd26aa8a3f07ba0023a5923931cc687a2d
 all_angs_pred_4x4, mova replace with pxor
 
 diff -r c31e28cd26aa -r 52d604b17f7b source/common/x86/intrapred8.asm
 --- a/source/common/x86/intrapred8.asm Wed Dec 04 20:05:57 2013 +0530
 +++ b/source/common/x86/intrapred8.asm Wed Dec 04 20:19:52 2013 +0530
 @@ -34,8 +34,6 @@
 
  c_trans_4x4 db 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15
 
 -tab_Zero: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 -
  const ang_table
  %assign x 0
  %rep 32
 @@ -945,7 +943,7 @@
  pshufd   m3, m2,0
  movu [r0 + 128], m3
 
 -mova m3,  [tab_Zero]
 +pxor m3,  m3
 
  pshufb   m4,  m2,   m3
  punpcklbw m4,  m3
 @@ -1347,7 +1345,7 @@
  pshufd   m2, m1,0
  movu [r0 + 384], m2
 
 -mova m2,  [tab_Zero]
 +pxor m2, m2
 
  pshufb   m3,  m1,   m2
  punpcklbw m3,  m2


Re: [x265] [PATCH] asm-primitives.cpp, removed temporary function pointer initialization, generated through macro calls

2013-11-22 Thread Praveen Tiwari
Sorry, I removed the wrong pointer initialization. I will fix it in the next
patch; please don't merge this one.


On Fri, Nov 22, 2013 at 4:34 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1385118266 -19800
 # Node ID f2b8bcaf435c00d835cd4389063ed09d22e7be28
 # Parent  87a797d1c03afaea0b3cf9a2dfcac2c7e2950efc
 asm-primitives.cpp, removed temporary function pointer initialization,
 generated through macro calls

 diff -r 87a797d1c03a -r f2b8bcaf435c source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Fri Nov 22 15:47:02 2013
 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Fri Nov 22 16:34:26 2013
 +0530
 @@ -145,7 +145,8 @@
  p.chroma[X265_CSP_I420].filter_hpp[CHROMA_ ## W ## x ## H] =
 x265_interp_4tap_horiz_pp_ ## W ## x ## H ## cpu; \
  p.chroma[X265_CSP_I420].filter_hps[CHROMA_ ## W ## x ## H] =
 x265_interp_4tap_horiz_ps_ ## W ## x ## H ## cpu; \
  p.chroma[X265_CSP_I420].filter_vpp[CHROMA_ ## W ## x ## H] =
 x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu; \
 -p.chroma[X265_CSP_I420].filter_vps[CHROMA_ ## W ## x ## H] =
 x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu;
 +p.chroma[X265_CSP_I420].filter_vps[CHROMA_ ## W ## x ## H] =
 x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu; \
 +p.chroma[X265_CSP_I420].add_ps[CHROMA_ ## W ## x ## H] =
 x265_pixel_add_ps_ ## W ## x ## H ## cpu;

  #define SETUP_CHROMA_SP_FUNC_DEF(W, H, cpu) \
  p.chroma[X265_CSP_I420].filter_vsp[CHROMA_ ## W ## x ## H] =
 x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu;
 @@ -234,7 +235,8 @@
  p.luma_vpp[LUMA_ ## W ## x ## H] = x265_interp_8tap_vert_pp_ ## W ##
 x ## H ## cpu; \
  p.luma_vps[LUMA_ ## W ## x ## H] = x265_interp_8tap_vert_ps_ ## W ##
 x ## H ## cpu; \
  p.luma_copy_ps[LUMA_ ## W ## x ## H] = x265_blockcopy_ps_ ## W ## x
 ## H ## cpu; \
 -p.luma_sub_ps[LUMA_ ## W ## x ## H] = x265_pixel_sub_ps_ ## W ## x ##
 H ## cpu;
 +p.luma_sub_ps[LUMA_ ## W ## x ## H] = x265_pixel_sub_ps_ ## W ## x ##
 H ## cpu; \
 +p.luma_add_ps[LUMA_ ## W ## x ## H] = x265_pixel_add_ps_ ## W ## x ##
 H ## cpu;

  #define SETUP_LUMA_SP_FUNC_DEF(W, H, cpu) \
  p.luma_vsp[LUMA_ ## W ## x ## H] = x265_interp_8tap_vert_sp_ ## W ##
 x ## H ## cpu;
 @@ -477,40 +479,6 @@
  CHROMA_SS_FILTERS(_sse2);
  LUMA_SS_FILTERS(_sse2);

 -// This function pointer initialization is temporary will be
 removed
 -// later with macro definitions.  It is used to avoid linker
 errors
 -// until all partitions are coded and commit smaller patches,
 easier to
 -// review.
 -
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x2] =
 x265_blockcopy_sp_4x2_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x4] =
 x265_blockcopy_sp_4x4_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x8] =
 x265_blockcopy_sp_4x8_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x16] =
 x265_blockcopy_sp_4x16_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x2] =
 x265_blockcopy_sp_8x2_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x4] =
 x265_blockcopy_sp_8x4_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x6] =
 x265_blockcopy_sp_8x6_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x8] =
 x265_blockcopy_sp_8x8_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x16] =
 x265_blockcopy_sp_8x16_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_12x16] =
 x265_blockcopy_sp_12x16_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x4] =
 x265_blockcopy_sp_16x4_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x8] =
 x265_blockcopy_sp_16x8_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x12] =
 x265_blockcopy_sp_16x12_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x16] =
 x265_blockcopy_sp_16x16_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x32] =
 x265_blockcopy_sp_16x32_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_24x32] =
 x265_blockcopy_sp_24x32_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x8] =
 x265_blockcopy_sp_32x8_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x16] =
 x265_blockcopy_sp_32x16_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x24] =
 x265_blockcopy_sp_32x24_sse2;
 -p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x32] =
 x265_blockcopy_sp_32x32_sse2;
 -
 -p.luma_copy_sp[LUMA_32x64] = x265_blockcopy_sp_32x64_sse2;
 -p.luma_copy_sp[LUMA_16x64] = x265_blockcopy_sp_16x64_sse2;
 -p.luma_copy_sp[LUMA_48x64] = x265_blockcopy_sp_48x64_sse2;
 -p.luma_copy_sp[LUMA_64x16] = x265_blockcopy_sp_64x16_sse2;
 -p.luma_copy_sp[LUMA_64x32] = x265_blockcopy_sp_64x32_sse2;
 -p.luma_copy_sp[LUMA_64x48] = x265_blockcopy_sp_64x48_sse2;
 -p.luma_copy_sp[LUMA_64x64] = x265_blockcopy_sp_64x64_sse2;
 -
  p.blockfill_s[BLOCK_4x4] = x265_blockfill_s_4x4_sse2;
  p.blockfill_s[BLOCK_8x8] = x265_blockfill_s_8x8_sse2

Re: [x265] [PATCH] asm code for pixeladd_ps_4x4 and testbench integration

2013-11-20 Thread Praveen Tiwari
Merged, sent implementation.

Regards,
Praveen Tiwari




On Wed, Nov 20, 2013 at 6:08 PM, chen chenm...@163.com wrote:

 At 2013-11-20 19:45:24,prav...@multicorewareinc.com wrote:
 # HG changeset patch
 # User Praveen Tiwari
 # Date 1384947915 -19800
 # Node ID c1e556f54d61422d153ff67f4830dc62ddd9
 # Parent  a7fb47a7eddf18634449a5ac898f7c2d029048e9
 asm code for pixeladd_ps_4x4 and testbench integration
 
 diff -r a7fb47a7eddf -r c1e556f54d61 source/common/CMakeLists.txt
 --- a/source/common/CMakeLists.txt   Wed Nov 20 12:57:57 2013 +0530
 +++ b/source/common/CMakeLists.txt   Wed Nov 20 17:15:15 2013 +0530
 @@ -113,7 +113,7 @@
 
  if(ENABLE_PRIMITIVES_ASM)
  set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h)
 -set(A_SRCS pixel-a.asm const-a.asm cpu-a.asm sad-a.asm mc-a.asm 
 mc-a2.asm ipfilter8.asm pixel-util.asm blockcopy8.asm)
 +set(A_SRCS pixel-a.asm const-a.asm cpu-a.asm sad-a.asm mc-a.asm 
 mc-a2.asm ipfilter8.asm pixel-util.asm blockcopy8.asm pixeladd8.asm)
  if (NOT X64)
  set(A_SRCS ${A_SRCS} pixel-32.asm)
  endif()
 diff -r a7fb47a7eddf -r c1e556f54d61 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp   Wed Nov 20 12:57:57 2013 +0530
 +++ b/source/common/x86/asm-primitives.cpp   Wed Nov 20 17:15:15 2013 +0530
 @@ -633,6 +633,13 @@
  p.calcrecon[BLOCK_32x32] = x265_calcRecons32_sse4;
  p.calcresidual[BLOCK_16x16] = x265_getResidual16_sse4;
  p.calcresidual[BLOCK_32x32] = x265_getResidual32_sse4;
 +
 +// This function pointer initialization is temporary will be removed
 +// later with macro definitions.  It is used to avoid linker errors
 +// until all partitions are coded and commit smaller patches, 
 easier to
 +// review.
 +
 +p.chroma_add_ps[X265_CSP_I420][CHROMA_4x4] = 
 x265_pixel_add_ps_4x4_sse4;
  }
  if (cpuMask & X265_CPU_AVX)
  {
 diff -r a7fb47a7eddf -r c1e556f54d61 source/common/x86/pixel.h
 --- a/source/common/x86/pixel.h  Wed Nov 20 12:57:57 2013 +0530
 +++ b/source/common/x86/pixel.h  Wed Nov 20 17:15:15 2013 +0530
 @@ -313,7 +313,8 @@
  SETUP_CHROMA_PIXELSUB_PS_FUNC(8, 32, cpu);
 
  #define SETUP_LUMA_PIXELSUB_PS_FUNC(W, H, cpu) \
 -void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t *dest, intptr_t 
 destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t 
 srcstride1);
 +void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t *dest, intptr_t 
 destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t 
 srcstride1);\
 +void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel *dest, int 
 destride, pixel *src0, int16_t *scr1, int srcStride0, int srcStride1);
 
  #define LUMA_PIXELSUB_DEF(cpu) \
  SETUP_LUMA_PIXELSUB_PS_FUNC(4,   4, cpu); \
 @@ -342,6 +343,8 @@
  SETUP_LUMA_PIXELSUB_PS_FUNC(64, 16, cpu); \
  SETUP_LUMA_PIXELSUB_PS_FUNC(16, 64, cpu);
 
 +//void x265_pixeladd_ps_4x4_sse4(pixel *dest, int destride, pixel 
 *src0, int16_t *scr1, int srcStride0, int srcStride1);
 +
 remove unused line



  CHROMA_PIXELSUB_DEF(_sse4);
  LUMA_PIXELSUB_DEF(_sse4);
 
 diff -r a7fb47a7eddf -r c1e556f54d61 source/common/x86/pixeladd8.asm
 --- /dev/null Thu Jan 01 00:00:00 1970 +
 +++ b/source/common/x86/pixeladd8.asm Wed Nov 20 17:15:15 2013 +0530
 @@ -0,0 +1,79 @@
 +;*
 +;* Copyright (C) 2013 x265 project
 +;*
 +;* Authors: Praveen Kumar Tiwari prav...@multicorewareinc.com
 +;*
 +;* This program is free software; you can redistribute it and/or modify
 +;* it under the terms of the GNU General Public License as published by
 +;* the Free Software Foundation; either version 2 of the License, or
 +;* (at your option) any later version.
 +;*
 +;* This program is distributed in the hope that it will be useful,
 +;* but WITHOUT ANY WARRANTY; without even the implied warranty of
 +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +;* GNU General Public License for more details.
 +;*
 +;* You should have received a copy of the GNU General Public License
 +;* along with this program; if not, write to the Free Software
 +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, 
 USA.
 +;*
 +;* This program is also available under a commercial proprietary license.
 +;* For more information, contact us at licens...@multicorewareinc.com.
 +;*/
 +
 +%include "x86inc.asm"
 +%include "x86util.asm"
 +
 +SECTION_RODATA 32
 +
 +SECTION .text
 +
 +;-
 +; void pixel_add_ps_4x4(pixel *dest, int destride, pixel *src0, int16_t 
 *scr1, int srcStride0, int srcStride1)
 +;-
 +INIT_XMM sse4
 +cglobal pixel_add_ps_4x4, 6, 6, 2, dest, destride, src0, scr1, srcStride0

Re: [x265] [PATCH] bug fix in blockcopy_pp_4x4

2013-11-12 Thread Praveen Tiwari
Please ignore this patch; the old code is also fine. The bug is somewhere else.


Regards,
Praveen Tiwari


On Tue, Nov 12, 2013 at 3:09 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1384249182 -19800
 # Node ID 40695de368b6c890fa27a08c8e5a277c9682149c
 # Parent  d5e30ab8c8b756dd5de2a6e8f455210cb517e28b
 bug fix in blockcopy_pp_4x4

 diff -r d5e30ab8c8b7 -r 40695de368b6 source/common/x86/blockcopy8.asm
 --- a/source/common/x86/blockcopy8.asm  Tue Nov 12 14:14:04 2013 +0530
 +++ b/source/common/x86/blockcopy8.asm  Tue Nov 12 15:09:42 2013 +0530
 @@ -113,13 +113,13 @@
  movd m0, [r2]
  movd m1, [r2 + r3]
  movd m2, [r2 + 2 * r3]
 -lea  r3, [r3 + r3 * 2]
 +lea  r2, [r2 + 2 * r3]
  movd m3, [r2 + r3]

  movd [r0],m0
  movd [r0 + r1],   m1
  movd [r0 + 2 * r1],   m2
 -lea  r1,  [r1 + 2 * r1]
 +lea  r0,  [r0 + 2 * r1]
  movd [r0 + r1],   m3

  RET
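
A quick way to see that both variants in the diff above address the same fourth row: in each case rows 0 to 2 are read before the lea, and the final access ends up at base + 3 * stride. The sketch below mirrors the register arithmetic with plain integers; the function names are made up for the illustration.

```cpp
#include <cstdint>

// Old sequence: lea r3, [r3 + r3 * 2] ; movd m3, [r2 + r3]
inline std::intptr_t row3OffsetOld(std::intptr_t stride)
{
    std::intptr_t r2 = 0;      // offset from the start of the block
    std::intptr_t r3 = stride;
    r3 = r3 + r3 * 2;          // r3 becomes 3 * stride
    return r2 + r3;
}

// New sequence: lea r2, [r2 + 2 * r3] ; movd m3, [r2 + r3]
inline std::intptr_t row3OffsetNew(std::intptr_t stride)
{
    std::intptr_t r2 = 0;
    std::intptr_t r3 = stride;
    r2 = r2 + 2 * r3;          // base advanced by two rows
    return r2 + r3;
}
```

The same reasoning applies to the destination side with r0 and r1.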



Re: [x265] [PATCH] asm code for blockcopy_ps, 8x6, 8x16 and 8x32

2013-11-11 Thread Praveen Tiwari
I mistyped one partition size in the commit message: instead of 8x6 it should
be 8x8. The rest are correct.

Regards,
Praveen Tiwari


On Mon, Nov 11, 2013 at 2:58 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1384162089 -19800
 # Node ID 6da0a0291ed8d10dc3dfdb3df396cd1a8c74ceeb
 # Parent  da0b44e67fe07caa7ed113ec4946a371d96801be
 asm code for blockcopy_ps, 8x6, 8x16 and 8x32

 diff -r da0b44e67fe0 -r 6da0a0291ed8 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Mon Nov 11 14:36:21 2013
 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Mon Nov 11 14:58:09 2013
 +0530
 @@ -459,6 +459,9 @@
  p.chroma_copy_ps[CHROMA_8x2] = x265_blockcopy_ps_8x2_sse4;
  p.chroma_copy_ps[CHROMA_8x4] = x265_blockcopy_ps_8x4_sse4;
  p.chroma_copy_ps[CHROMA_8x6] = x265_blockcopy_ps_8x6_sse4;
 +p.chroma_copy_ps[CHROMA_8x8] = x265_blockcopy_ps_8x8_sse4;
 +p.chroma_copy_ps[CHROMA_8x16] = x265_blockcopy_ps_8x16_sse4;
 +p.chroma_copy_ps[CHROMA_8x32] = x265_blockcopy_ps_8x32_sse4;
  }
  if (cpuMask & X265_CPU_AVX)
  {
 diff -r da0b44e67fe0 -r 6da0a0291ed8 source/common/x86/blockcopy8.asm
 --- a/source/common/x86/blockcopy8.asm  Mon Nov 11 14:36:21 2013 +0530
 +++ b/source/common/x86/blockcopy8.asm  Mon Nov 11 14:58:09 2013 +0530
 @@ -1743,3 +1743,46 @@
  movu   [r0 + r1], m0

  RET
 +

 +;-
 +; void blockcopy_ps_%1x%2(int16_t *dest, intptr_t destStride, pixel *src,
 intptr_t srcStride);

 +;-
 +%macro BLOCKCOPY_PS_W8_H4 2
 +INIT_XMM sse4
 +cglobal blockcopy_ps_%1x%2, 4, 5, 1, dest, destStride, src, srcStride
 +
 +add        r1, r1
 +mov        r4d, %2/4
 +
 +.loop
 +  movh       m0, [r2]
 +  pmovzxbw   m0, m0
 +  movu       [r0], m0
 +
 +  movh       m0, [r2 + r3]
 +  pmovzxbw   m0, m0
 +  movu       [r0 + r1], m0
 +
 +  movh       m0, [r2 + 2 * r3]
 +  pmovzxbw   m0, m0
 +  movu       [r0 + 2 * r1], m0
 +
 +  lea        r2, [r2 + 2 * r3]
 +  lea        r0, [r0 + 2 * r1]
 +
 +  movh       m0, [r2 + r3]
 +  pmovzxbw   m0, m0
 +  movu       [r0 + r1], m0
 +
 +  lea        r0, [r0 + 2 * r1]
 +  lea        r2, [r2 + 2 * r3]
 +
 +  dec        r4d
 +  jnz        .loop
 +
 +RET
 +%endmacro
 +
 +BLOCKCOPY_PS_W8_H4  8,  8
 +BLOCKCOPY_PS_W8_H4  8, 16
 +BLOCKCOPY_PS_W8_H4  8, 32
 diff -r da0b44e67fe0 -r 6da0a0291ed8 source/common/x86/blockcopy8.h
 --- a/source/common/x86/blockcopy8.h Mon Nov 11 14:36:21 2013 +0530
 +++ b/source/common/x86/blockcopy8.h Mon Nov 11 14:58:09 2013 +0530
 @@ -96,7 +96,10 @@
  #define CHROMA_BLOCKCOPY_DEF_SSE4(cpu) \
  SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 2, cpu); \
  SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 4, cpu); \
 -SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 6, cpu);
 +SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 6, cpu); \
 +SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 8, cpu); \
 +SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 16, cpu); \
 +SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 32, cpu);

  CHROMA_BLOCKCOPY_DEF_SSE4(_sse4);
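
For reference, what the BLOCKCOPY_PS_W8_H4 macro above computes can be written in plain C++ roughly as follows. This is a sketch, not the project's C primitive: the function name is made up, pixel is assumed to be 8 bits wide, and strides are assumed to be counted in elements (the asm doubles r1 because the destination holds 16-bit values).

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit pixel build

// Hypothetical reference: copy a width x height block of pixels into an
// int16_t buffer, zero-extending each sample.
inline void blockcopy_ps_ref(int16_t* dest, std::intptr_t destStride,
                             const pixel* src, std::intptr_t srcStride,
                             int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dest[x] = static_cast<int16_t>(src[x]); // same effect as pmovzxbw

        dest += destStride; // strides in elements, not bytes
        src += srcStride;
    }
}
```

The asm version handles four rows per loop iteration, which is why the macro is only instantiated for heights that are multiples of 4 (8, 16 and 32).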




Re: [x265] [PATCH] asm code for blockcopy_ps_16x4

2013-11-11 Thread Praveen Tiwari
Fixed.


Regards,
Praveen Tiwari


On Mon, Nov 11, 2013 at 4:06 PM, chen chenm...@163.com wrote:

 +movu   m1, [r2]
 +punpcklbw  m2, m1,m0
 There is a hidden register copy here; try to avoid it with SSE4.1 pmovzxbw m2, m1

 +movu   [r0],   m2
 +punpckhbw  m1, m0
 +movu   [r0 + 16],  m1



Re: [x265] [PATCH] asm code for blockcopy_ps_2x4

2013-11-11 Thread Praveen Tiwari
Replaced.

Regards,
Praveen Tiwari


On Mon, Nov 11, 2013 at 7:02 PM, chen chenm...@163.com wrote:

 +movd   m0,[r2]
 +pmovzxbw   m0,m0
 +pextrd [r0],  m0,   0
 same as movd



Re: [x265] [PATCH] asm code for blockcopy_ps_24x32

2013-11-11 Thread Praveen Tiwari
Sent Patch.

Regards,
Praveen Tiwari


On Mon, Nov 11, 2013 at 6:54 PM, chen chenm...@163.com wrote:


 +;-

 +; void blockcopy_ps_%1x%2(int16_t *dest, intptr_t destStride, pixel *src, 
 intptr_t srcStride);

 +;-
 +%macro BLOCKCOPY_PS_W24_H2 2
 +INIT_XMM sse4
 +cglobal blockcopy_ps_%1x%2, 4, 5, 3, dest, destStride, src, srcStride
 +
 +add    r1,  r1
 +mov    r4d, %2/2
 +pxor   m0,  m0
 +
 +.loop
 +  movu   m1, [r2]
 +  pmovzxbw   m2, m1
 +  movu   [r0],   m2
 +  punpckhbw  m1, m0
 +  movu   [r0 + 16],  m1
 +
 +  movu   m1, [r2 + 16]
 movh

 +  pmovzxbw   m1, m1
 +  movu   [r0 + 32],  m1



[x265] Fwd: [PATCH] blockcopy_sp_4x8, optimized asm code

2013-11-08 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Nov 8, 2013 at 3:29 PM
Subject: Re: [x265] [PATCH] blockcopy_sp_4x8, optimized asm code
To: Development for x265 x265-devel@videolan.org


At 2013-11-08 17:34:19,prav...@multicorewareinc.com wrote:

# HG changeset patch
# User Praveen Tiwari
# Date 1383903250 -19800
# Node ID 1e6bf52b6e3471b81e636569daa667f6dec9838a
# Parent  44ac213169c906eab5cba6b4aba876391b81da99
blockcopy_sp_4x8, optimized asm code

diff -r 44ac213169c9 -r 1e6bf52b6e34 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Fri Nov 08 14:46:07 2013 +0530
+++ b/source/common/x86/blockcopy8.asm Fri Nov 08 15:04:10 2013 +0530
@@ -948,45 +948,42 @@
 ; void blockcopy_sp_4x8(pixel *dest, intptr_t destStride, int16_t *src, 
 intptr_t srcStride)
 ;-
 INIT_XMM sse2
-cglobal blockcopy_sp_4x8, 4, 6, 8, dest, destStride, src, srcStride
+cglobal blockcopy_sp_4x8, 4, 4, 8, dest, destStride, src, srcStride
you have used r5
Min, r5 was in the old code; I have removed it. I think you are referring to
[ -lea r5, [r4 + 2 * r3] ]. The new code uses just 4 registers.



___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Fwd: [PATCH] blockcopy_sp_8x2, optimized asm code

2013-11-08 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Nov 8, 2013 at 4:30 PM
Subject: Re: [x265] [PATCH] blockcopy_sp_8x2, optimized asm code
To: Development for x265 x265-devel@videolan.org


+movh   [r0],   m0
+movhps [r0 + r1],  m0
Changing movh to movlps would be better; movh+movhps mixes the float and
integer paths.
Will movh+movhps cause any problem? I thought movh would be faster.
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Fwd: [PATCH] blockcopy_sp_16xN, optimized asm code

2013-11-08 Thread Praveen Tiwari
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Nov 8, 2013 at 7:10 PM
Subject: Re: [x265] [PATCH] blockcopy_sp_16xN, optimized asm code
To: Development for x265 x265-devel@videolan.org


The code is right, but it needs to be uncrustified, e.g. add r3, r3
Does uncrustify work for .asm files?
At 2013-11-08 21:32:05,prav...@multicorewareinc.com wrote:

# HG changeset patch
# User Praveen Tiwari
# Date 1383917516 -19800
# Node ID 662664f0863b38b838a15867745c5564f574fb09
# Parent  227a5666e08869d36e07a75f3db95dd94c774715
blockcopy_sp_16xN, optimized asm code

diff -r 227a5666e088 -r 662664f0863b source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Fri Nov 08 17:38:24 2013 +0530
+++ b/source/common/x86/blockcopy8.asm Fri Nov 08 19:01:56 2013 +0530
@@ -1325,51 +1325,38 @@
 ;-
 %macro BLOCKCOPY_SP_W16_H4 2
 INIT_XMM sse2
-cglobal blockcopy_sp_%1x%2, 4, 7, 7, dest, destStride, src, srcStride
+cglobal blockcopy_sp_%1x%2, 4, 5, 8, dest, destStride, src, srcStride

-mov r6d,%2
+mov r4d, %2/4

-add    r3,  r3
-
-mova   m0,  [tab_Vm]
+add r3,  r3

 .loop
- movu   m1,  [r2]
- movu   m2,  [r2 + 16]
- movu   m3,  [r2 + r3]
- movu   m4,  [r2 + r3 + 16]
- movu   m5,  [r2 + 2 * r3]
- movu   m6,  [r2 + 2 * r3 + 16]
+ movu   m0,  [r2]
+ movu   m1,  [r2 + 16]
+ movu   m2,  [r2 + r3]
+ movu   m3,  [r2 + r3 + 16]
+ movu   m4,  [r2 + 2 * r3]
+ movu   m5,  [r2 + 2 * r3 + 16]
+ lea    r2,  [r2 + 2 * r3]
+ movu   m6,  [r2 + r3]
+ movu   m7,  [r2 + r3 + 16]

- pshufb m1,  m0
- pshufb m2,  m0
- pshufb m3,  m0
- pshufb m4,  m0
- pshufb m5,  m0
- pshufb m6,  m0
+ packuswb   m0,  m1
+ packuswb   m2,  m3
+ packuswb   m4,  m5
+ packuswb   m6,  m7

- movh   [r0],  m1
- movh   [r0 + 8],  m2
- movh   [r0 + r1], m3
- movh   [r0 + r1 + 8], m4
- movh   [r0 + 2 * r1], m5
- movh   [r0 + 2 * r1 + 8], m6
+ movu   [r0],  m0
+ movu   [r0 + r1], m2
+ movu   [r0 + 2 * r1], m4
+ lea    r0,  [r0 + 2 * r1]
+ movu   [r0 + r1], m6

- lea    r4,  [r2 + 2 * r3]
- movu   m1,  [r4 + r3]
- movu   m2,  [r4 + r3 + 16]
+ lea    r0,  [r0 + 2 * r1]
+ lea    r2,  [r2 + 2 * r3]

- pshufb m1,  m0
- pshufb m2,  m0
-
- lea    r5,  [r0 + 2 * r1]
- movh   [r5 + r1], m1
- movh   [r5 + r1 + 8], m2
-
- lea    r0,  [r5 + 2 * r1]
- lea    r2,  [r4 + 2 * r3]
-
- sub    r6d, 4
+ dec    r4d
  jnz    .loop

 RET
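
For reference, here is a scalar sketch of what the packuswb-based loop above computes. It is not taken from the patch: the names blockcopy_sp_ref and saturate_u8 are hypothetical, and 8-bit pixels are assumed. packuswb narrows signed 16-bit lanes to unsigned bytes with saturation, so samples already in 0..255 pass through unchanged. Strides here are in elements; the asm doubles r3 (add r3, r3) because each int16_t source element is two bytes wide.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one packuswb lane: clamp a signed 16-bit value to 0..255.
static inline uint8_t saturate_u8(int16_t v)
{
    return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Hypothetical reference for a short-to-pixel block copy (8-bit pixels assumed).
// Both strides are counted in elements, not bytes.
template<int W, int H>
void blockcopy_sp_ref(uint8_t* dst, intptr_t dstStride,
                      const int16_t* src, intptr_t srcStride)
{
    assert(dst && src);
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            dst[x] = saturate_u8(src[x]);

        dst += dstStride;
        src += srcStride;
    }
}
```

A reference like this could serve as the C side of a unit test for any of the 16xN sizes, e.g. blockcopy_sp_ref<16, 4>(dst, dstStride, src, srcStride).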
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel



[x265] Fwd: [PATCH] added pixelsub_ps C primitive and function pointer creation

2013-11-07 Thread Praveen Tiwari
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Thu, Nov 7, 2013 at 1:51 PM
Subject: Re: [x265] [PATCH] added pixelsub_ps C primitive and function
pointer creation
To: Development for x265 x265-devel@videolan.org





On Thu, Nov 7, 2013 at 1:01 AM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1383807695 -19800
 # Node ID 34ba8955747b66dcf3471fa216d15b97a3b07e0c
 # Parent  93cccbe49a93dd4c054ef06aca76974948793613
 added pixelsub_ps C primitive and function pointer creation

 diff -r 93cccbe49a93 -r 34ba8955747b source/common/pixel.cpp
 --- a/source/common/pixel.cpp   Wed Nov 06 19:49:38 2013 -0600
 +++ b/source/common/pixel.cpp   Thu Nov 07 12:31:35 2013 +0530
 @@ -790,6 +790,22 @@
  b += strideb;
  }
  }
 +
 +template<int bx, int by>
 +void pixelsub_ps_c(int16_t *a, intptr_t dstride, pixel *b0, pixel *b1,
 intptr_t sstride0, intptr_t sstride1)
 +{
 +for (int y = 0; y < by; y++)
 +{
 +for (int x = 0; x < bx; x++)
 +{
 +a[x] = (int16_t)(b0[x] - b1[x]);
 +}
 +
 +b0 += sstride0;
 +b1 += sstride1;
 +a += dstride;
 +}
 +}
  }  // end anonymous namespace

  namespace x265 {
 @@ -832,10 +848,12 @@

  #define CHROMA(W, H) \
  p.chroma_copy_pp[CHROMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>; \
 -p.chroma_copy_sp[CHROMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;
 +p.chroma_copy_sp[CHROMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;\
 +p.chroma_pixelsub_ps[CHROMA_ ## W ## x ## H] = pixelsub_ps_c<W, H>;
  #define LUMA(W, H) \
  p.luma_copy_pp[LUMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>; \
 -p.luma_copy_sp[LUMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;
 +p.luma_copy_sp[LUMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;\
 +p.luma_pixelsub_ps[LUMA_ ## W ## x ## H] = pixelsub_ps_c<W, H>;

  LUMA(4, 4);
  LUMA(8, 8);
 diff -r 93cccbe49a93 -r 34ba8955747b source/common/primitives.h
 --- a/source/common/primitives.hWed Nov 06 19:49:38 2013 -0600
 +++ b/source/common/primitives.hThu Nov 07 12:31:35 2013 +0530
 @@ -216,6 +216,8 @@
  typedef void (*copy_pp_t)(pixel *dst, intptr_t dstride, pixel *src,
 intptr_t sstride); // dst is aligned
  typedef void (*copy_sp_t)(pixel *dst, intptr_t dstStride, int16_t *src,
 intptr_t srcStride);

 +typedef void (*pixelsub_ps_t)(int16_t *dst, intptr_t dstStride, pixel
 *src0, pixel *src1, intptr_t srcStride0, intptr_t srcStride1);


there's already a function typedef with the same name, that one needs to
be removed or this one needs to be renamed

I can see only pixelsub_sp_t among the old function typedefs; the one I have
created is pixelsub_ps_t (pixel to short).
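
To make the pixel-to-short direction concrete, here is a self-contained sketch of the primitive and its function-pointer type as quoted above. It assumes an 8-bit build (pixel = uint8_t) and adds const qualifiers that are not in the patch; everything else follows the quoted code.

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// "ps" = pixel to short: the inputs are pixels, the residual is int16_t.
typedef void (*pixelsub_ps_t)(int16_t* dst, intptr_t dstStride,
                              const pixel* src0, const pixel* src1,
                              intptr_t srcStride0, intptr_t srcStride1);

template<int bx, int by>
void pixelsub_ps_c(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1,
                   intptr_t sstride0, intptr_t sstride1)
{
    assert(a && b0 && b1);
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            a[x] = (int16_t)(b0[x] - b1[x]); // may be negative, hence int16_t

        b0 += sstride0;
        b1 += sstride1;
        a += dstride;
    }
}
```

A table entry would then be filled the same way as in the LUMA/CHROMA macros, e.g. pixelsub_ps_t f = pixelsub_ps_c<4, 4>;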


 +
  /* Define a structure containing function pointers to optimized encoder
   * primitives.  Each pointer can reference either an assembly routine,
   * a vectorized primitive, or a C function. */
 @@ -283,6 +285,9 @@
  pixeladd_pp_t   pixeladd_pp;
  pixelavg_pp_t   pixelavg_pp[NUM_LUMA_PARTITIONS];

 +pixelsub_ps_t   chroma_pixelsub_ps[NUM_CHROMA_PARTITIONS];
 +pixelsub_ps_t   luma_pixelsub_ps[NUM_LUMA_PARTITIONS];
 +
  scale_t scale1D_128to64;
  scale_t scale2D_64to32;
  downscale_t frame_init_lowres_core;
 ___
 x265-devel mailing list
 x265-devel@videolan.org
 https://mailman.videolan.org/listinfo/x265-devel




-- 
Steve Borho

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] asm code for blockfil_s, 16x16

2013-11-07 Thread Praveen Tiwari
Applied to code.

Regards,
Praveen Tiwari


On Thu, Nov 7, 2013 at 8:09 PM, chen chenm...@163.com wrote:

 +mov    r3d,   %2
 %2/8

 +
 + sub    r3d,   8
 + jnz    .loop
 dec r3d

 ___
 x265-devel mailing list
 x265-devel@videolan.org
 https://mailman.videolan.org/listinfo/x265-devel


___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Fwd: [PATCH] asm code for blockfil_s, 4x4

2013-11-07 Thread Praveen Tiwari
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: 2013/11/8
Subject: Re: [x265] [PATCH] asm code for blockfil_s, 4x4
To: Development for x265 x265-devel@videolan.org





On Thu, Nov 7, 2013 at 6:56 AM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1383828996 -19800
 # Node ID f2af7af43dfcb08135a08e755f654314a89efae7
 # Parent  d71f86b1c58b4fc9f8a3ffeaaef45c60f8bcc468
 asm code for blockfil_s, 4x4


blockfill has two l's

Actually, I named all the pointers blockfill (two l's) and the functions
blockfil (one l), perhaps to match the naming convention of the old code, but
it does look odd; I will take care of it.

diff -r d71f86b1c58b -r f2af7af43dfc source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Thu Nov 07 18:16:22 2013
 +0530
 +++ b/source/common/x86/asm-primitives.cpp  Thu Nov 07 18:26:36 2013
 +0530
 @@ -361,6 +361,8 @@
  p.luma_copy_sp[LUMA_64x32] = x265_blockcopy_sp_64x32_sse2;
  p.luma_copy_sp[LUMA_64x48] = x265_blockcopy_sp_64x48_sse2;
  p.luma_copy_sp[LUMA_64x64] = x265_blockcopy_sp_64x64_sse2;
 +
 +p.blockfill_s[BLOCK_4x4] = x265_blockfil_s_4x4_sse2;
  #if X86_64
  p.satd[LUMA_8x32] = x265_pixel_satd_8x32_sse2;
  p.satd[LUMA_16x4] = x265_pixel_satd_16x4_sse2;
 diff -r d71f86b1c58b -r f2af7af43dfc source/common/x86/blockcopy8.asm
 --- a/source/common/x86/blockcopy8.asm  Thu Nov 07 18:16:22 2013 +0530
 +++ b/source/common/x86/blockcopy8.asm  Thu Nov 07 18:26:36 2013 +0530
 @@ -1646,3 +1646,22 @@
  BLOCKCOPY_SP_W64_H1 64, 32
  BLOCKCOPY_SP_W64_H1 64, 48
  BLOCKCOPY_SP_W64_H1 64, 64
 +

 +;-
 +; void blockfil_s_4x4(int16_t *dest, intptr_t destride, int16_t val)

 +;-
 +INIT_XMM sse2
 +cglobal blockfil_s_4x4, 3, 3, 1, dest, destStride, val
 +
 +add        r1,            r1
 +
 +movd       m0,            r2d
 +pshuflw    m0,            m0, 0
 +
 +movh       [r0],          m0
 +movh       [r0 + r1],     m0
 +movh       [r0 + 2 * r1], m0
 +lea        r0,            [r0 + 2 * r1]
 +movh       [r0 + r1],     m0
 +
 +RET
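
As a cross-check for the routine above, a scalar reference might look like the sketch below. The name blockfill_s_ref is hypothetical. The stride is counted in int16_t elements, which is why the asm doubles r1 before using it as a byte offset; pshuflw with an immediate of 0 broadcasts the low word so one 8-byte movh writes a full 4-sample row.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference: fill a size x size block of shorts with val.
template<int size>
void blockfill_s_ref(int16_t* dst, intptr_t dstride, int16_t val)
{
    assert(dst);
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
            dst[y * dstride + x] = val; // dstride is in elements, not bytes
    }
}
```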
 diff -r d71f86b1c58b -r f2af7af43dfc source/common/x86/pixel.h
 --- a/source/common/x86/pixel.h Thu Nov 07 18:16:22 2013 +0530
 +++ b/source/common/x86/pixel.h Thu Nov 07 18:26:36 2013 +0530
 @@ -266,6 +266,8 @@
  DECL_ADS(2, avx2)
  DECL_ADS(1, avx2)

 +void x265_blockfil_s_4x4_sse2(int16_t *dst, intptr_t dstride, int16_t
 val);
 +


this belongs in blockcopy8.h
Will be moved to blockcopy8.h.


  #undef DECL_PIXELS
  #undef DECL_SUF
  #undef DECL_HEVC_SSD
 ___
 x265-devel mailing list
 x265-devel@videolan.org
 https://mailman.videolan.org/listinfo/x265-devel




-- 
Steve Borho

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] asm code for blockcopy_sp, 6x8

2013-11-06 Thread Praveen Tiwari
Fixed.

Regards,
Praveen Tiwari



On Wed, Nov 6, 2013 at 8:09 PM, chen chenm...@163.com wrote:

 + movd  [r0 + 2 * r1], m3
 + pextrw    r6,                m3,    2
 + mov   [r0 + 2 * r1 + 4], r6w
 SSE4.1 supports the form below:
  pextrw[r0 + 2 * r1 + 4],  m3,2

 ___
 x265-devel mailing list
 x265-devel@videolan.org
 https://mailman.videolan.org/listinfo/x265-devel


___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] asm: assembly code for pixel_sad_12x16

2013-10-30 Thread Praveen Tiwari
-- Forwarded message --
From: dnyanesh...@multicorewareinc.com
Date: Wed, Oct 30, 2013 at 7:47 PM
Subject: [x265] [PATCH] asm: assembly code for pixel_sad_12x16
To: x265-devel@videolan.org


# HG changeset patch
# User Dnyaneshwar Gorade dnyanesh...@multicorewareinc.com
# Date 1383142575 -19800
#  Wed Oct 30 19:46:15 2013 +0530
# Node ID 5037cc891114619e32ceeff332884d0abfd138fd
# Parent  62a51fe2fcbfd76fc8476a6f714f961b3f3f23ef
asm: assembly code for pixel_sad_12x16

diff -r 62a51fe2fcbf -r 5037cc891114 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Wed Oct 30 18:11:01 2013
+0530
+++ b/source/common/x86/asm-primitives.cpp  Wed Oct 30 19:46:15 2013
+0530
@@ -253,6 +253,7 @@

 p.sad[LUMA_48x64]  = x265_pixel_sad_48x64_sse2;
 p.sad[LUMA_24x32]  = x265_pixel_sad_24x32_sse2;
+p.sad[LUMA_12x16]  = x265_pixel_sad_12x16_sse2;

 ASSGN_SSE(sse2);
 INIT2(sad, _sse2);
diff -r 62a51fe2fcbf -r 5037cc891114 source/common/x86/pixel.h
--- a/source/common/x86/pixel.h Wed Oct 30 18:11:01 2013 +0530
+++ b/source/common/x86/pixel.h Wed Oct 30 19:46:15 2013 +0530
@@ -53,6 +53,7 @@
 ret x265_pixel_ ## name ## _64x64_ ## suffix args; \
 ret x265_pixel_ ## name ## _48x64_ ## suffix args; \
 ret x265_pixel_ ## name ## _24x32_ ## suffix args; \
+ret x265_pixel_ ## name ## _12x16_ ## suffix args; \

 #define DECL_X1(name, suffix) \
 DECL_PIXELS(int, name, suffix, (pixel *, intptr_t, pixel *, intptr_t))
diff -r 62a51fe2fcbf -r 5037cc891114 source/common/x86/sad-a.asm
--- a/source/common/x86/sad-a.asm   Wed Oct 30 18:11:01 2013 +0530
+++ b/source/common/x86/sad-a.asm   Wed Oct 30 19:46:15 2013 +0530
@@ -31,8 +31,9 @@

 SECTION_RODATA 32

+MSK:  db
255,255,255,255,255,255,255,255,255,255,255,255,0,0,0,0
 pb_shuf8x8c2: times 2 db 0,0,0,0,8,8,8,8,-1,-1,-1,-1,-1,-1,-1,-1
-hpred_shuf: db 0,0,2,2,8,8,10,10,1,1,3,3,9,9,11,11
+hpred_shuf:   db 0,0,2,2,8,8,10,10,1,1,3,3,9,9,11,11

 SECTION .text

@@ -119,6 +120,39 @@
 RET
 %endmacro

+%macro PROCESS_SAD_12x4 0
+movu    m1,  [r2]
+movu    m2,  [r0]
+pand    m1,  m4
+pand    m2,  m4
+psadbw  m1,  m2
+paddd   m0,  m1
+lea r2,  [r2 + r3]
+lea r0,  [r0 + r1]
+movu    m1,  [r2]
+movu    m2,  [r0]
+pand    m1,  m4
+pand    m2,  m4
+psadbw  m1,  m2
+paddd   m0,  m1
+lea r2,  [r2 + r3]
+lea r0,  [r0 + r1]
+movu    m1,  [r2]
+movu    m2,  [r0]

We don't need to recompute the address every time we add the stride to it.
We should first try to form the address directly, using a scale factor of
1, 2, 4 or 8; only if that is not possible should we compute it separately.
For example, the four instructions above can be replaced with just these two:

movu    m1,  [r2 + 2 * r3]
movu    m2,  [r0 + 2 * r1]

+pand    m1,  m4
+pand    m2,  m4
+psadbw  m1,  m2
+paddd   m0,  m1
+lea r2,  [r2 + r3]
+lea r0,  [r0 + r1]
+movu    m1,  [r2]
+movu    m2,  [r0]
+pand    m1,  m4
+pand    m2,  m4
+psadbw  m1,  m2
+paddd   m0,  m1
+%endmacro
+
 %macro PROCESS_SAD_16x4 0
 movu    m1,  [r2]
 movu    m2,  [r2 + r3]
@@ -1007,6 +1041,29 @@
 movd    eax, m0
 RET

+;-
+; int pixel_sad_12x16( uint8_t *, intptr_t, uint8_t *, intptr_t )
+;-
+cglobal pixel_sad_12x16, 4,4,4
+mova  m4,  [MSK]
+pxor  m0,  m0
+
+PROCESS_SAD_12x4
+lea r2,  [r2 + r3]
+lea r0,  [r0 + r1]
+PROCESS_SAD_12x4
+lea r2,  [r2 + r3]
+lea r0,  [r0 + r1]
+PROCESS_SAD_12x4
+lea r2,  [r2 + r3]
+lea r0,  [r0 + r1]
+PROCESS_SAD_12x4
+
+movhlps m1,  m0
+paddd   m0,  m1
+movd    eax, m0
+RET
+
 %endmacro

The lea instruction is overused; please eliminate these and use the available
registers to save load operations.
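
For checking the 12x16 routine, a scalar reference could look like the sketch below (sad_ref is a hypothetical name, 8-bit pixels assumed). The MSK constant in the patch zeroes bytes 12..15 of each 16-byte load so that only 12 columns contribute to psadbw; the scalar equivalent is simply to stop at column 12.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical scalar reference: sum of absolute differences over a W x H block.
template<int W, int H>
int sad_ref(const uint8_t* pix1, intptr_t stride1,
            const uint8_t* pix2, intptr_t stride2)
{
    assert(pix1 && pix2);
    int sum = 0;
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            sum += std::abs(pix1[x] - pix2[x]); // operands promote to int

        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```

Filling columns 12..15 with unrelated values in a test buffer is a quick way to confirm that the masked asm version ignores them, as sad_ref<12, 16> does.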
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Fwd: [PATCH] assembly code for pixel_sad_x3_24x32

2013-10-30 Thread Praveen Tiwari
-- Forwarded message --
From: yuva...@multicorewareinc.com
Date: Wed, Oct 30, 2013 at 2:38 PM
Subject: [x265] [PATCH] assembly code for pixel_sad_x3_24x32
To: x265-devel@videolan.org


# HG changeset patch
# User Yuvaraj Venkatesh yuva...@multicorewareinc.com
# Date 1383124045 -19800
#  Wed Oct 30 14:37:25 2013 +0530
# Node ID eca1142d1cec9303afad71108494f9076586ce05
# Parent  65462024832b4498cd9f05a5a81cb6b559bf378b
assembly code for pixel_sad_x3_24x32

diff -r 65462024832b -r eca1142d1cec source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp  Wed Oct 30 01:54:16 2013
-0500
+++ b/source/common/x86/asm-primitives.cpp  Wed Oct 30 14:37:25 2013
+0530
@@ -292,6 +292,7 @@
 p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_ssse3;
 p.sad_x3[LUMA_16x64] = x265_pixel_sad_x3_16x64_ssse3;
 p.sad_x4[LUMA_16x64] = x265_pixel_sad_x4_16x64_ssse3;
+p.sad_x3[LUMA_24x32] = x265_pixel_sad_x3_24x32_ssse3;

 p.luma_hvpp[LUMA_8x8] = x265_interp_8tap_hv_pp_8x8_ssse3;
 p.ipfilter_sp[FILTER_V_S_P_8] = x265_interp_8tap_v_sp_ssse3;
@@ -325,6 +326,7 @@
 p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_avx;
 p.sad_x3[LUMA_16x64] = x265_pixel_sad_x3_16x64_avx;
 p.sad_x4[LUMA_16x64] = x265_pixel_sad_x4_16x64_avx;
+p.sad_x3[LUMA_24x32] = x265_pixel_sad_x3_24x32_avx;
 }
 if (cpuMask & X265_CPU_XOP)
 {
diff -r 65462024832b -r eca1142d1cec source/common/x86/pixel.h
--- a/source/common/x86/pixel.h Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/x86/pixel.h Wed Oct 30 14:37:25 2013 +0530
@@ -47,6 +47,7 @@
 ret x265_pixel_ ## name ## _32x24_ ## suffix args; \
 ret x265_pixel_ ## name ## _32x32_ ## suffix args; \
 ret x265_pixel_ ## name ## _32x64_ ## suffix args; \
+ret x265_pixel_ ## name ## _24x32_ ## suffix args; \

 #define DECL_X1(name, suffix) \
 DECL_PIXELS(int, name, suffix, (pixel *, intptr_t, pixel *, intptr_t))
diff -r 65462024832b -r eca1142d1cec source/common/x86/sad-a.asm
--- a/source/common/x86/sad-a.asm   Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/x86/sad-a.asm   Wed Oct 30 14:37:25 2013 +0530
@@ -1988,6 +1988,117 @@
 RET
 %endmacro

+%macro SAD_X3_24x4 0
+mova    m3,  [r0]
+mova    m4,  [r0 + 16]
+movu    m5,  [r1]
+movu    m6,  [r1 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m0,  m5
+movu    m5,  [r2]
+movu    m6,  [r2 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m1,  m5
+movu    m5,  [r3]
+movu    m6,  [r3 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m2,  m5
+lea r0,  [r0 + FENC_STRIDE]
+lea r1,  [r1 + r4]
+lea r2,  [r2 + r4]
+lea r3,  [r3 + r4]
+mova    m3,  [r0]
+mova    m4,  [r0 + 16]
+movu    m5,  [r1]
+movu    m6,  [r1 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m0,  m5
+movu    m5,  [r2]
+movu    m6,  [r2 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m1,  m5
+movu    m5,  [r3]
+movu    m6,  [r3 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m2,  m5
+lea r0,  [r0 + FENC_STRIDE]
+lea r1,  [r1 + r4]
+lea r2,  [r2 + r4]
+lea r3,  [r3 + r4]
+mova    m3,  [r0]
+mova    m4,  [r0 + 16]
+movu    m5,  [r1]
+movu    m6,  [r1 + 16]

You don't need to recompute the address every time. You can form it directly,
like

mova    m4,  [r0 + 2 * r4]
mova    m4,  [r0 + 4 * r4]
mova    m4,  [r0 + 8 * r4]

or even like

mova    m4,  [r0 + 2 * r4 + constant]

Use this idea to eliminate the lea instructions. Scale factors of 1, 2, 4
and 8 are allowed.
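
The same idea in scalar form, as a rough sketch only: rows are addressed as base + k * stride and the base pointers advance once per group of four rows rather than once per row. The names sad_x3_ref and rowSad are hypothetical, and FENC_STRIDE = 64 is an assumption about the encoder's fixed source-block stride. In assembly a scale of 3 is not available, so 3 * stride would have to be kept in a spare register.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

const intptr_t FENC_STRIDE = 64; // assumption: fixed stride of the encode buffer

// Sum of absolute differences over one row of W pixels.
template<int W>
static inline int rowSad(const uint8_t* a, const uint8_t* b)
{
    int s = 0;
    for (int x = 0; x < W; x++)
        s += std::abs(a[x] - b[x]);
    return s;
}

// Hypothetical reference: SAD of one source block against three candidates.
template<int W, int H>
void sad_x3_ref(const uint8_t* fenc, const uint8_t* ref0, const uint8_t* ref1,
                const uint8_t* ref2, intptr_t refStride, int32_t* res)
{
    assert(fenc && res && H % 4 == 0);
    const uint8_t* refs[3] = { ref0, ref1, ref2 };
    res[0] = res[1] = res[2] = 0;

    for (int y = 0; y < H; y += 4)
    {
        // Four rows are reached with offsets k * stride from unchanged bases.
        for (int k = 0; k < 4; k++)
            for (int i = 0; i < 3; i++)
                res[i] += rowSad<W>(fenc + k * FENC_STRIDE, refs[i] + k * refStride);

        // The bases move only once per group of four rows.
        fenc += 4 * FENC_STRIDE;
        for (int i = 0; i < 3; i++)
            refs[i] += 4 * refStride;
    }
}
```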

+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m0,  m5
+movum5,  [r2]
+movum6,  [r2 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m1,  m5
+movum5,  [r3]
+movum6,  [r3 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m2,  m5
+lea r0,  [r0 + FENC_STRIDE]
+lea r1,  [r1 + r4]
+lea r2,  [r2 + r4]
+lea r3,  [r3 + r4]
+movam3,  [r0]
+movam4,  [r0 + 16]
+movum5,  [r1]
+movum6,  [r1 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m0,  m5
+movum5,  [r2]
+movum6,  [r2 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 84
+paddd   m5,  m6
+paddd   m1,  m5
+movum5,  [r3]
+movum6,  [r3 + 16]
+psadbw  m5,  m3
+psadbw  m6,  m4
+pshufd  m6,  m6, 

[x265] Fwd: [PATCH 4 of 4] asm: interp_8tap_v_sp for ipfilter_sp[FILTER_V_S_P_8]

2013-10-28 Thread Praveen Tiwari
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Mon, Oct 28, 2013 at 11:55 PM
Subject: Re: [x265] [PATCH 4 of 4] asm: interp_8tap_v_sp for
ipfilter_sp[FILTER_V_S_P_8]
To: Development for x265 x265-devel@videolan.org





On Mon, Oct 28, 2013 at 9:24 AM, Min Chen chenm...@163.com wrote:

 # HG changeset patch
 # User Min Chen chenm...@163.com
 # Date 1382970234 -28800
 # Node ID 41425f18efe14be468715bfa68fdebbb9a49145f
 # Parent  5f7b3d06d94c6aec44bfd4a7bfb6f6751182b4ed
 asm: interp_8tap_v_sp for ipfilter_sp[FILTER_V_S_P_8]



I'm getting link errors on x86_64 from this series:

error LNK2017: 'ADDR32' relocation to 'tab_LumaCoeffV' invalid without
/LARGEADDRESSAWARE:NO

This error occurs because a [register + global_constant] address is not
supported in 64-bit builds. I generally use the PIC macro to guard it, like this:

%ifdef PIC
lea     r5, [tab_ChromaCoeff]
movd    m0, [r5 + r4 * 4]
%else
movd    m0, [tab_ChromaCoeff + r4 * 4]
%endif

In general, I think we should drop all of the interpolation merging while
we get all the assembly completed for motion compensation.  When the
assembly is all together, we can experiment and figure out if it makes sense
to re-merge some of them back together.


 diff -r 5f7b3d06d94c -r 41425f18efe1 source/common/x86/asm-primitives.cpp
 --- a/source/common/x86/asm-primitives.cpp  Mon Oct 28 22:23:29 2013
 +0800
 +++ b/source/common/x86/asm-primitives.cpp  Mon Oct 28 22:23:54 2013
 +0800
 @@ -280,6 +280,7 @@
  p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_ssse3;

  p.luma_hvpp[LUMA_8x8] = x265_interp_8tap_hv_pp_8x8_ssse3;
 +p.ipfilter_sp[FILTER_V_S_P_8] = x265_interp_8tap_v_sp_ssse3;
  }
  if (cpuMask & X265_CPU_SSE4)
  {
 diff -r 5f7b3d06d94c -r 41425f18efe1 source/common/x86/ipfilter8.asm
 --- a/source/common/x86/ipfilter8.asm   Mon Oct 28 22:23:29 2013 +0800
 +++ b/source/common/x86/ipfilter8.asm   Mon Oct 28 22:23:54 2013 +0800
 @@ -774,3 +774,114 @@
  jnz .loopV

  RET
 +
 +

 +;-
 +; void interp_8tap_v_sp(int16_t *src, intptr_t srcStride, pixel *dst,
 intptr_t dstStride, int width, int height, const int coeffIdx);

 +;-
 +INIT_XMM ssse3
 +cglobal interp_8tap_v_sp, 4, 7, 8, 0-(2*4 + 3*gprsize)
 +%define old_r0  (rsp + 2 * 4 + 0 * gprsize)
 +%define old_r2  (rsp + 2 * 4 + 1 * gprsize)
 +%define old_r3  (rsp + 2 * 4 + 2 * gprsize)
 +%define old_r4d (rsp + 0 * 4)
 +%define old_6rows   (rsp + 1 * 4)
 +
 +mov r4d,r4m
 +mov r5d,r5m
 +
 +; load coeff table
 +mov r6d,r6m
 +shl r6, 6
 +lea r6, [tab_LumaCoeffV + r6]
 +
 +mov [old_r4d], r4d
 +mov [old_r2], r2
 +
 +; move to -3
 +lea r1, [r1 * 2]
 +lea r4, [r1 + r1 * 2]
 +sub r0, r4
 +lea r4, [r4 * 2]
 +mov [old_6rows], r4
 +
 +.loopH:
 +
 +; load width
 +mov r4d, [old_r4d]
 +
 +; save old src
 +mov [old_r0], r0
 +
 +.loopW:
 +
 +movu        m0, [r0]
 +movu        m1, [r0 + r1]
 +lea r0, [r0 + r1 * 2]
 +punpcklwd   m2, m0, m1
 +pmaddwd m2, [r6 + 0 * 16]
 +punpckhwd   m0, m1
 +pmaddwd m0, [r6 + 0 * 16]
 +
 +movu        m3, [r0]
 +movu        m4, [r0 + r1]
 +lea r0, [r0 + r1 * 2]
 +punpcklwd   m1, m3, m4
 +pmaddwd m1, [r6 + 1 * 16]
 +paddd   m2, m1
 +punpckhwd   m3, m4
 +pmaddwd m3, [r6 + 1 * 16]
 +paddd   m0, m3
 +
 +movu        m3, [r0]
 +movu        m4, [r0 + r1]
 +lea r0, [r0 + r1 * 2]
 +punpcklwd   m1, m3, m4
 +pmaddwd m1, [r6 + 2 * 16]
 +paddd   m2, m1
 +punpckhwd   m3, m4
 +pmaddwd m3, [r6 + 2 * 16]
 +paddd   m0, m3
 +
 +movu        m3, [r0]
 +movu        m4, [r0 + r1]
 +punpcklwd   m1, m3, m4
 +pmaddwd m1, [r6 + 3 * 16]
 +paddd   m2, m1
 +punpckhwd   m3, m4
 +pmaddwd m3, [r6 + 3 * 16]
 +paddd   m0, m3
 +
 +paddd   m2, [tab_c_526336]
 +paddd   m0, [tab_c_526336]
 +psrad   m2, 12
 +psrad   m0, 12
 +packssdw    m2, m0
 +packuswb    m2, m2
 +
 +; move to next 8 col
 +sub r0, [old_6rows]
 +
 +sub r4, 8
 +jl  .width4
 +movq        [r2], m2
 +je  .nextH
 +lea r0, [r0 + 16]
 +lea r2, [r2 + 8]
 +jmp .loopW
 +
 +.width4:
 +movd        [r2], m2
 +lea r0, [r0 + 4]
 +
 +.nextH:
 +; move to next row
 +mov r0, [old_r0]
 +lea r0, [r0 + r1]
 +add [old_r2], r3d
 +mov r2, [old_r2]
 +
 +dec r5d
 +jnz .loopH
 +
 +RET
 diff -r 

[x265] Fwd: [PATCH] check_IPFilterChroma_primitive, stride made equal to min width 2, fix for 2XN block

2013-10-17 Thread Praveen Tiwari

I tried using stride 64 for both the source and dest buffers, which is
perfectly reasonable, and the 2xN primitives failed their unit test which
tells me they need to be fixed prior to using them in the encoder.

 Sent patch for fix.
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Fwd: [PATCH] Added C primitive and unit test code for chroma filter

2013-10-15 Thread Praveen Tiwari
 +template<int N, int width>
 +void interp_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t
 dstStride, int height, int coeffIdx)
 +{
 +int cStride = 1;
 +short const * coeff= g_chromaFilter[coeffIdx];
 +src -= (N / 2 - 1) * cStride;
 +coeffIdx;
 +int offset;
 +short maxVal;
 +int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
 +offset =  (1 << (headRoom - 1));
 +maxVal = (1 << X265_DEPTH) - 1;
 +
 +int row, col;
 +for (row = 0; row < height; row++)
 +{
 +for (col = 0; col < width; col++)
 +{
 +int sum;
 +
 +sum  = src[col + 0 * cStride] * coeff[0];
 +sum += src[col + 1 * cStride] * coeff[1];
 +if (N >= 4)
 +{
 +sum += src[col + 2 * cStride] * coeff[2];
 +sum += src[col + 3 * cStride] * coeff[3];
 +}

the N >= 6 check seems out of place, unless we're going to instantiate a
7tap filter
Actually, I wanted to add a single C primitive for both chroma and luma, which
is why I did not change the check conditions: they will be required in the
luma functions.

 +if (N >= 6)
 +{
 +sum += src[col + 4 * cStride] * coeff[4];
 +sum += src[col + 5 * cStride] * coeff[5];
 +}
 +if (N == 8)
 +{
 +sum += src[col + 6 * cStride] * coeff[6];
 +sum += src[col + 7 * cStride] * coeff[7];
 +}
 +short val = (short)(sum + offset) >> headRoom;
 +
 +if (val < 0) val = 0;
 +if (val > maxVal) val = maxVal;
 +dst[col] = (pixel)val;
 +}
 +
 +src += srcStride;
 +dst += dstStride;
 +}
 +}
  }

  namespace x265 {
 diff -r 1087f1f3bf5a -r 39fc3c36e1b1 source/test/ipfilterharness.cpp
 --- a/source/test/ipfilterharness.cpp   Tue Oct 15 20:57:54 2013 +0530
 +++ b/source/test/ipfilterharness.cpp   Tue Oct 15 21:22:03 2013 +0530
 @@ -3,6 +3,7 @@
   *
   * Authors: Deepthi Devaki deepthidev...@multicorewareinc.com,
   *  Rajesh Paulraj raj...@multicorewareinc.com
 + *  Praveen  Kumar Tiwari prav...@multicorewareinc.com
   *
   * This program is free software; you can redistribute it and/or modify
   * it under the terms of the GNU General Public License as published by
 @@ -39,6 +40,18 @@
  ipfilterV_pp4
  };

 +const char* ChromaFilterPPNames[] =
 +{
 +"interp_4tap_horiz_pp_w2",
 +"interp_4tap_horiz_pp_w4",
 +"interp_4tap_horiz_pp_w6",
 +"interp_4tap_horiz_pp_w8",
 +"interp_4tap_horiz_pp_w12",
 +"interp_4tap_horiz_pp_w16",
 +"interp_4tap_horiz_pp_w24",
 +"interp_4tap_horiz_pp_w32"
 +};


the names should correspond with the chroma size enums, which only specify
a width. This string table should be re-usable for more than just 4tap
horizontal pixel to pixel interpolation.  Each element should just be "W2"
or something similar so it can be used as:

printf("chroma_hpp[%s]: ", ChromaFilterName[w]);
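
A small sketch of that suggestion follows. The enum and the chromaLabel helper are hypothetical (the actual chroma size enums are not shown in this thread); only the ChromaFilterName table and the label format come from the comment above. The point is that the table carries only the width, so the caller supplies the primitive name.

```cpp
#include <cassert>
#include <string>

// Hypothetical width enum; the real chroma size enums may be named differently.
enum ChromaWidth { CW_2, CW_4, CW_6, CW_8, CW_12, CW_16, CW_24, CW_32, NUM_CHROMA_WIDTHS };

// One short name per width, reusable by any width-indexed chroma primitive.
const char* const ChromaFilterName[NUM_CHROMA_WIDTHS] =
{
    "W2", "W4", "W6", "W8", "W12", "W16", "W24", "W32"
};

// Build a report label such as "chroma_hpp[W12]: " for any primitive name.
std::string chromaLabel(const char* primitive, int w)
{
    assert(w >= 0 && w < NUM_CHROMA_WIDTHS);
    return std::string(primitive) + "[" + ChromaFilterName[w] + "]: ";
}
```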


 +
  IPFilterHarness::IPFilterHarness()
  {
  ipf_t_size = 200 * 200;
 @@ -262,6 +275,47 @@
  return true;
  }

 +bool IPFilterHarness::check_IPFilter_primitive(filter_pp_t ref,
 filter_pp_t opt)


there needs to be chroma and luma versions of this function for the two
filter lengths, or pass filter length as an argument


 +{
 +int rand_height = rand() % 100; // Randomly generated
 Height


I don't see a point to testing any sizes not used by the encoder; this just
prevents possible optimizations in the primitive.  Primitives that have
fixed dimensions should be tested with those fixed dimensions used by the
encoder.


 +int rand_val, rand_srcStride, rand_dstStride, rand_coeffIdx;
 +
 +for (int i = 0; i <= 100; i++)
 +{
 +memset(IPF_vec_output_p, 0, ipf_t_size);  // Initialize
 output buffer to zero
 +memset(IPF_C_output_p, 0, ipf_t_size);// Initialize
 output buffer to zero


is memzero really necessary here? I don't think so

+
 +rand_coeffIdx = rand() % 8;// Random coeffIdex in
 the filter


chroma coeff index should be 1, 2, or 3

I think the chroma table is
const short g_chromaFilter[8][NTAPS_CHROMA] =
{
{  0, 64,  0,  0 },
{ -2, 58, 10, -2 },
{ -4, 54, 16, -2 },
{ -6, 46, 28, -4 },
{ -4, 36, 36, -4 },
{ -4, 28, 46, -6 },
{ -2, 16, 54, -4 },
{ -2, 10, 58, -2 }
};
  Our coeff table is laid out in a similar fashion, so I need coeffIdx values 0 to 7.
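
To illustrate how all eight rows of that table get used, here is a minimal 4-tap horizontal filter sketch built on it. It assumes 8-bit pixels (X265_DEPTH = 8) and IF_INTERNAL_PREC = 14, neither of which is stated in this thread, and the function name is hypothetical. It follows the structure of the C primitive quoted earlier, specialised to N = 4, except that the clamp to zero is done before the shift so that no negative value is ever shifted.

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel;             // assumption: 8-bit build
const int NTAPS_CHROMA = 4;
const int X265_DEPTH = 8;          // assumption
const int IF_INTERNAL_PREC = 14;   // assumption

const short g_chromaFilter[8][NTAPS_CHROMA] =
{
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

// Hypothetical 4-tap horizontal pixel-to-pixel filter for a fixed width.
template<int width>
void interp_4tap_horiz_pp_ref(const pixel* src, intptr_t srcStride,
                              pixel* dst, intptr_t dstStride,
                              int height, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    const short* coeff = g_chromaFilter[coeffIdx];
    const int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
    const int offset = 1 << (headRoom - 1);
    const int maxVal = (1 << X265_DEPTH) - 1;

    src -= NTAPS_CHROMA / 2 - 1;   // taps cover columns col-1 .. col+2

    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = 0;
            for (int t = 0; t < NTAPS_CHROMA; t++)
                sum += src[col + t] * coeff[t];

            int val = sum + offset;
            val = val < 0 ? 0 : val >> headRoom;   // clamp low, then scale down
            if (val > maxVal) val = maxVal;
            dst[col] = (pixel)val;
        }

        src += srcStride;
        dst += dstStride;
    }
}
```

Every row of the table sums to 64, so with headRoom = 6 a flat input comes back unchanged for any coeffIdx, and row 0 is a plain copy.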

+rand_val = rand() % 4; // Random offset in the
 filter


rand_val is unused


 +rand_srcStride = rand() % 100;  // Randomly generated
 srcStride
 +rand_dstStride = rand() % 100;  // Randomly generated
 dstStride
 +
 +if (rand_srcStride < 32)
 +rand_srcStride = 32;
 +
 +if (rand_dstStride < 32)
 +rand_dstStride = 32;
 +
 +opt(pixel_buff + 3 * rand_srcStride,
 +rand_srcStride,
 +

Re: [x265] [PATCH REVIEW Only ] chroma 4XN block, coeffIdex insted of coeff pointer

2013-10-11 Thread Praveen Tiwari
I just missed changing the line  mova coef2, [tab_coeff + 16] (I was only
testing coeffIdx 1). I will make it work for an arbitrary index, like
mova coef2, [tab_coeff + height * 16]. Please ignore this.

Regards,
Praveen


On Fri, Oct 11, 2013 at 10:20 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1381510220 -19800
 # Node ID 5a9160e8b0bdc3117c2417bc29453077488efd8e
 # Parent  c6d89dc62e191f56f63dbcb1781a6494da50a70d
 chroma 4XN block, coeffIdex insted of coeff pointer

 diff -r c6d89dc62e19 -r 5a9160e8b0bd source/common/x86/ipfilter8.asm
 --- a/source/common/x86/ipfilter8.asm   Fri Oct 11 01:47:53 2013 -0500
 +++ b/source/common/x86/ipfilter8.asm   Fri Oct 11 22:20:20 2013 +0530
 @@ -26,107 +26,58 @@
  %include "x86inc.asm"
  %include "x86util.asm"

 -%if ARCH_X86_64 == 0
 -
  SECTION_RODATA 32
 -tab_leftmask:   db -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0
 -
  tab_Tm: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
 -db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10

  tab_c_512:  times 8 dw 512

 +tab_coeff:db  0, 64, 0, 0, 0, 64, 0, 0, 0, 64, 0, 0, 0, 64, 0, 0
 +  db -2, 58, 10, -2, -2, 58, 10, -2, -2, 58, 10, -2, -2, 58,
 10, -2
 +  db -4, 54, 16, -2, -4, 54, 16, -2, -4, 54, 16, -2, -4, 54,
 16, -2
 +  db -6, 46, 28, -4, -6, 46, 28, -4, -6, 46, 28, -4, -6, 46,
 28, -4
 +  db -4, 36, 36, -4, -4, 36, 36, -4, -4, 36, 36, -4, -4, 36,
 36, -4
 +  db -4, 28, 46, -6, -4, 28, 46, -6, -4, 28, 46, -6, -4, 28,
 46, -6
 +  db -2, 16, 54, -4, -2, 16, 54, -4, -2, 16, 54, -4, -2, 16,
 54, -4
 +  db -2, 10, 58, -2, -2, 10, 58, -2, -2, 10, 58, -2, -2, 10,
 58, -2
 +
  SECTION .text

 -%macro FILTER_H4 3
 -movu        %1, [src + col - 1]
 -pshufb  %2, %1, Tm4
 +%macro FILTER_H4_w4 3
 +movu        %1, [srcq - 1]
 +pshufb  %2, %1, Tm0
  pmaddubsw   %2, coef2
 -pshufb  %1, %1, Tm5
 -pmaddubsw   %1, coef2
  phaddw  %2, %1
  pmulhrsw    %2, %3
  packuswb    %2, %2
  %endmacro

 +%macro FILTER_H4_w4_CALL 0
 +FILTER_H4_w4   x0, x1, x2
 +
 +movd        [dstq],  x1
 +
 +add srcq,srcstrideq
 +add dstq,dststrideq
 +%endmacro
 +

  
 ;-
 -; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst,
 intptr_t dstStride, int width, int height, short const *coeff)
 +; void interp_4tap_horiz_pp_w4(pixel *src, intptr_t srcStride, pixel
 *dst, intptr_t dstStride, int height, int coeffIdx)

  
 ;-
  INIT_XMM sse4
 -cglobal filterHorizontal_p_p_4, 0, 7, 8
 -%define src r0
 -%define dst r1
 -%define row r2
 -%define col r3
 -%define width   r4
 -%define widthleft   r5
 -%define mask_offset r6
 -%define coef2   m7
 -%define x3  m6
 -%define Tm5 m5
 -%define Tm4 m4
 -%define x2  m3
 -%define x1  m2
 -%define x0  m1
 -%define leftmask    m0
 -%define tmp r0
 -%define tmp1        r1
 -
 -mov tmp,r6m
 -movu        coef2,  [tmp]
 -packsswb    coef2,  coef2
 -pshufd  coef2,  coef2,  0
 +cglobal interp_4tap_horiz_pp_w4, 6, 6, 5, src, srcstride, dst, dststride,
 height, coeffIdx
 +%define coef2   m4
 +%define Tm0 m3
 +%define x2  m2
 +%define x1  m1
 +%define x0  m0

 -mova        x3,      [tab_c_512]
 +mova        coef2,   [tab_coeff + 16]
 +mova        x2,      [tab_c_512]
 +mova        Tm0,     [tab_Tm]

 -mov width,  r4m
 -mov widthleft,  width
 -and width,  ~7
 -and widthleft,  7
 -mov mask_offset,  widthleft
 -neg mask_offset
 +.loop
 +FILTER_H4_w4_CALL
 +dec  r4d
 +jnz .loop
 +RET

 -    movq        leftmask,   [tab_leftmask + (7 + mask_offset)]
 -    mova        Tm4,        [tab_Tm]
 -    mova        Tm5,        [tab_Tm + 16]
 -
 -    mov         src,        r0m
 -    mov         dst,        r2m
 -    mov         row,        r5m
 -
 -_loop_row:
 -    xor         col,        col
 -
 -_loop_col:
 -    FILTER_H4   x0, x1, x3
 -    movh        [dst + col], x1
 -
 -    add         col,        8
 -
 -    cmp         col,        width
 -    jl          _loop_col
 -
 -_end_col:
 -    test        widthleft,  widthleft
 -    jz          _next_row
 -
 -    movq        x2, [dst + col]
 -    FILTER_H4   x0, x1, x3
 -    pblendvb    x2, x2, x1, leftmask
 -    movh        [dst + col], x2
 -
 -_next_row:
 -    add         src,        r1m
 -    add         dst,        r3m
 -    dec         row
 -
 -    test        row,        row
 -    jz          _end_row
 -
 -    jmp         _loop_row
 -
 -_end_row

Re: [x265] [PATCH REVIEW Only ] chroma 4XN block, coeffIdex insted of coeff pointer

2013-10-11 Thread Praveen Tiwari
ohh... It will be  mova coef2, [tab_coeff + coeffIdx * 16].
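
To make the intent of the coeffIdx indexing concrete, here is a minimal scalar sketch in C++ of what the 4-wide kernel is meant to compute. It is not the x265 C reference: the names interp4TapHorizW4Ref and kChromaFilter are made up for illustration. It assumes 8-bit pixels, a source row readable from src[-1] to src[5], and an arithmetic right shift for negative ints (true on common compilers). Each 16-byte row of tab_coeff holds one 4-tap filter repeated four times, so selecting a row by coeffIdx corresponds to a byte offset of coeffIdx * 16. Note that an x86 effective address can only scale an index by 1, 2, 4 or 8, so in the asm the index would presumably have to be shifted left by 4 before the load.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// The eight 4-tap chroma filters; tab_coeff in the patch stores each of
// these repeated four times per 16-byte row.
static const int8_t kChromaFilter[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
};

// Hypothetical scalar reference for a 4-pixel-wide horizontal pass.
void interp4TapHorizW4Ref(const uint8_t* src, std::ptrdiff_t srcStride,
                          uint8_t* dst, std::ptrdiff_t dstStride,
                          int height, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    const int8_t* c = kChromaFilter[coeffIdx];  // asm: row at tab_coeff + coeffIdx * 16

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < 4; x++)
        {
            int sum = 0;
            for (int t = 0; t < 4; t++)
                sum += c[t] * src[x + t - 1];   // taps cover src[x-1] .. src[x+2]

            int val = (sum + 32) >> 6;          // same rounding as pmulhrsw by 512
            dst[x] = (uint8_t)std::min(255, std::max(0, val)); // packuswb-style clamp
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

With coeffIdx 0 the filter is {0, 64, 0, 0}, so the output equals the input, which is a handy sanity check when comparing against the asm.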


On Fri, Oct 11, 2013 at 11:21 PM, Praveen Tiwari 
prav...@multicorewareinc.com wrote:

 I forgot to change the line  mova coef2, [tab_coeff + 16]  (I was only
 testing with coeffIdx 1). I will make it general, like
 mova coef2, [tab_coeff + height * 16]. Please ignore this.

 Regards,
 Praveen


 On Fri, Oct 11, 2013 at 10:20 PM, prav...@multicorewareinc.com wrote:

 # HG changeset patch
 # User Praveen Tiwari
 # Date 1381510220 -19800
 # Node ID 5a9160e8b0bdc3117c2417bc29453077488efd8e
 # Parent  c6d89dc62e191f56f63dbcb1781a6494da50a70d
 chroma 4XN block, coeffIdex insted of coeff pointer

 diff -r c6d89dc62e19 -r 5a9160e8b0bd source/common/x86/ipfilter8.asm
 --- a/source/common/x86/ipfilter8.asm   Fri Oct 11 01:47:53 2013 -0500
 +++ b/source/common/x86/ipfilter8.asm   Fri Oct 11 22:20:20 2013 +0530
 @@ -26,107 +26,58 @@
  %include "x86inc.asm"
  %include "x86util.asm"

 -%if ARCH_X86_64 == 0
 -
  SECTION_RODATA 32
 -tab_leftmask:   db -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0
 -
  tab_Tm: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
 -db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10

  tab_c_512:  times 8 dw 512

 +tab_coeff:      db  0, 64,  0,  0,  0, 64,  0,  0,  0, 64,  0,  0,  0, 64,  0,  0
 +                db -2, 58, 10, -2, -2, 58, 10, -2, -2, 58, 10, -2, -2, 58, 10, -2
 +                db -4, 54, 16, -2, -4, 54, 16, -2, -4, 54, 16, -2, -4, 54, 16, -2
 +                db -6, 46, 28, -4, -6, 46, 28, -4, -6, 46, 28, -4, -6, 46, 28, -4
 +                db -4, 36, 36, -4, -4, 36, 36, -4, -4, 36, 36, -4, -4, 36, 36, -4
 +                db -4, 28, 46, -6, -4, 28, 46, -6, -4, 28, 46, -6, -4, 28, 46, -6
 +                db -2, 16, 54, -4, -2, 16, 54, -4, -2, 16, 54, -4, -2, 16, 54, -4
 +                db -2, 10, 58, -2, -2, 10, 58, -2, -2, 10, 58, -2, -2, 10, 58, -2
 +
  SECTION .text

 -%macro FILTER_H4 3
 -    movu        %1, [src + col - 1]
 -    pshufb      %2, %1, Tm4
 +%macro FILTER_H4_w4 3
 +    movu        %1, [srcq - 1]
 +    pshufb      %2, %1, Tm0
      pmaddubsw   %2, coef2
 -    pshufb      %1, %1, Tm5
 -    pmaddubsw   %1, coef2
      phaddw      %2, %1
      pmulhrsw    %2, %3
      packuswb    %2, %2
  %endmacro

 +%macro FILTER_H4_w4_CALL 0
 +    FILTER_H4_w4   x0, x1, x2
 +
 +    movd        [dstq],      x1
 +
 +    add         srcq,        srcstrideq
 +    add         dstq,        dststrideq
 +%endmacro
 +

  
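  ;-----------------------------------------------------------------------------
 -; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, short const *coeff)
 +; void interp_4tap_horiz_pp_w4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int height, int coeffIdx)
  ;-----------------------------------------------------------------------------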
  INIT_XMM sse4
 -cglobal filterHorizontal_p_p_4, 0, 7, 8
 -%define src         r0
 -%define dst         r1
 -%define row         r2
 -%define col         r3
 -%define width       r4
 -%define widthleft   r5
 -%define mask_offset r6
 -%define coef2       m7
 -%define x3          m6
 -%define Tm5         m5
 -%define Tm4         m4
 -%define x2          m3
 -%define x1          m2
 -%define x0          m1
 -%define leftmask    m0
 -%define tmp         r0
 -%define tmp1        r1
 -
 -    mov         tmp,        r6m
 -    movu        coef2,      [tmp]
 -    packsswb    coef2,      coef2
 -    pshufd      coef2,      coef2,      0
 +cglobal interp_4tap_horiz_pp_w4, 6, 6, 5, src, srcstride, dst, dststride, height, coeffIdx
 +%define coef2       m4
 +%define Tm0         m3
 +%define x2          m2
 +%define x1          m1
 +%define x0          m0

 -    mova        x3,         [tab_c_512]
 +    mova        coef2,       [tab_coeff + 16]
 +    mova        x2,          [tab_c_512]
 +    mova        Tm0,         [tab_Tm]

 -    mov         width,      r4m
 -    mov         widthleft,  width
 -    and         width,      ~7
 -    and         widthleft,  7
 -    mov         mask_offset,  widthleft
 -    neg         mask_offset
 +.loop
 +    FILTER_H4_w4_CALL
 +    dec         r4d
 +    jnz         .loop
 +    RET

 -    movq        leftmask,   [tab_leftmask + (7 + mask_offset)]
 -    mova        Tm4,        [tab_Tm]
 -    mova        Tm5,        [tab_Tm + 16]
 -
 -    mov         src,        r0m
 -    mov         dst,        r2m
 -    mov         row,        r5m
 -
 -_loop_row:
 -    xor         col,        col
 -
 -_loop_col:
 -    FILTER_H4   x0, x1, x3
 -    movh        [dst + col], x1
 -
 -    add         col,        8
 -
 -    cmp         col,        width
 -    jl          _loop_col
 -
 -_end_col:
 -    test        widthleft,  widthleft
 -    jz          _next_row
 -
 -    movq        x2, [dst + col]
 -    FILTER_H4   x0, x1, x3
 -    pblendvb    x2, x2, x1, leftmask
 -    movh        [dst + col], x2
 -
 -_next_row:
 -    add         src,        r1m
 -    add

[x265] Fwd: [PATCH] replace pixelsub_sp vector class function with intrinsic

2013-10-04 Thread Praveen Tiwari
for (int x = 0; x < bx; x += 16)
{
-Vec16uc word0, word1;
-Vec8s word3, word4;
-word0.load_a(src0 + x);
-word1.load_a(src1 + x);
-word3 = extend_low(word0) - extend_low(word1);
-word4 = extend_high(word0) - extend_high(word1);
-word3.store_a(dst + x);
-word4.store_a(dst + x + 8);
+__m128i word0, word1;
+__m128i word3, word4;
+__m128i mask = _mm_setzero_si128();
+
+word0 = _mm_load_si128((__m128i const*)(src0 + x));    // load 16 bytes from src1
+word1 = _mm_load_si128((__m128i const*)(src1 + x));    // load 16 bytes from src2

Please pay attention to the variable names when writing comments: they should
say src0 and src1, not src1 and src2.
+
+word3 = _mm_unpacklo_epi8(word0, mask);    // interleave with zero extensions
+word4 = _mm_unpacklo_epi8(word1, mask);
+_mm_store_si128((__m128i*)&dst[x], _mm_subs_epi16(word3, word4));    // store block into dst
+
+word3 = _mm_unpackhi_epi8(word0, mask);    // interleave with zero extensions
+word4 = _mm_unpackhi_epi8(word1, mask);
+_mm_store_si128((__m128i*)&dst[x + 8], _mm_subs_epi16(word3, word4));    // store block into dst
 }

I think we should also try to unroll the loop for multiples of 8; that may
give you some more performance gain.
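
For reference, the widening subtract discussed above can be sketched as a small standalone C++ function. This is only an illustration, not the code from the patch: the name pixelSubRow is hypothetical, it handles a single row whose length is a multiple of 16, and it uses unaligned loads and stores (the patch uses the aligned variants) so that it works on any buffer. On targets without SSE2 it falls back to a scalar loop with the same result.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: dst[x] = src0[x] - src1[x], widening 8-bit to 16-bit.
// count must be a multiple of 16.
void pixelSubRow(int16_t* dst, const uint8_t* src0, const uint8_t* src1, int count)
{
    assert(count % 16 == 0);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (int x = 0; x < count; x += 16)
    {
        __m128i a = _mm_loadu_si128((const __m128i*)(src0 + x)); // 16 bytes from src0
        __m128i b = _mm_loadu_si128((const __m128i*)(src1 + x)); // 16 bytes from src1

        // Interleaving with zero widens each unsigned byte to a 16-bit word.
        __m128i lo = _mm_sub_epi16(_mm_unpacklo_epi8(a, zero), _mm_unpacklo_epi8(b, zero));
        __m128i hi = _mm_sub_epi16(_mm_unpackhi_epi8(a, zero), _mm_unpackhi_epi8(b, zero));

        _mm_storeu_si128((__m128i*)&dst[x], lo);      // differences 0..7
        _mm_storeu_si128((__m128i*)&dst[x + 8], hi);  // differences 8..15
    }
#else
    // Scalar fallback with identical results.
    for (int x = 0; x < count; x++)
        dst[x] = (int16_t)(src0[x] - src1[x]);
#endif
}
```

Since both inputs are in 0..255, every difference fits in -255..255, so the plain and the saturating 16-bit subtract give the same values here.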

Regards,
Praveen
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] asm code for ipfilterH_pp, 4 tap filter

2013-09-28 Thread Praveen Tiwari
Suppose that during execution the width is less than 8, for example 5. Then
we want to run only the code section that handles the remaining width
(_end_col:), not the whole code (the multiple-of-8 part plus the
remaining-width part). Otherwise the filter is computed twice in this case,
corrupting some (8 - widthleft) old dst[] values that are used by the
'pblendvb' instruction. This is why we added the check. If width is always
>= 8, you are right, we don't need the check.
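
The situation can be sketched in scalar C++ as follows. This is only a model of the control flow, under the assumption that the asm column loop is bottom-tested (it runs its body once before comparing col with width); kernel8 and filterRow are hypothetical names, and kernel8 is a dummy stand-in for the 8-pixel filter. A top-tested loop runs zero times when the rounded-down width is 0, which is the effect the extra cmp/je gives the asm.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the 8-pixel SIMD kernel (FILTER_H4 in the asm).
// It always produces 8 outputs, like an 8-byte movh store.
static void kernel8(const uint8_t* src, uint8_t* out8)
{
    for (int i = 0; i < 8; i++)
        out8[i] = (uint8_t)(src[i] + 1);
}

// Processes one row and returns how many times the 8-wide kernel ran.
// src must be readable up to the next multiple of 8 past width.
int filterRow(const uint8_t* src, uint8_t* dst, int width)
{
    assert(width > 0);
    const int fullWidth = width & ~7;  // part handled 8 pixels at a time
    const int widthLeft = width & 7;   // 0..7 leftover pixels
    int kernelCalls = 0;
    int col = 0;

    // Top-tested loop: skipped entirely when width < 8.
    for (; col < fullWidth; col += 8)
    {
        kernel8(src + col, dst + col);
        kernelCalls++;
    }

    if (widthLeft)
    {
        uint8_t tmp[8];
        kernel8(src + col, tmp);
        kernelCalls++;
        // Blend: only the first widthLeft results replace dst, the rest of
        // dst keeps its old values (the role of pblendvb with leftmask).
        for (int i = 0; i < widthLeft; i++)
            dst[col + i] = tmp[i];
    }
    return kernelCalls;
}
```

For width 5 the kernel runs once and dst[5..7] keep their previous contents; without the guard, an unconditional first pass would have overwritten them before the blend could preserve them.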

Regards,
praveen


On Fri, Sep 27, 2013 at 9:05 PM, Jason Garrett-Glaser ja...@x264.com wrote:

  +_loop_row:
  +    xor         col,        col
  +    cmp         width,      0
  +    je          _end_col

 I don't understand this. Why do we have to do this check?

 Jason
