Re: [x265] [PATCH] encoder: Do not include CLL SEI message if empty
Hello Vittorio,

Sorry for the late reply; all of us were on leave for the Diwali festival in India. Thanks for the patch. We will run some basic tests and push it.

Regards,
Praveen

On Wed, Nov 7, 2018 at 12:35 AM Vittorio Giovara wrote:
> On Thu, Nov 1, 2018 at 5:34 PM Vittorio Giovara <vittorio.giov...@gmail.com> wrote:
>
>> Some devices render out-of-luminance pixels incorrectly otherwise.
>>
>> ---
>> source/encoder/encoder.cpp | 11 +++
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff -r fd517ae68f93 source/encoder/encoder.cpp
>> --- a/source/encoder/encoder.cpp Tue Sep 25 16:02:31 2018 +0530
>> +++ b/source/encoder/encoder.cpp Thu Nov 01 17:27:51 2018 -0400
>> @@ -2381,10 +2381,13 @@
>>
>>  if (m_param->bEmitHDRSEI)
>>  {
>> -    SEIContentLightLevel cllsei;
>> -    cllsei.max_content_light_level = m_param->maxCLL;
>> -    cllsei.max_pic_average_light_level = m_param->maxFALL;
>> -    cllsei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI, list, m_param->bSingleSeiNal);
>> +    if (m_emitCLLSEI)
>> +    {
>> +        SEIContentLightLevel cllsei;
>> +        cllsei.max_content_light_level = m_param->maxCLL;
>> +        cllsei.max_pic_average_light_level = m_param->maxFALL;
>> +        cllsei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI, list, m_param->bSingleSeiNal);
>> +    }
>>
>>      if (m_param->masteringDisplayColorVolume)
>>      {
>> --
>> Vittorio
>
> ping
> --
> Vittorio

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
Re: [x265] [PATCH] fix Issue #442: linking issue on non x86 platform
Thanks! I messed up the syntax. On Wed, Oct 31, 2018 at 5:45 PM Andrey Semashev wrote: > On 10/31/18 2:33 PM, prav...@multicorewareinc.com wrote: > > # HG changeset patch > > # User Praveen Tiwari > > # Date 1540983948 -19800 > > # Wed Oct 31 16:35:48 2018 +0530 > > # Node ID 1c878790edea64186edabcd40fb3df121f536311 > > # Parent fd517ae68f93dbfdd1bff45a9dd8e626523542b6 > > fix Issue #442: linking issue on non x86 platform > > > > diff -r fd517ae68f93 -r 1c878790edea source/common/cpu.cpp > > --- a/source/common/cpu.cpp Tue Sep 25 16:02:31 2018 +0530 > > +++ b/source/common/cpu.cpp Wed Oct 31 16:35:48 2018 +0530 > > @@ -127,6 +127,7 @@ > > { > > return(enable512); > > } > > + > > uint32_t cpu_detect(bool benableavx512 ) > > { > > > > diff -r fd517ae68f93 -r 1c878790edea source/common/quant.cpp > > --- a/source/common/quant.cpp Tue Sep 25 16:02:31 2018 +0530 > > +++ b/source/common/quant.cpp Wed Oct 31 16:35:48 2018 +0530 > > @@ -723,6 +723,7 @@ > > X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff > failure\n"); > > uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE); > > uint32_t blkPos = codeParams.scan[scanPosBase]; > > +#if X265_ARCH_X86 > > bool enable512 = detect512(); > > if (enable512) > > primitives.cu[log2TrSize - > 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded, > , , , blkPos); > > @@ -731,6 +732,10 @@ > > primitives.cu[log2TrSize - > 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, , > ,blkPos); > > primitives.cu[log2TrSize - > 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, > , , , blkPos); > > } > > +#elif > > #else? Everywhere else, too. 
> > > +primitives.cu[log2TrSize - > 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, , > , blkPos); > > +primitives.cu[log2TrSize - > 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, > , , , blkPos); > > +#endif > > } > > } > > else > > @@ -805,8 +810,8 @@ > > uint32_t blkPos = codeParams.scan[scanPosBase]; > > if (usePsyMask) > > { > > +#if X265_ARCH_X86 > > bool enable512 = detect512(); > > - > > if (enable512) > > primitives.cu[log2TrSize - > 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded, > , , , blkPos); > > else > > @@ -814,6 +819,10 @@ > > primitives.cu[log2TrSize - > 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, , > , blkPos); > > primitives.cu[log2TrSize - > 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, > , , , blkPos); > > } > > +#elif > > +primitives.cu[log2TrSize - > 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, , > , blkPos); > > +primitives.cu[log2TrSize - > 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, > , , , blkPos); > > +#endif > > blkPos = codeParams.scan[scanPosBase]; > > for (int y = 0; y < MLS_CG_SIZE; y++) > > { > > > > > > ___ > > x265-devel mailing list > > x265-devel@videolan.org > > https://mailman.videolan.org/listinfo/x265-devel > > > > ___ > x265-devel mailing list > x265-devel@videolan.org > https://mailman.videolan.org/listinfo/x265-devel > ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel
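A note on the review comment in this thread: `#elif` expects a condition, so the unconditional fallback branch should be written with `#else`. The following is only a minimal sketch of that pattern. `detect512Stub`, `QuantPath` and `chooseQuantPath` are hypothetical names made up for illustration and are not part of x265; only the `X265_ARCH_X86` macro name is taken from the quoted patch.

```cpp
// Hypothetical default so the sketch compiles on its own; in the real
// build this macro would come from the build system.
#ifndef X265_ARCH_X86
#define X265_ARCH_X86 0
#endif

// Hypothetical stand-in for the runtime AVX-512 check in the patch.
inline bool detect512Stub() { return false; }

// Hypothetical labels for the two code paths seen in the diff.
enum class QuantPath { Fused512, TwoPass };

inline QuantPath chooseQuantPath()
{
#if X265_ARCH_X86
    // x86 builds decide at run time.
    return detect512Stub() ? QuantPath::Fused512 : QuantPath::TwoPass;
#else
    // Unconditional fallback for every other architecture.
    // Writing "#elif" here without an expression is the syntax slip
    // pointed out in the review.
    return QuantPath::TwoPass;
#endif
}
```

With this shape the non-x86 build never references the x86-only detection routine, which is the goal of the fix discussed above.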
Re: [x265] Original C++ code used for sad functions' assembly code in COST_MV?
Hello Jeffrey,

You can find all C primitives in the source/common folder. The SAD C primitives are in source/common/pixel.cpp.

Thanks,
Praveen

On Wed, Sep 5, 2018 at 12:23 PM, Mario *LigH* Rohkrämer wrote:
> Jeffrey Chen wrote on 04.09.2018 at 23:57:
>
>> Hi, I would like to configure the sad function in COST_MV for another platform. However, the assembly code would not be supported on the other platform. Where can I find the original programming language code that was made into the assembly language code?
>
> Hi Jeffrey.
>
> I'm not a developer, just guessing:
>
> source/encoder/motion.cpp line 234 #defines a loop.
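For readers who only need the idea behind a SAD primitive, here is a minimal, portable sketch of a sum of absolute differences over a fixed-size block. It is an illustration written for this thread, not a copy of the code in source/common/pixel.cpp; the name `sadRef` and its signature are assumptions.

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative reference SAD: sums |pix1 - pix2| over a W x H block.
// stride1 and stride2 are the distances, in pixels, between rows.
template<int W, int H>
int sadRef(const std::uint8_t* pix1, std::intptr_t stride1,
           const std::uint8_t* pix2, std::intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            sum += std::abs(static_cast<int>(pix1[x]) - static_cast<int>(pix2[x]));

        pix1 += stride1;  // advance both blocks to the next row
        pix2 += stride2;
    }
    return sum;
}
```

A plain loop like this is what a platform without assembly support would fall back to; the block size is a template parameter so the compiler can unroll it.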
Re: [x265] Code performance issue
Hello Min,

Thanks for the suggestion. We will run some tests and let you know if any change is required here.

Regards,
Praveen Tiwari

On Sat, Jun 2, 2018 at 9:18 AM, chen wrote:
> There are a series of performance issues, such as:
>
> uint32_t sum = (uint32_t)pow((outOfBound >> 2), 2);
>
> Do you want to get the square of a small integer?
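To make the point of the quoted question concrete: squaring a small integer does not need the floating-point `pow` call and the round trip through `double`. A minimal sketch follows; `squareOfQuarter` is a hypothetical helper name, and only the expression `outOfBound >> 2` comes from the quoted line.

```cpp
#include <cstdint>

// Integer-only replacement for (uint32_t)pow((outOfBound >> 2), 2).
// Assumes the shifted value is small enough that its square fits in
// 32 bits; for larger inputs the product wraps modulo 2^32.
inline std::uint32_t squareOfQuarter(std::uint32_t outOfBound)
{
    const std::uint32_t q = outOfBound >> 2;
    return q * q;
}
```

Whether this matters for overall encoder speed depends on how hot the call site is, which is what the promised tests would show.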
Re: [x265] [PATCH] threadpool.cpp: use WIN system call for popcount
It is only counting cpusPerNode, so I assumed a 64-bit count was not required; but yes, I missed the fact that the instruction is supported only on some CPUs. A lookup-table-based implementation could have been the fastest thanks to better caching, but this code is not called frequently, so we can keep it as it is. Thanks.

On Thu, May 3, 2018 at 11:24 PM, Andrey Semashev <andrey.semas...@gmail.com> wrote:
> On Thu, May 3, 2018 at 7:37 PM, Pradeep Ramachandran <prad...@multicorewareinc.com> wrote:
> >
> > On Thu, May 3, 2018 at 2:23 PM, <prav...@multicorewareinc.com> wrote:
> >>
> >> # HG changeset patch
> >> # User Praveen Tiwari <prav...@multicorewareinc.com>
> >> # Date 1525328839 -19800
> >> # Thu May 03 11:57:19 2018 +0530
> >> # Branch stable
> >> # Node ID 9cbb2aadcca3a2f7a308ea1dc792fb817bcc5b51
> >> # Parent 69aafa6d70ad4e151f4590766c6b125621c5d007
> >> threadpool.cpp: use WIN system call for popcount
> >
> > Unless this fixes a known bug, I don't want to push this directly into stable. Syscalls are notorious especially when working with older versions of the OS.
> > I would rather push this into default and allow users to test that this works with all kinds of systems and then merge with stable once the answer is known.
> > Does this fix a specific issue on some platform, or improve performance?
>
> The comment is not quite right, __popcnt is not a syscall but an MSVC-specific intrinsic.
>
> https://msdn.microsoft.com/en-us/library/bb385231.aspx
>
> The equivalent gcc intrinsic is __builtin_popcount and friends.
>
> I think the patch is buggy because the relevant field is a 64-bit integer on 64-bit Windows and __popcnt is 32-bit.
>
> Note also that the popcount instruction is only available in the ABM ISA extension. In Intel CPUs it is available since Nehalem.
>
> > >> diff -r 69aafa6d70ad -r 9cbb2aadcca3 source/common/threadpool.cpp > >> --- a/source/common/threadpool.cpp Wed May 02 15:15:05 2018 +0530 > >> +++ b/source/common/threadpool.cpp Thu May 03 11:57:19 2018 +0530 > >> @@ -71,21 +71,6 @@ > >> # define strcasecmp _stricmp > >> #endif > >> > >> -#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 > >> -const uint64_t m1 = 0x; //binary: 0101... > >> -const uint64_t m2 = 0x; //binary: 00110011.. > >> -const uint64_t m3 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ... > >> -const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power > of > >> 0,1,2,3... > >> - > >> -static int popCount(uint64_t x) > >> -{ > >> -x -= (x >> 1) & m1; > >> -x = (x & m2) + ((x >> 2) & m2); > >> -x = (x + (x >> 4)) & m3; > >> -return (x * h01) >> 56; > >> -} > >> -#endif > >> - > >> namespace X265_NS { > >> // x265 private namespace > >> > >> @@ -274,7 +259,7 @@ > >> for (int i = 0; i < numNumaNodes; i++) > >> { > >> GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer); > >> -cpusPerNode[i] = popCount(groupAffinityPointer->Mask); > >> +cpusPerNode[i] = __popcnt(static_cast >> int>(groupAffinityPointer->Mask)); > >> } > >> delete groupAffinityPointer; > >> #elif HAVE_LIBNUMA > >> @@ -623,7 +608,7 @@ > >> for (int i = 0; i < numNumaNodes; i++) > >> { > >> GetNumaNodeProcessorMaskEx((UCHAR)i, ); > >> -cpus += popCount(groupAffinity.Mask); > >> +cpus += __popcnt(static_cast int>(groupAffinity.Mask)); > >> } > >> return cpus; > >> #elif _WIN32 > >> ___ > >> x265-devel mailing list > >> x265-devel@videolan.org > >> https://mailman.videolan.org/listinfo/x265-devel > > > > > > > > ___ > > x265-devel mailing list > > x265-devel@videolan.org > > https://mailman.videolan.org/listinfo/x265-devel > > > ___ > x265-devel mailing list > x265-devel@videolan.org > https://mailman.videolan.org/listinfo/x265-devel > ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel
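The 64-bit concern raised in this thread can be illustrated with a small portable sketch. `popCount64` below is the well-known bit-twiddling population count, in the same shape as the function the patch removes, with the mask constants written out in full. `popCountTruncated32` is a hypothetical helper that mimics casting the mask to 32 bits before counting, as the patch does. Neither function is claimed to be the final x265 code.

```cpp
#include <cstdint>

// Portable 64-bit population count; needs no special CPU instruction.
inline int popCount64(std::uint64_t x)
{
    const std::uint64_t m1  = 0x5555555555555555ULL; // binary: 0101...
    const std::uint64_t m2  = 0x3333333333333333ULL; // binary: 00110011...
    const std::uint64_t m4  = 0x0f0f0f0f0f0f0f0fULL; // binary: 4 zeros, 4 ones...
    const std::uint64_t h01 = 0x0101010101010101ULL; // sum of 256^0, 256^1, ...

    x -= (x >> 1) & m1;               // count of each 2 bits
    x = (x & m2) + ((x >> 2) & m2);   // count of each 4 bits
    x = (x + (x >> 4)) & m4;          // count of each 8 bits
    return static_cast<int>((x * h01) >> 56); // add up the 8 byte counts
}

// Hypothetical illustration of the truncation issue: only the low
// 32 bits of the mask are counted.
inline int popCountTruncated32(std::uint64_t mask)
{
    return popCount64(static_cast<std::uint32_t>(mask));
}
```

For an affinity mask whose set bits all sit in the upper half, such as 0xFFFFFFFF00000000, the truncated version reports 0 CPUs while the full version reports 32. If an intrinsic is preferred, the 64-bit variants (`__popcnt64` on 64-bit MSVC, `__builtin_popcountll` on GCC and Clang) avoid the truncation, subject to the CPU support caveat mentioned above.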
Re: [x265] [PATCH 000 of 307 ] AVX-512 implementation in x265: breaks 32-bit compilation
Thanks for reporting, we are looking at the issue, will send a fix soon. Regards, Praveen Tiwari On Thu, Apr 12, 2018 at 2:31 AM, Mario Rohkrämer <cont...@ligh.de> wrote: > Am 07.04.2018, 04:29 Uhr, schrieb <mythr...@multicorewareinc.com>: > > This series of patches enables AVX-512 in x265. USe CLI option --asm >> avx512 to enable AVX-512 kernels. >> ___ >> x265-devel mailing list >> x265-devel@videolan.org >> https://mailman.videolan.org/listinfo/x265-devel >> > > > Compiling x265 for Win32 target (here in MSYS2/MinGW32) is not possible > anymore. > > Assembler code was still available for 8-bit depth core, at least. But: > > + > [ 13%] Building ASM_NASM object common/CMakeFiles/common.dir/x > 86/pixel-util8.asm.obj > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1867: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1880: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1880: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1880: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1880: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1941: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > 
H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode 
and operands > H:/development/media-autobuild_suite-master/build/x265-hg/ > source/common/x86/pixel-util8.asm:1954: error: invalid combination of > opcode and operands > make[2]: *** [common/CMakeFiles/common.dir/build.make:159: > common/CMakeFiles/common.dir/x86/pixel-util8.asm.obj] Error 1 > make[1]: *** [CMakeFiles/Makefile2:449: common/CMakeFiles/common.dir/all] > Error 2 > make: *** [Makefile:130: all] Error 2 > + > > Trying to compile AVX-512 instructions may have to be avoided in 32-bit > architecture mode (because there is surely no 32-bit only CPU supporting > this instruction set extension). > > -- > > Fun and success! &g
Re: [x265] [PATCH 000 of 307 ] AVX-512 implementation in x265
Your request is noted; we will share the performance details soon. Thanks.

Regards,
Praveen Tiwari

On Fri, Apr 6, 2018 at 9:36 PM, Vittorio Giovara <vittorio.giov...@gmail.com> wrote:
> just curious, what kind of general speed improvement does this give?
> I could have missed them in the series, but it would be nice to have some sort of benchmarks
> thanks
> Vittorio
>
> On Sat, Apr 7, 2018 at 4:29 AM, <mythr...@multicorewareinc.com> wrote:
>
>> This series of patches enables AVX-512 in x265. Use CLI option --asm avx512 to enable AVX-512 kernels.
Re: [x265] [PATCH] quant.cpp: 'rdoQuant_c' primitive for SIMD optimization
Please ignore this patch; I missed an update. I will resend it soon.

Thanks

On Mon, Nov 27, 2017 at 5:11 PM, <prav...@multicorewareinc.com> wrote:
> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1511167656 -19800
> # Mon Nov 20 14:17:36 2017 +0530
> # Node ID dffb056e5ad0e2298b0dd65d048f4f16d8508566
> # Parent b24454f3ff6de650aab6835e291837fc4e2a4466
> quant.cpp: 'rdoQuant_c' primitive for SIMD optimization
>
> This particular section of code appears to be a bottleneck in many profiles, as it involves 64-bit multiplication operations. For SIMD optimization we need to convert a few buffers/variables to double.
>
> diff -r b24454f3ff6d -r dffb056e5ad0 source/common/dct.cpp
> --- a/source/common/dct.cpp Wed Nov 22 22:00:48 2017 +0530
> +++ b/source/common/dct.cpp Mon Nov 20 14:17:36 2017 +0530
> @@ -984,6 +984,32 @@
>      return (sum & 0x00FF) + (c1 << 26) + (firstC2Idx << 28);
>  }
>
> +void rdoQuant_c(int16_t* m_resiDctCoeff, int16_t* m_fencDctCoeff, double* costUncoded, double* totalUncodedCost, double* totalRdCost, int64_t psyScale, uint32_t blkPos, uint32_t log2TrSize)
> +{
> +    const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
> +    const int scaleBits = SCALE_BITS - 2 * transformShift;
> +    const uint32_t trSize = 1 << log2TrSize;
> +    int max = X265_MAX(0, (2 * transformShift + 1));
> +
> +    for (int y = 0; y < MLS_CG_SIZE; y++)
> +    {
> +        for (int x = 0; x < MLS_CG_SIZE; x++)
> +        {
> +            int64_t signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */
> +            int64_t predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
> +
> +            costUncoded[blkPos + x] = static_cast<double>((signCoef * signCoef) << scaleBits);
> +
> +            /* when no residual coefficient is coded, predicted coef == recon coef */
> +            costUncoded[blkPos + x] -= static_cast<double>((psyScale * (predictedCoef)) >> max);
> +
> +            *totalUncodedCost +=
costUncoded[blkPos + x]; > +*totalRdCost += costUncoded[blkPos + x]; > +} > +blkPos += trSize; > +} > +} > + > namespace X265_NS { > // x265 private namespace > > @@ -993,6 +1019,7 @@ > p.dequant_normal = dequant_normal_c; > p.quant = quant_c; > p.nquant = nquant_c; > +p.rdoQuant = rdoQuant_c; > p.dst4x4 = dst4_c; > p.cu[BLOCK_4x4].dct = dct4_c; > p.cu[BLOCK_8x8].dct = dct8_c; > diff -r b24454f3ff6d -r dffb056e5ad0 source/common/primitives.h > --- a/source/common/primitives.hWed Nov 22 22:00:48 2017 +0530 > +++ b/source/common/primitives.hMon Nov 20 14:17:36 2017 +0530 > @@ -216,6 +216,7 @@ > > typedef void (*integralv_t)(uint32_t *sum, intptr_t stride); > typedef void (*integralh_t)(uint32_t *sum, pixel *pix, intptr_t stride); > +typedef void (*rdoQuant_t)(int16_t* m_resiDctCoeff, int16_t* > m_fencDctCoeff, double* costUncoded, double* totalUncodedCost, double* > totalRdCost, int64_t psyScale, uint32_t blkPos, uint32_t log2TrSize); > > /* Function pointers to optimized encoder primitives. 
Each pointer can > reference > * either an assembly routine, a SIMD intrinsic primitive, or a C > function */ > @@ -304,6 +305,7 @@ > > quant_t quant; > nquant_t nquant; > +rdoQuant_trdoQuant; > dequant_scaling_t dequant_scaling; > dequant_normal_t dequant_normal; > denoiseDct_t denoiseDct; > diff -r b24454f3ff6d -r dffb056e5ad0 source/common/quant.cpp > --- a/source/common/quant.cpp Wed Nov 22 22:00:48 2017 +0530 > +++ b/source/common/quant.cpp Mon Nov 20 14:17:36 2017 +0530 > @@ -663,7 +663,7 @@ > #define PSYVALUE(rec) ((psyScale * (rec)) >> X265_MAX(0, (2 * > transformShift + 1))) > > int64_t costCoeff[trSize * trSize]; /* d*d + lambda * bits */ > -int64_t costUncoded[trSize * trSize]; /* d*d + lambda * 0*/ > +double costUncoded[trSize * trSize]; /* d*d + lambda * 0*/ > int64_t costSig[trSize * trSize]; /* lambda * bits */ > > int rateIncUp[trSize * trSize]; /* signal overhead of increasing > level */ > @@ -677,12 +677,12 @@ > bool bIsLuma = ttype == TEXT_LUMA; > > /* total rate distortion cost of transform block, as CBF=0 */ > -int64_t totalUncodedCost = 0; > +double totalUncodedCost = 0; > > /* Total rate distortion cost of this transform block, counting te > di
Re: [x265] [PATCH 2 of 2] x86: Change assembler from YASM to NASM
Yes, that's true. Looking at future prospects, we have decided to move assembler support to NASM. It comes with additional advantages, as Andrey mentioned above. We understand the concern about changing the assembler, and we will make the transition as smooth as possible. Thanks.

Regards,
Praveen Tiwari
[x265] Fwd: [PATCH] intra: sse4 version of strong intra smoothing
-- Forwarded message --
From: chen
Date: Tue, Nov 21, 2017 at 10:07 AM
Subject: Re: [x265] [PATCH] intra: sse4 version of strong intra smoothing
To: Development for x265

>diff -r a7c2f80c18af -r 973560d58dfb source/common/x86/intrapred8.asm
>--- a/source/common/x86/intrapred8.asm Mon Nov 20 14:31:22 2017 +0530
>+++ b/source/common/x86/intrapred8.asm Tue Nov 21 03:10:14 2017 +0800
>@@ -22313,11 +22313,144 @@
> mov [r1 + 64], r3b ; LeftLast
> RET
>
>-INIT_XMM sse4
>-cglobal intra_filter_32x32, 2,4,6
>-mov r2b, byte [r0 + 64]; topLast
>-mov r3b, byte [r0 + 128]; LeftLast
>-
>+; this function add strong intra filter
>+ INIT_XMM sse4
>+cglobal intra_filter_32x32, 3,8,7
>+xor r3d, r3d ; R9
>+xor r4d, r4d ; R10
>+mov r3b, byte [r0 + 64] ; topLast
>+mov r4b, byte [r0 + 128] ; LeftLast

xor + mov is equivalent to movzx. The xor (clear to zero) does not cost a cycle, but it does affect the instruction decode rate.

>+
>+; strong intra filter is diabled
>+cmp r2m, byte 0
>+jz .normal_filter32
>+; decide to do strong intra filter
>+xor r5d, r5d ; R11
>+xor r6d, r6d ; RAX
>+xor r7d, r7d ; RDI
>+mov r5b, byte [r0] ; topLeft
>+mov r6b, byte [r0 + 96] ; leftMiddle
>+mov r7b, byte [r0 + 32] ; topMiddle
>+
>+; threshold = 8
>+mov r2d, r3d ; R8
>+add r2d, r5d ; (topLast + topLeft)
>+shl r7d, 1 ; 2 * topMiddle
>+sub r2d, r7d

(A+B) - 2 * C <==> (A-C) + (B-C)

>+mov r7d, r2d ; backup r2d
>+sar r7d, 31
>+xor r2d, r7d
>+sub r2d, r7d ; abs(r2d)
>+cmp r2d, 8

How about the following, or the cdq instruction?

; abs(x-y)
mov eax, X
sub eax, Y
sub Y, X
cmovg eax, Y

>+; bilinearAbove is false
>+jns .normal_filter32
>+
>+mov r2d, r5d
>+add r2d, r4d
>+shl r6d, 1
>+sub r2d, r6d
>+mov r6d, r2d
>+sar r6d, 31
>+xor r2d, r6d
>+sub r2d, r6d
>+cmp r2d, 8
>+; bilinearLeft is false
>+jns .normal_filter32
>+
>+; do strong intra filter shift = 6
>+mov r2d, r5d
>+shl r2d, 6
>+add r2d, 32 ; init
>+mov r6d, r4d
>+sub r6w, r5w ; deltaL size is word

A partial-register stall may occur here.

>+mov r7d, r3d
>+sub r7w, r5w ; deltaR size is word
>+movd xmm0, r2d
>+ vpbroadcastw xmm0, xmm0

SSE4? This is an AVX2 instruction, so the initialization at the top is wrong. Also, we generally don't prefix registers with xmm or ymm; for the native version, m0, m1 would be better.

>+mova xmm4, xmm0
>+
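For readers following the review, the sar/xor/sub sequence in the quoted assembly is the usual branchless absolute value, and the comparison against 8 is the flatness test that gates the strong filter. A minimal C++ sketch of the same arithmetic is given below. The function names are hypothetical, and the sketch assumes a 32-bit `int` with an arithmetic right shift for negative values, which mainstream compilers provide.

```cpp
#include <cstdint>

// Branchless abs, mirroring "mov tmp, v / sar tmp, 31 / xor v, tmp / sub v, tmp".
// mask is 0 for non-negative v and -1 (all ones) for negative v.
// Not valid for INT32_MIN; the pixel sums used here are far from that range.
inline std::int32_t absBranchless(std::int32_t v)
{
    const std::int32_t mask = v >> 31;
    return (v ^ mask) - mask;
}

// Flatness test from the quoted code: |first + last - 2 * middle| < threshold.
// The same value can be computed as (first - middle) + (last - middle),
// which is the identity noted in the review.
inline bool isFlatEdge(int first, int last, int middle, int threshold)
{
    return absBranchless(first + last - 2 * middle) < threshold;
}
```

In the quoted patch the threshold is 8 and the strong filter is used only when both the top and the left edge pass this test.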
[x265] Fwd: [PATCH 3 of 3] SEA motion search:integralv functions avx2 implementation
-- Forwarded message -- From:Date: Tue, May 2, 2017 at 3:16 PM Subject: [x265] [PATCH 3 of 3] SEA motion search:integralv functions avx2 implementation To: x265-devel@videolan.org # HG changeset patch # User Vignesh Vijayakumar # Date 1493121121 -19800 # Tue Apr 25 17:22:01 2017 +0530 # Node ID e5ee88d08fcedee83efa63869a5a346c711a0e3d # Parent 1afc127e62b4502c8f052ee989843c64b45ffc56 SEA motion search:integralv functions avx2 implementation diff -r 1afc127e62b4 -r e5ee88d08fce source/common/CMakeLists.txt --- a/source/common/CMakeLists.txt Fri Apr 28 11:22:29 2017 +0530 +++ b/source/common/CMakeLists.txt Tue Apr 25 17:22:01 2017 +0530 @@ -57,10 +57,10 @@ set(VEC_PRIMITIVES vec/vec-primitives.cpp ${PRIMITIVES}) source_group(Intrinsics FILES ${VEC_PRIMITIVES}) -set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h) +set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h seaintegral.h) set(A_SRCS pixel-a.asm const-a.asm cpu-a.asm ssd-a.asm mc-a.asm mc-a2.asm pixel-util8.asm blockcopy8.asm - pixeladd8.asm dct8.asm) + pixeladd8.asm dct8.asm seaintegral.asm) if(HIGH_BIT_DEPTH) set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm loopfilter.asm) else() diff -r 1afc127e62b4 -r e5ee88d08fce source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Fri Apr 28 11:22:29 2017 +0530 +++ b/source/common/x86/asm-primitives.cpp Tue Apr 25 17:22:01 2017 +0530 @@ -2158,6 +2158,13 @@ p.fix8Unpack = PFX(cutree_fix8_unpack_avx2); p.fix8Pack = PFX(cutree_fix8_pack_avx2); +p.integral_init4v = PFX(integral4v_avx2); +p.integral_init8v = PFX(integral8v_avx2); +p.integral_init12v = PFX(integral12v_avx2); +p.integral_init16v = PFX(integral16v_avx2); +p.integral_init24v = PFX(integral24v_avx2); +p.integral_init32v = PFX(integral32v_avx2); + /* TODO: This kernel needs to be modified to work with HIGH_BIT_DEPTH only p.planeClipAndMax = PFX(planeClipAndMax_avx2); */ @@ -2178,6 +2185,7 @@ 
p.costCoeffNxN = PFX(costCoeffNxN_avx2_bmi2); } } + } #else // if HIGH_BIT_DEPTH @@ -3696,6 +3704,13 @@ p.fix8Unpack = PFX(cutree_fix8_unpack_avx2); p.fix8Pack = PFX(cutree_fix8_pack_avx2); +p.integral_init4v = PFX(integral4v_avx2); +p.integral_init8v = PFX(integral8v_avx2); +p.integral_init12v = PFX(integral12v_avx2); +p.integral_init16v = PFX(integral16v_avx2); +p.integral_init24v = PFX(integral24v_avx2); +p.integral_init32v = PFX(integral32v_avx2); + } #endif } diff -r 1afc127e62b4 -r e5ee88d08fce source/common/x86/seaintegral.asm --- /dev/null Thu Jan 01 00:00:00 1970 + +++ b/source/common/x86/seaintegral.asm Tue Apr 25 17:22:01 2017 +0530 @@ -0,0 +1,155 @@ +;** *** +;* Copyright (C) 2013-2017 MulticoreWare, Inc +;* +;* Authors: Jayashri Murugan +;* Vignesh V Menon +;* +;* This program is free software; you can redistribute it and/or modify +;* it under the terms of the GNU General Public License as published by +;* the Free Software Foundation; either version 2 of the License, or +;* (at your option) any later version. +;* +;* This program is distributed in the hope that it will be useful, +;* but WITHOUT ANY WARRANTY; without even the implied warranty of +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;* GNU General Public License for more details. +;* +;* You should have received a copy of the GNU General Public License +;* along with this program; if not, write to the Free Software +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. +;* +;* This program is also available under a commercial proprietary license. +;* For more information, contact us at license @ x265.com. +;** ***/ + +%include "x86inc.asm" +%include "x86util.asm" + +SECTION .text + +;-- --- +;void integral_init4v_c(uint32_t *sum4, intptr_t stride) +;-- --- +INIT_YMM avx2 +cglobal integral4v, 2, 4, 2 + +mov r2, 0 xor will be faster method of clearing a register. 
+mov r3, r1

What are the possible values of stride here? Is it an arbitrary number or a multiple of a specific number?

+shl r3, 4
+
+.loop:
+movu m0, [r0]
+movu m1, [r0 + r3]
+psubd m0, m1, m0
+movu [r0], m0
+add r2, 8
+add r0, 32
+cmp r2, r1
+jl .loop
+RET
+
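As a reading aid for the quoted loop, here is a scalar C++ sketch of what the AVX2 kernel appears to compute, derived only from the instructions shown above: each 32-bit entry in the current row is replaced by the entry four rows below minus itself. The name `integral4vRef` is hypothetical, and this is not claimed to be the project's C reference.

```cpp
#include <cstdint>

// Scalar equivalent of the quoted loop for the 4-row vertical case.
// stride is the row length in 32-bit elements. The shift by 4 in the
// assembly corresponds to 4 * stride elements of 4 bytes each.
inline void integral4vRef(std::uint32_t* sum, std::intptr_t stride)
{
    for (std::intptr_t x = 0; x < stride; x++)
        sum[x] = sum[x + 4 * stride] - sum[x];
}
```

The assembly handles 8 elements per iteration, so it effectively rounds the stride up to a multiple of 8. That is why the reviewer's question about possible stride values matters: with a stride that is not a multiple of 8, the vector loop would touch elements past the end of the row.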
[x265] Fwd: [PATCH 2 of 3] SEA motion search:Add testbench for integralv functions
-- Forwarded message -- From:Date: 2017-05-02 15:16 GMT+05:30 Subject: [x265] [PATCH 2 of 3] SEA motion search:Add testbench for integralv functions To: x265-devel@videolan.org # HG changeset patch # User Vignesh Vijayakumar # Date 1493358749 -19800 # Fri Apr 28 11:22:29 2017 +0530 # Node ID 1afc127e62b4502c8f052ee989843c64b45ffc56 # Parent cb67dffd0e2a596c8d3c6d042b8e6c532487d427 SEA motion search:Add testbench for integralv functions diff -r cb67dffd0e2a -r 1afc127e62b4 source/test/pixelharness.cpp --- a/source/test/pixelharness.cpp Tue May 02 09:58:13 2017 +0530 +++ b/source/test/pixelharness.cpp Fri Apr 28 11:22:29 2017 +0530 @@ -2003,6 +2003,228 @@ return true; } +bool PixelHarness::check_integral_init4v(integral4v_t ref, integral4v_t opt) +{ +intptr_t srcStep = 64; +int j = 0; >> +uint32_t sum_ans[BUFFSIZE] = { 0 }; >> +uint32_t sum_ans1[BUFFSIZE] = { 0 }; Better names please, check existing naming conventions. + +for (int i = 0; i < 64; i++) +{ +sum_ans[i] = pixel_test_buff[0][i]; +sum_ans1[i] = pixel_test_buff[0][i]; +} +for (int i = 0, k = 0; i < BUFFSIZE; i++) +{ +if (i % 64 == 0) +k++; +sum_ans[i] = sum_ans[i % 64] + k; +sum_ans1[i] = sum_ans1[i % 64] + k; +} +int padx = 4; +int pady = 4; +uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx; +uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx; +for (int i = 0; i < ITERS; i++) +{ +ref(sum_ans_ptr, srcStep); +checked(opt, sum_ans1_ptr, srcStep); + +if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * BUFFSIZE)) +return false; + +reportfail() +j += INCR; +} +return true; +} + +bool PixelHarness::check_integral_init8v(integral8v_t ref, integral8v_t opt) + { +intptr_t srcStep = 64; +int j = 0; +uint32_t sum_ans[BUFFSIZE] = { 0 }; +uint32_t sum_ans1[BUFFSIZE] = { 0 }; + +for (int i = 0; i < 64; i++) +{ +sum_ans[i] = pixel_test_buff[0][i]; +sum_ans1[i] = pixel_test_buff[0][i]; +} +for (int i = 0, k = 0; i < BUFFSIZE; i++) +{ +if (i % 64 == 0) +k++; +sum_ans[i] = sum_ans[i % 64] + k; +sum_ans1[i] = 
sum_ans1[i % 64] + k; +} +int padx = 4; +int pady = 4; +uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx; +uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx; +for (int i = 0; i < ITERS; i++) +{ +ref(sum_ans_ptr, srcStep); +checked(opt, sum_ans1_ptr, srcStep); + +if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * BUFFSIZE)) +return false; + +reportfail() +j += INCR; +} +return true; +} + +bool PixelHarness::check_integral_init12v(integral12v_t ref, integral12v_t opt) + { +intptr_t srcStep = 64; +int j = 0; +uint32_t sum_ans[BUFFSIZE] = { 0 }; +uint32_t sum_ans1[BUFFSIZE] = { 0 }; + +for (int i = 0; i < 64; i++) +{ +sum_ans[i] = pixel_test_buff[0][i]; +sum_ans1[i] = pixel_test_buff[0][i]; +} +for (int i = 0, k = 0; i < BUFFSIZE; i++) +{ +if (i % 64 == 0) +k++; +sum_ans[i] = sum_ans[i % 64] + k; +sum_ans1[i] = sum_ans1[i % 64] + k; +} +int padx = 4; +int pady = 4; +uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx; +uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx; +for (int i = 0; i < ITERS; i++) +{ +ref(sum_ans_ptr, srcStep); +checked(opt, sum_ans1_ptr, srcStep); + +if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * BUFFSIZE)) +return false; + +reportfail() +j += INCR; +} +return true; +} + +bool PixelHarness::check_integral_init16v(integral16v_t ref, integral16v_t opt) +{ +intptr_t srcStep = 64; +int j = 0; +uint32_t sum_ans[BUFFSIZE] = { 0 }; +uint32_t sum_ans1[BUFFSIZE] = { 0 }; + +for (int i = 0; i < 64; i++) +{ +sum_ans[i] = pixel_test_buff[0][i]; +sum_ans1[i] = pixel_test_buff[0][i]; +} +for (int i = 0, k = 0; i < BUFFSIZE; i++) +{ +if (i % 64 == 0) +k++; +sum_ans[i] = sum_ans[i % 64] + k; +sum_ans1[i] = sum_ans1[i % 64] + k; +} +int padx = 4; +int pady = 4; +uint32_t *sum_ans_ptr = sum_ans + srcStep * pady + padx; +uint32_t *sum_ans1_ptr = sum_ans1 + srcStep * pady + padx; +for (int i = 0; i < ITERS; i++) +{ +ref(sum_ans_ptr, srcStep); +checked(opt, sum_ans1_ptr, srcStep); + +if (memcmp(sum_ans, sum_ans1, sizeof(uint32_t) * 
BUFFSIZE)) +return false; + +reportfail() +j += INCR; +} +return true; +} + +bool PixelHarness::check_integral_init24v(integral24v_t ref, integral24v_t opt) +{ +intptr_t srcStep = 64; +int j = 0; +uint32_t
[x265] Fwd: [PATCH 1 of 3] SEA motion search:Setup asm primitives for integral calculation
-- Forwarded message -- From:Date: Tue, May 2, 2017 at 3:16 PM Subject: [x265] [PATCH 1 of 3] SEA motion search:Setup asm primitives for integral calculation To: x265-devel@videolan.org # HG changeset patch # User Vignesh Vijayakumar # Date 1493699293 -19800 # Tue May 02 09:58:13 2017 +0530 # Node ID cb67dffd0e2a596c8d3c6d042b8e6c532487d427 # Parent 5bc5e73760cdb61d2674e74cc52149fa0603af8a SEA motion search:Setup asm primitives for integral calculation diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/primitives.cpp --- a/source/common/primitives.cpp Sat Apr 22 17:00:28 2017 -0700 +++ b/source/common/primitives.cpp Tue May 02 09:58:13 2017 +0530 @@ -57,6 +57,7 @@ void setupIntraPrimitives_c(EncoderPrimitives ); void setupLoopFilterPrimitives_c(EncoderPrimitives ); void setupSaoPrimitives_c(EncoderPrimitives ); +void setupSeaIntegralPrimitives_c(EncoderPrimitives ); void setupCPrimitives(EncoderPrimitives ) { @@ -66,6 +67,7 @@ setupIntraPrimitives_c(p); // intrapred.cpp setupLoopFilterPrimitives_c(p); // loopfilter.cpp setupSaoPrimitives_c(p);// sao.cpp +setupSeaIntegralPrimitives_c(p); // framefilter.cpp } void setupAliasPrimitives(EncoderPrimitives ) diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/primitives.h --- a/source/common/primitives.hSat Apr 22 17:00:28 2017 -0700 +++ b/source/common/primitives.hTue May 02 09:58:13 2017 +0530 @@ -202,6 +202,18 @@ typedef void (*pelFilterLumaStrong_t)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ); typedef void (*pelFilterChroma_t)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ); >> + typedef void(*integral4h_t)(uint32_t *sum, pixel *pix, intptr_t stride); >> +typedef void(*integral8h_t)(uint32_t *sum, pixel *pix, intptr_t stride); >> +typedef void(*integral12h_t)(uint32_t *sum, pixel *pix, intptr_t stride); >> +typedef void(*integral16h_t)(uint32_t *sum, pixel *pix, intptr_t stride); >> +typedef void(*integral24h_t)(uint32_t *sum, pixel *pix, intptr_t 
stride); >> +typedef void(*integral32h_t)(uint32_t *sum, pixel *pix, intptr_t stride); >> + typedef void(*integral4v_t)(uint32_t *sum, intptr_t stride); >> +typedef void(*integral8v_t)(uint32_t *sum, intptr_t stride); >> +typedef void(*integral12v_t)(uint32_t *sum, intptr_t stride); >> +typedef void(*integral16v_t)(uint32_t *sum, intptr_t stride); >> +typedef void(*integral24v_t)(uint32_t *sum, intptr_t stride); >> +typedef void(*integral32v_t)(uint32_t *sum, intptr_t stride); Just needed two typedef here, one for horitontal and one for vertical rest of the typedef are redudent here. /* Function pointers to optimized encoder primitives. Each pointer can reference * either an assembly routine, a SIMD intrinsic primitive, or a C function */ @@ -342,6 +354,19 @@ pelFilterLumaStrong_t pelFilterLumaStrong[2]; // EDGE_VER = 0, EDGE_HOR = 1 pelFilterChroma_t pelFilterChroma[2]; // EDGE_VER = 0, EDGE_HOR = 1 >> +integral4h_tintegral_init4h; >> +integral8h_tintegral_init8h; >> +integral12h_tintegral_init12h; >> +integral16h_tintegral_init16h; >> +integral24h_tintegral_init24h; >> +integral32h_tintegral_init32h; >> +integral4v_tintegral_init4v; >> +integral8v_tintegral_init8v; >> +integral12v_tintegral_init12v; >> +integral16v_tintegral_init16v; >> +integral24v_tintegral_init24v; >> +integral32v_tintegral_init32v; >> + An array of appropiate size for horizontal and another for vertical. /* There is one set of chroma primitives per color space. An encoder will * have just a single color space and thus it will only ever use one entry * in this array. 
However we always fill all entries in the array in case diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Sat Apr 22 17:00:28 2017 -0700 +++ b/source/common/x86/asm-primitives.cpp Tue May 02 09:58:13 2017 +0530 @@ -114,6 +114,7 @@ #include "blockcopy8.h" #include "intrapred.h" #include "dct8.h" +#include "seaintegral.h" } #define ALL_LUMA_CU_TYPED(prim, fncdef, fname, cpu) \ diff -r 5bc5e73760cd -r cb67dffd0e2a source/common/x86/seaintegral.h --- /dev/null Thu Jan 01 00:00:00 1970 + +++ b/source/common/x86/seaintegral.h Tue May 02 09:58:13 2017 +0530 @@ -0,0 +1,41 @@ +/** *** +* Copyright (C) 2013-2017 MulticoreWare, Inc +* +* Authors: Vignesh V Menon +* Jayashri Murugan +* +* This program is free software; you can redistribute it and/or modify +* it under the
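Note on the review comments in the message above: the reviewer asks for one typedef per direction and one array per direction instead of twelve typedefs and twelve members. A minimal sketch of that layout follows, assuming an 8-bit pixel type. The enum, the struct name and the helper integralIndex() are hypothetical and are not taken from the patch or from x265.

```cpp
#include <stdint.h>

typedef uint8_t pixel; // assumption: 8-bit build; high bit depth builds use a wider type

// One signature per direction, shared by all kernel sizes.
typedef void (*integralh_t)(uint32_t* sum, pixel* pix, intptr_t stride);
typedef void (*integralv_t)(uint32_t* sum, intptr_t stride);

// Hypothetical index for the six kernel sizes named in the patch.
enum IntegralSize
{
    INTEGRAL_4, INTEGRAL_8, INTEGRAL_12, INTEGRAL_16, INTEGRAL_24, INTEGRAL_32,
    NUM_INTEGRAL_SIZES
};

// Hypothetical stand-in for the relevant part of EncoderPrimitives.
struct EncoderPrimitivesSketch
{
    integralh_t integral_inith[NUM_INTEGRAL_SIZES]; // horizontal kernels
    integralv_t integral_initv[NUM_INTEGRAL_SIZES]; // vertical kernels
};

// Map a kernel size in pixels to its array slot; returns -1 for unsupported sizes.
constexpr int integralIndex(int size)
{
    switch (size)
    {
    case 4:  return INTEGRAL_4;
    case 8:  return INTEGRAL_8;
    case 12: return INTEGRAL_12;
    case 16: return INTEGRAL_16;
    case 24: return INTEGRAL_24;
    case 32: return INTEGRAL_32;
    default: return -1;
    }
}
```

With this layout a caller would pick a kernel with something like p.integral_inith[integralIndex(16)], and the setup code fills two arrays rather than twelve separate members.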
Re: [x265] Interested in fast popcnt substitute below SSE4.2?
Hi Mario,

Sorry for the late reply. You have shared interesting and useful information. We are currently doing some experimental refactoring of the ASM code base, so it might take some time. Hoping to receive more posts like this.

Regards,
Praveen Tiwari

On Wed, Mar 1, 2017 at 8:21 PM, Mario *LigH* Rohkrämer <cont...@ligh.de> wrote:
> Apparently not interesting...
>
> On 23.02.2017 at 10:05, Mario *LigH* Rohkrämer <cont...@ligh.de> wrote:
>> Another point of view on this matter:
>>
>> http://danluu.com/assembly-intrinsics/
>>
>> It seems to put the impact into perspective.
>>
>> I don't know if you already knew about all this before...
>>
>> On 22.02.2017 at 13:39, Mario *LigH* Rohkrämer <cont...@ligh.de> wrote:
>>> http://wm.ite.pl/articles/sse-popcount.html
>>>
>>> May even be faster than the popcnt instruction implemented in a
>>> supporting CPU!
>>>
>>> Found via a German "conspiracy news" blog (no, that's not at all meant
>>> seriously) which sometimes also mentions computer security issues and
>>> interesting programming challenges: https://blog.fefe.de/?ts=a653b91f
>
> --
> Fun and success!
> Mario *LigH* Rohkrämer
> mailto:cont...@ligh.de
Re: [x265] [PATCH 1 of 9] pcs: update design to have 'm_achivedFps' for every PCS Instance
Please, ignore this patch. Thanks. On Thu, Nov 17, 2016 at 8:51 PM, <prav...@multicorewareinc.com> wrote: > # HG changeset patch > # User Praveen Tiwari <prav...@multicorewareinc.com> > # Date 1479128885 -19800 > # Mon Nov 14 18:38:05 2016 +0530 > # Branch stable > # Node ID 8defd4e7b2e4875247e4ec95e0dd3b9630983526 > # Parent bdf273f9521784ceeda868222d415303a0bcf58b > pcs: update design to have 'm_achivedFps' for every PCS Instance > > diff -r bdf273f95217 -r 8defd4e7b2e4 source/api-uhdkit.cpp > --- a/source/api-uhdkit.cpp Tue Nov 08 14:20:24 2016 +0530 > +++ b/source/api-uhdkit.cpp Mon Nov 14 18:38:05 2016 +0530 > @@ -206,8 +206,6 @@ > return -1; > if (numEncoded > 0) > { > -uhdkitEnc->m_achievedFps = numEncoded * 100.0 / > (double)(endTime - startTime); > -uhdkitEnc->m_achievedFps = uhdkitEnc->m_achievedFps / > uhdkitEnc->m_param->gops; // Achieved fps for each gop encoder > uhdkitEnc->m_encodedFrameCount += numEncoded; > controllerIndex = ((uhdkitEnc->m_encodedFrameCount - 1) / > uhdkitEnc->m_param->x265Param->keyframeMax) % uhdkitEnc->m_param->gops; > X265_CHECK(controllerIndex >= 0 && controllerIndex < > uhdkitEnc->m_param->gops, "Invalid controllerIndex: %d, must be between 0 > and %d\n", controllerIndex, uhdkitEnc->m_param->gops); > diff -r bdf273f95217 -r 8defd4e7b2e4 source/pcs/api-pcs.cpp > --- a/source/pcs/api-pcs.cppTue Nov 08 14:20:24 2016 +0530 > +++ b/source/pcs/api-pcs.cppMon Nov 14 18:38:05 2016 +0530 > @@ -211,6 +211,7 @@ > m_pcsParam->statusPrintInterval = param->statusPrintInterval; > m_curTimeStamp = m_lastTimeStamp = X265_NS::x265_mdate(); > m_framesWindow = 1; > +m_achievedFps = 0.0; > m_outFrameCountOfLastAccumulatorReset = 0; > time(_lastStatusOutputTime); > > @@ -289,11 +290,11 @@ > int64_t elapsedEncTime = m_curTimeStamp - m_lastTimeStamp; > int controllerIndex = ((uhdkitEnc->m_encodedFrameCount - 1) / > uhdkitEnc->m_param->x265Param->keyframeMax) % uhdkitEnc->m_param->gops; > X265_CHECK(controllerIndex >= 0 && controllerIndex < > 
uhdkitEnc->m_param->gops, "Invalid controllerIndex: %d, must be between 0 > and %d\n", controllerIndex, uhdkitEnc->m_param->gops); > -if (((m_bScenecut == 1) && elapsedEncTime > 0) || elapsedEncTime > >= 30 || uhdkitEnc->m_achievedFps < m_pcsParam->fpsSetPoint) > +if (((m_bScenecut == 1) && elapsedEncTime > 0) || elapsedEncTime > >= 30 || m_achievedFps < m_pcsParam->fpsSetPoint) > { > // Don't allow outrageously high frame rate measurements to > skew the controller. > -uhdkitEnc->m_achievedFps = X265_MIN(uhdkitEnc->m_achievedFps, > 4 * m_pcsParam->fpsSetPoint); > -error = (m_pcsParam->fpsSetPoint - uhdkitEnc->m_achievedFps) > / m_pcsParam->fpsSetPoint; > +m_achievedFps = X265_MIN(m_achievedFps, 4 * > m_pcsParam->fpsSetPoint); > +error = (m_pcsParam->fpsSetPoint - m_achievedFps) / > m_pcsParam->fpsSetPoint; > > if (m_pcsParam->integralReset > 0) > { > @@ -308,7 +309,7 @@ > { > double lowerBound = (m_pcsParam->fpsSetPoint * > SATURATION_RANGE_MIN) / 100.0; /* Lower bound, 3% of set-point */ > double upperBound = (m_pcsParam->fpsSetPoint * > SATURATION_RANGE_MAX) / 100.0; /* Upper bound, 10% of set-point */ > -double fpsDiff = (uhdkitEnc->m_achievedFps - > m_pcsParam->fpsSetPoint); > +double fpsDiff =(m_achievedFps - > m_pcsParam->fpsSetPoint); > resetErrorAccumulater = (fpsDiff >= lowerBound && fpsDiff > <= upperBound) || m_bScenecut; /* Steady state, or scenecut */ > } > > @@ -388,7 +389,7 @@ > m_outFrameCountOfLastAccumulatorReset = uhdkitEnc->m_ > encodedFrameCount; > m_lastTimeStamp = m_curTimeStamp; > if (uhdkitEnc->m_reconfigParam->logLevel == UHDKIT_LOG_INFO) > - > uhdkit_pcs_printStatus(>m_reconfigParam[controllerIndex], > uhdkitEnc->m_achievedFps); > + > uhdkit_pcs_printStatus(>m_reconfigParam[controllerIndex], > m_achievedFps); > } > return true; > } > @@ -398,6 +399,11 @@ > m_bScenecut = pic->frameData.bScenecut; > } > > +void pcs::uhdkit_pcs_update_fps(int64_t startTime, int64_t endTime, int &
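Note on the controller hunk quoted above: the achieved frame rate is first clamped to four times the set point and then turned into a relative error. A minimal sketch of just that arithmetic follows. The function names are hypothetical, and only the two expressions are taken from the quoted patch.

```cpp
// Clamp mirrors X265_MIN(m_achievedFps, 4 * m_pcsParam->fpsSetPoint) in the hunk.
constexpr double clampAchievedFps(double achievedFps, double fpsSetPoint)
{
    return achievedFps < 4 * fpsSetPoint ? achievedFps : 4 * fpsSetPoint;
}

// Relative error as in the hunk: (setPoint - achieved) / setPoint, after clamping.
// Positive means the encoder is slower than the set point, negative means faster.
constexpr double fpsError(double achievedFps, double fpsSetPoint)
{
    return (fpsSetPoint - clampAchievedFps(achievedFps, fpsSetPoint)) / fpsSetPoint;
}
```

Because of the clamp the error can never go below -3.0, which is what keeps a single very fast measurement from dominating the controller.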
Re: [x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)
Hi Min, Can you please verify for VC12 ? I double checked on this I am getting different output for this patch. 8-bit encoded file size is same but different binary (compared using beyond compare), 10 and 12 bit size and binary both are different. I applied you patch build once (like 8 bit build) and collected all depth outputs (8, 10 and 12), compared with three builds of x265 i.e 8 bit, 10 bit and 12 bit. Regards, Praveen On Fri, Sep 23, 2016 at 2:47 AM, chen <chenm...@163.com> wrote: > Hi Praveen, > > I test your cmdlind on my VS2008 build. > I build three bit-depth version and compare with one bit-depth version, > but the output are still matched in both 10 and 12 bit. > > Regards, > Min > > At 2016-09-22 14:39:50,"Praveen Tiwari" <prav...@multicorewareinc.com> > wrote: > > Hi Min, > > After this patch outputs are changing, tested for following command line > for 10-bit and 12-bit outputs. > > --input=NebutaFestival_2560x1600_60_10bit_crop.yuv --input-res=2560x1600 > --fps=60 --numa-pools="NULL" --output-depth=12 --hash=1 -o NFOut12.hevc > > > > > Regards, > Praveen > > On Thu, Sep 15, 2016 at 1:55 AM, chen <chenm...@163.com> wrote: > >> From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001 >> From: Min Chen <min.c...@multicorewareinc.com> >> Date: Wed, 14 Sep 2016 15:23:38 -0500 >> Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL >> (Workaround) >> >> --- >> source/CMakeLists.txt | 40 +++- >> 1 files changed, 39 insertions(+), 1 deletions(-) >> >> diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt >> index dd19d28..c2c2f7f 100644 >> --- a/source/CMakeLists.txt >> +++ b/source/CMakeLists.txt >> @@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in" >> configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in" >> "${PROJECT_BINARY_DIR}/x265_config.h") >> >> + >> SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" >> "${CMAKE_MODULE_PATH}") >> >> # System architecture detection >> @@ -396,6 +397,39 @@ if(WIN32) 
>> endif(WINXP_SUPPORT) >> endif() >> >> + >> +if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT) >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?setParamAspectRatio@x265 >> @@YAXPEAUx265_param@@HH@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?getParamAspectRatio@x265 >> @@YAXPEAUx265_param@@AEAH1@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log_file@x265 >> @@YAXPEBUx265_param@@PEBDH1ZZ\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log@x265 >> @@YAXPEBUx265_param@@PEBDH1ZZ\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def >> "?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def >> "?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def >> "?x265_api_query@x265_10bit@@YAPEBUx265_api@@HHPEAH@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def >> "?x265_api_query@x265_12bit@@YAPEBUx265_api@@HHPEAH@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265 >> @@YA_JXZ\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def >> "?x265_picturePlaneSize@x265@@YAI@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265 >> @@YANN@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265 >> @@YANN@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_report_simd@x265 >> @@YAXPEAUx265_param@@@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_fopen@x265 >> @@YAPEAU_iobuf@@PEBD0@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_malloc@x265 >> @@YAPEAX_K@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265 >> @@YAXPEAX@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_atoi@x265 >> @@YAHPEBDAEA_N@Z\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?start@Thread@x265@ >> @QEAA_NXZ\n") >> +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?stop@Thread@x265@ >> @QEAAXXZ\n") >> +file(APPEND 
${PROJECT_BINARY_DIR}/x265.def "??0Thre
Re: [x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)
Hi Min, After this patch outputs are changing, tested for following command line for 10-bit and 12-bit outputs. --input=NebutaFestival_2560x1600_60_10bit_crop.yuv --input-res=2560x1600 --fps=60 --numa-pools="NULL" --output-depth=12 --hash=1 -o NFOut12.hevc Regards, Praveen On Thu, Sep 15, 2016 at 1:55 AM, chenwrote: > From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001 > From: Min Chen > Date: Wed, 14 Sep 2016 15:23:38 -0500 > Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL > (Workaround) > > --- > source/CMakeLists.txt | 40 +++- > 1 files changed, 39 insertions(+), 1 deletions(-) > > diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt > index dd19d28..c2c2f7f 100644 > --- a/source/CMakeLists.txt > +++ b/source/CMakeLists.txt > @@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in" > configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in" > "${PROJECT_BINARY_DIR}/x265_config.h") > > + > SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" > "${CMAKE_MODULE_PATH}") > > # System architecture detection > @@ -396,6 +397,39 @@ if(WIN32) > endif(WINXP_SUPPORT) > endif() > > + > +if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT) > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?setParamAspectRatio@x265 > @@YAXPEAUx265_param@@HH@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?getParamAspectRatio@x265 > @@YAXPEAUx265_param@@AEAH1@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log_file@x265@@ > YAXPEBUx265_param@@PEBDH1ZZ\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?general_log@x265@@ > YAXPEBUx265_param@@PEBDH1ZZ\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def > "?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def > "?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_api_query@x265_10bit > @@YAPEBUx265_api@@HHPEAH@Z\n") > +file(APPEND 
${PROJECT_BINARY_DIR}/x265.def "?x265_api_query@x265_12bit > @@YAPEBUx265_api@@HHPEAH@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265 > @@YA_JXZ\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def > "?x265_picturePlaneSize@x265@@YAI@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265 > @@YANN@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265 > @@YANN@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_report_simd@x265@@ > YAXPEAUx265_param@@@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_fopen@x265@@YAPEAU_ > iobuf@@PEBD0@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_malloc@x265 > @@YAPEAX_K@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265 > @@YAXPEAX@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_atoi@x265 > @@YAHPEBDAEA_N@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?start@Thread@x265@ > @QEAA_NXZ\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?stop@Thread@x265@ > @QEAAXXZ\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??0Thread@x265@@QEAA@XZ > \n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??1Thread@x265@@UEAA@XZ > \n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?g_maxCUDepth@x265 > @@3IA\n") > +if(WINXP_SUPPORT) > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_init@x265@@ > YAHPEAUConditionVariable@1@@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_wait@x265@@ > YAHPEAUConditionVariable@1@PEAU_RTL_CRITICAL_SECTION@@K@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_destroy@x265@@ > YAXPEAUConditionVariable@1@@Z\n") > +file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?cond_broadcast@x265 > @@YAXPEAUConditionVariable@1@@Z\n") > +endif() > +endif() > + > include(version) # determine X265_VERSION and X265_LATEST_TAG > include_directories(. 
common encoder "${PROJECT_BINARY_DIR}") > > @@ -608,7 +642,11 @@ if(ENABLE_CLI) > if(WIN32 OR NOT ENABLE_SHARED OR INTEL_CXX) > # The CLI cannot link to the shared library on Windows, it > # requires internal APIs not exported from the DLL > -target_link_libraries(cli x265-static ${PLATFORM_LIBS}) > +if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT) > +target_link_libraries(cli x265-shared ${PLATFORM_LIBS}) > +else() > +target_link_libraries(cli x265-static ${PLATFORM_LIBS}) > +endif() > else() > target_link_libraries(cli x265-shared ${PLATFORM_LIBS}) > endif() > -- > 1.7.9.msysgit.0 > > > ___ > x265-devel mailing list > x265-devel@videolan.org > https://mailman.videolan.org/listinfo/x265-devel >
Re: [x265] [PATCH] threadpool.cpp: fix default pool param behaviour, if NULL or “” (default) x265 will use all available threads on each NUMA node
Please ignore this patch; this behaviour is not required for Linux systems. Thanks.

Regards,
Praveen

On Wed, Sep 7, 2016 at 5:19 PM, <prav...@multicorewareinc.com> wrote:
> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1473246754 -19800
> #      Wed Sep 07 16:42:34 2016 +0530
> # Node ID 9587a394ba58a2c3a579db5fb3f7531daf49559b
> # Parent  df559450949bd085b0fc5e01332aa8458af2fa43
> threadpool.cpp: fix default pool param behaviour, if NULL or “” (default)
> x265 will use all available threads on each NUMA node
>
> diff -r df559450949b -r 9587a394ba58 source/common/threadpool.cpp
> --- a/source/common/threadpool.cpp Wed Aug 10 13:26:18 2016 +0530
> +++ b/source/common/threadpool.cpp Wed Sep 07 16:42:34 2016 +0530
> @@ -330,8 +330,8 @@
>              {
>                  for (int j = i; j < numNumaNodes; j++)
>                  {
> -                    threadsPerPool[numNumaNodes] += cpusPerNode[j];
> -                    nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << j);
> +                    threadsPerPool[j] += cpusPerNode[j];
> +                    nodeMaskPerPool[j] |= ((uint64_t)1 << j);
>                  }
>                  break;
>              }
> @@ -366,8 +366,8 @@
>      {
>          for (int i = 0; i < numNumaNodes; i++)
>          {
> -            threadsPerPool[numNumaNodes] += cpusPerNode[i];
> -            nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << i);
> +            threadsPerPool[i] += cpusPerNode[i];
> +            nodeMaskPerPool[i] |= ((uint64_t)1 << i);
>          }
>      }
>
Re: [x265] [PATCH] threadpool: fix warning: ‘int popCount(uint64_t)’ defined but not used [-Wunused-function]
As far as I remember, some NUMA functionality requires Windows 7 at minimum; it is not supported on earlier versions of Windows.

Regards,
Praveen

On Mon, May 30, 2016 at 6:43 PM, Mateusz <mateu...@poczta.onet.pl> wrote:
> There is a serious bug in the threadpool code that prevents it from working
> on Windows XP/Vista.
> VS 2015 error when compiling for 32-bit Windows XP:
> (ClCompile target) ->
>   I:\vs\x265\source\common\threadpool.cpp(590): error C3861:
>   'GetNumaNodeProcessorMaskEx': identifier not found
>   [I:\vs\x265\ma\8-b\common\common.vcxproj]
>
> Did you see patch https://patches.videolan.org/patch/13495/ (it also fixes
> this warning)?
>
> On 2016-05-30 at 14:45, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari <prav...@multicorewareinc.com>
> > # Date 1464585837 -19800
> > #      Mon May 30 10:53:57 2016 +0530
> > # Node ID b8dbe8d7c09e7fc0b7cce236569fc5df2eb70b1e
> > # Parent  aeade2e8d8688ebffb8455b8948d89d6a72e2c38
> > threadpool: fix warning: ‘int popCount(uint64_t)’ defined but not used [-Wunused-function]
> >  static int popCount(uint64_t x)
> >
> > diff -r aeade2e8d868 -r b8dbe8d7c09e source/common/threadpool.cpp
> > --- a/source/common/threadpool.cpp Thu May 26 16:45:09 2016 +0530
> > +++ b/source/common/threadpool.cpp Mon May 30 10:53:57 2016 +0530
> > @@ -68,6 +68,7 @@
> >  # define strcasecmp _stricmp
> >  #endif
> >
> > +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
> >  const uint64_t m1 = 0x; //binary: 0101...
> >  const uint64_t m2 = 0x; //binary: 00110011..
> >  const uint64_t m3 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
> > @@ -80,6 +81,7 @@
> >      x = (x + (x >> 4)) & m3;
> >      return (x * h01) >> 56;
> >  }
> > +#endif
> >
> >  namespace X265_NS {
> >  // x265 private namespace
Re: [x265] [PATCH 1 of 7] threadpool.cpp: get correct CPU count for multisocket machines -> windows system fix
Hi,

I am combining these patches into a single patch along with some updates, so please ignore them. On top of that I will update Mateusz's patch (CLI: new logic for '--pools' option) to avoid merge conflicts. Thanks.

Regards,
Praveen

On Fri, May 20, 2016 at 4:31 PM, <prav...@multicorewareinc.com> wrote:
> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1463655478 -19800
> #      Thu May 19 16:27:58 2016 +0530
> # Node ID 9a6ab28b736e1167ac26977d7da8ab2d23cc296f
> # Parent  aca781339b4c8dae94ff7da73f18cd4439757e87
> threadpool.cpp: get correct CPU count for multisocket machines -> windows system fix
>
> diff -r aca781339b4c -r 9a6ab28b736e source/common/threadpool.cpp
> --- a/source/common/threadpool.cpp Tue May 10 15:33:17 2016 +0530
> +++ b/source/common/threadpool.cpp Thu May 19 16:27:58 2016 +0530
> @@ -64,6 +64,19 @@
>  # define strcasecmp _stricmp
>  #endif
>
> +const uint64_t m1 = 0x; //binary: 0101...
> +const uint64_t m2 = 0x; //binary: 00110011..
> +const uint64_t m3 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
> +const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...
> +
> +int popCount(uint64_t x)
> +{
> +    x -= (x >> 1) & m1;
> +    x = (x & m2) + ((x >> 2) & m2);
> +    x = (x + (x >> 4)) & m3;
> +    return (x * h01) >> 56;
> +}
> +
>  namespace X265_NS {
>  // x265 private namespace
>
> @@ -525,9 +538,17 @@
>  int ThreadPool::getCpuCount()
>  {
>  #if _WIN32
> -    SYSTEM_INFO sysinfo;
> -    GetSystemInfo();
> -    return sysinfo.dwNumberOfProcessors;
> +    enum { MAX_NODE_NUM = 127 };
> +    int cpus = 0;
> +    int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);
> +    PGROUP_AFFINITY groupAffinityPointer = new GROUP_AFFINITY;
> +    for (int i = 0; i < numNumaNodes; i++)
> +    {
> +        GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer);
> +        cpus += popCount(groupAffinityPointer->Mask);
> +    }
> +    delete groupAffinityPointer;
> +    return cpus;
>  #elif __unix__ && X265_ARCH_ARM
>      /* Return the number of processors configured by OS. Because, most embedded linux distributions
>       * uses only one processor as the scheduler doesn't have enough work to utilize all processors */
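Note on popCount() in the patch above: it is the usual parallel bit-count trick (pairs, then nibbles, then a multiply that sums the bytes). The archived copy shows the first two masks only as "0x;", so the sketch below uses the standard mask values for this technique. They are an assumption about the patch, not recovered from it, and the name popCount64 is hypothetical.

```cpp
#include <stdint.h>

// Standard masks for the parallel bit count (assumed, see note above).
constexpr uint64_t M1  = 0x5555555555555555ULL; // binary: 0101...
constexpr uint64_t M2  = 0x3333333333333333ULL; // binary: 00110011...
constexpr uint64_t M4  = 0x0f0f0f0f0f0f0f0fULL; // binary: 4 zeros, 4 ones ...
constexpr uint64_t H01 = 0x0101010101010101ULL; // the sum of 256 to the power of 0,1,2,3...

// Count the set bits of a 64-bit mask without a popcnt instruction.
constexpr int popCount64(uint64_t x)
{
    x -= (x >> 1) & M1;                       // each 2-bit field holds its own bit count
    x = (x & M2) + ((x >> 2) & M2);           // each 4-bit field holds its bit count
    x = (x + (x >> 4)) & M4;                  // each byte holds its bit count
    return static_cast<int>((x * H01) >> 56); // top byte of the product is the sum of all bytes
}
```

Applied to a per-node affinity mask, as the patch does with groupAffinityPointer->Mask, the result is the number of logical processors in that mask.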
Re: [x265] [PATCH] ThreadPool.cpp: fix getCpuCount function for windows systems
Please ignore this patch; I am sending an updated one. Thanks.

Regards,
Praveen

On Tue, May 17, 2016 at 7:17 PM, <prav...@multicorewareinc.com> wrote:
> # HG changeset patch
> # User Praveen Tiwari <prav...@multicorewareinc.com>
> # Date 1463492830 -19800
> #      Tue May 17 19:17:10 2016 +0530
> # Node ID cf3c2e0dce0997a499ae1d50fda6891cae83e685
> # Parent  372fc5b12ed6003f8784702956ccf7203ea68a2e
> ThreadPool.cpp: fix getCpuCount function for windows systems
>
> diff -r 372fc5b12ed6 -r cf3c2e0dce09 source/common/threadpool.cpp
> --- a/source/common/threadpool.cpp Tue May 17 19:06:36 2016 +0530
> +++ b/source/common/threadpool.cpp Tue May 17 19:17:10 2016 +0530
> @@ -545,9 +545,17 @@
>  int ThreadPool::getCpuCount()
>  {
>  #if _WIN32
> -    SYSTEM_INFO sysinfo;
> -    GetSystemInfo();
> -    return sysinfo.dwNumberOfProcessors;
> +    enum { MAX_NODE_NUM = 127 };
> +    int cpus = 0;
> +    int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM);
> +    PGROUP_AFFINITY groupAffinityPointer = new GROUP_AFFINITY;
> +    for (int i = 0; i < numNumaNodes; i++)
> +    {
> +        GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer);
> +        cpus += (int)bitCount(groupAffinityPointer->Mask);
> +    }
> +    delete groupAffinityPointer;
> +    return cpus;
>  #elif __unix__ && X265_ARCH_ARM
>      /* Return the number of processors configured by OS. Because, most embedded linux distributions
>       * uses only one processor as the scheduler doesn't have enough work to utilize all processors */
Re: [x265] [PATCH] ThreadPool.cpp: fix core count for windows machines
Please ignore this sending updated patch. Thanks Regards, Praveen On Tue, May 17, 2016 at 8:01 PM, Pradeep Ramachandran < prad...@multicorewareinc.com> wrote: > > On Tue, May 17, 2016 at 7:07 PM, <prav...@multicorewareinc.com> wrote: > >> # HG changeset patch >> # User Praveen Tiwari <prav...@multicorewareinc.com> >> # Date 1463492196 -19800 >> # Tue May 17 19:06:36 2016 +0530 >> # Node ID 372fc5b12ed6003f8784702956ccf7203ea68a2e >> # Parent e5b5bdc3c154f908706fb75e006f9abf9b3de96f >> ThreadPool.cpp: fix core count for windows machines >> >> diff -r e5b5bdc3c154 -r 372fc5b12ed6 source/common/threadpool.cpp >> --- a/source/common/threadpool.cpp Sat May 14 07:29:46 2016 +0530 >> +++ b/source/common/threadpool.cpp Tue May 17 19:06:36 2016 +0530 >> @@ -27,6 +27,7 @@ >> #include "threading.h" >> >> #include >> +#include >> >> #if X86_64 >> >> @@ -64,6 +65,18 @@ >> # define strcasecmp _stricmp >> #endif >> >> +uint64_t bitCount(uint64_t value) >> +{ >> +uint64_t count = 0; >> +while (value > 0) // until all bits are zero >> +{ >> +if ((value & 1) == 1) // check lower bit >> +count++; >> +value >>= 1; // shift bits, removing lower bit >> +} >> +return count; >> +} >> + >> namespace X265_NS { >> // x265 private namespace >> >> @@ -238,7 +251,6 @@ >> memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool)); >> >> int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM); >> -int cpuCount = getCpuCount(); >> bool bNumaSupport = false; >> >> #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 >> @@ -248,20 +260,28 @@ >> #endif >> >> >> +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 >> +PGROUP_AFFINITY groupAffinityPointer = new GROUP_AFFINITY; >> +for (int i = 0; i < numNumaNodes; i++) >> +{ >> +GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer); >> +cpusPerNode[i] = (int)bitCount(groupAffinityPointer->Mask); >> +} >> +delete groupAffinityPointer; >> +#elif HAVE_LIBNUMA >> +int cpuCount = getCpuCount(); >> > > Can we move to the cleaner 
implementation of not relying on CPU counts for > non-windows platforms also? > > >> for (int i = 0; i < cpuCount; i++) >> { >> -#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 >> -UCHAR node; >> -if (GetNumaProcessorNode((UCHAR)i, )) >> -cpusPerNode[X265_MIN(node, (UCHAR)MAX_NODE_NUM)]++; >> -else >> -#elif HAVE_LIBNUMA >> if (bNumaSupport >= 0) >> cpusPerNode[X265_MIN(numa_node_of_cpu(i), MAX_NODE_NUM)]++; >> -else >> +} >> +#elif >> +int cpuCount = getCpuCount(); >> +for (int i = 0; i < cpuCount; i++) >> +{ >> +cpusPerNode[0]++; >> +} >> > > How about cpusPerNode[0] = getCpuCount() here? The for loop is unnecessary. > > >> #endif >> -cpusPerNode[0]++; >> -} >> >> if (bNumaSupport && p->logLevel >= X265_LOG_DEBUG) >> for (int i = 0; i < numNumaNodes; i++) >> ___ >> x265-devel mailing list >> x265-devel@videolan.org >> https://mailman.videolan.org/listinfo/x265-devel >> > > > ___ > x265-devel mailing list > x265-devel@videolan.org > https://mailman.videolan.org/listinfo/x265-devel > > ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel
Re: [x265] [PATCH] motion.cpp: optimize 'X265_DIA_SEARCH' by eliminating costly branch instructions
Yes, this is meant to eliminate the if...else; the expression still performs a conditional assignment, so the code stays correct. I will try to update the macro definition instead. Thanks.

-----Original Message-----
From: "chen" <chenm...@163.com>
Sent: 09-03-2016 05:52
To: "Development for x265" <x265-devel@videolan.org>
Subject: Re: [x265] [PATCH] motion.cpp: optimize 'X265_DIA_SEARCH' by eliminating costly branch instructions

I suggest you modify the macro instead.
Also, this patch depends on the side effect of a conditional expression, which is bad code style.

At 2016-03-08 22:48:49, prav...@multicorewareinc.com wrote:
># HG changeset patch
># User Praveen Tiwari <prav...@multicorewareinc.com>
># Date 1457448163 -19800
>#      Tue Mar 08 20:12:43 2016 +0530
># Node ID 519441d72cf723dc3b279a91a6080f329729cb49
># Parent  0e1b6472c05e3a53538d8e064e502d8a7508eb6e
>motion.cpp: optimize 'X265_DIA_SEARCH' by eliminating costly branch instructions
>
>diff -r 0e1b6472c05e -r 519441d72cf7 source/encoder/motion.cpp
>--- a/source/encoder/motion.cpp Tue Mar 08 19:08:57 2016 +0530
>+++ b/source/encoder/motion.cpp Tue Mar 08 20:12:43 2016 +0530
>@@ -659,10 +659,10 @@
>         do
>         {
>             COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs);
>-            COPY1_IF_LT(bcost, (costs[0] << 4) + 1);
>-            COPY1_IF_LT(bcost, (costs[1] << 4) + 3);
>-            COPY1_IF_LT(bcost, (costs[2] << 4) + 4);
>-            COPY1_IF_LT(bcost, (costs[3] << 4) + 12);
>+            (((costs[0] << 4) + 1) < bcost) && (bcost = ((costs[0] << 4) + 1)); // if ((y) < (x)) (x) = (y);
>+            (((costs[1] << 4) + 3) < bcost) && (bcost = ((costs[1] << 4) + 3));
>+            (((costs[2] << 4) + 4) < bcost) && (bcost = ((costs[2] << 4) + 4));
>+            (((costs[3] << 4) + 12) < bcost) && (bcost = ((costs[3] << 4) + 12));
>             if (!(bcost & 15))
>                 break;
>             bmv.x -= (bcost << 28) >> 30;
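Note on the exchange above: the review asks for the change to live in the macro rather than in hand-written `&&` expressions. One possible shape is sketched below, assuming the current macro behaves like the form quoted in the patch comment (if ((y) < (x)) (x) = (y);). The names copyIfLess and diaStep are hypothetical. Whether the conditional expression compiles to a branch-free select depends on the compiler and target, so this is not a guaranteed speedup.

```cpp
// Keep the smaller of the two values; a plain conditional expression
// with no reliance on the side effect of a short-circuit operator.
constexpr int copyIfLess(int best, int candidate)
{
    return candidate < best ? candidate : best;
}

// Hypothetical macro replacement with the same call syntax as COPY1_IF_LT.
#define COPY1_IF_LT_SKETCH(x, y) ((x) = copyIfLess((x), (y)))

// One diamond-search step as in the quoted hunk: each candidate cost is shifted
// left by 4 and tagged in its low 4 bits (1, 3, 4, 12) so the winner can be decoded.
constexpr int diaStep(int bcost, int c0, int c1, int c2, int c3)
{
    COPY1_IF_LT_SKETCH(bcost, (c0 << 4) + 1);
    COPY1_IF_LT_SKETCH(bcost, (c1 << 4) + 3);
    COPY1_IF_LT_SKETCH(bcost, (c2 << 4) + 4);
    COPY1_IF_LT_SKETCH(bcost, (c3 << 4) + 12);
    return bcost;
}
```

If none of the four candidates wins, the low four bits of the result stay zero (provided bcost entered with them clear), which is exactly what the `if (!(bcost & 15)) break;` test in the hunk checks.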
Re: [x265] [PATCH] param: cleanup, print reconfigured param option along with its old and configured value
Please ignore the patch need to update. Thanks. Regards, Praveen On Tue, Mar 8, 2016 at 10:57 AM, <prav...@multicorewareinc.com> wrote: > # HG changeset patch > # User Praveen Tiwari <prav...@multicorewareinc.com> > # Date 1457356750 -19800 > # Mon Mar 07 18:49:10 2016 +0530 > # Node ID 6f7dbb1c901cb5b5b88cc20c3213906465021338 > # Parent 88aebc166fa8e16f91d5f0acce77690003be9d91 > param: cleanup, print reconfigured param option along with its old and > configured value > > diff -r 88aebc166fa8 -r 6f7dbb1c901c source/common/param.cpp > --- a/source/common/param.cpp Fri Mar 04 16:59:45 2016 +0530 > +++ b/source/common/param.cpp Mon Mar 07 18:49:10 2016 +0530 > @@ -1373,36 +1373,31 @@ > if (!param || !reconfiguredParam) > return; > > -x265_log(param,X265_LOG_INFO, "Reconfigured param options :\n"); > - > -char buf[80] = { 0 }; > char tmp[40]; > -#define TOOLCMP(COND1, COND2, STR, VAL) if (COND1 != COND2) { > sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); } > -TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences, > "ref=%d", reconfiguredParam->maxNumReferences); > -TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, > "max-tu-size=%d", reconfiguredParam->maxTUSize); > -TOOLCMP(param->searchRange, reconfiguredParam->searchRange, > "merange=%d", reconfiguredParam->searchRange); > -TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "subme= > %d", reconfiguredParam->subpelRefine); > -TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "rd=%d", > reconfiguredParam->rdLevel); > -TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "psy-rd=%.2lf", > reconfiguredParam->psyRd); > -TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "rdoq=%d", > reconfiguredParam->rdoqLevel); > -TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "psy-rdoq=%.2lf", > reconfiguredParam->psyRdoq); > -TOOLCMP(param->noiseReductionIntra, > reconfiguredParam->noiseReductionIntra, "nr-intra=%d", > reconfiguredParam->noiseReductionIntra); > 
-TOOLCMP(param->noiseReductionInter, > reconfiguredParam->noiseReductionInter, "nr-inter=%d", > reconfiguredParam->noiseReductionInter); > -TOOLCMP(param->bEnableTSkipFast, reconfiguredParam->bEnableTSkipFast, > "tskip-fast=%d", reconfiguredParam->bEnableTSkipFast); > -TOOLCMP(param->bEnableSignHiding, > reconfiguredParam->bEnableSignHiding, "signhide=%d", > reconfiguredParam->bEnableSignHiding); > -TOOLCMP(param->bEnableFastIntra, reconfiguredParam->bEnableFastIntra, > "fast-intra=%d", reconfiguredParam->bEnableFastIntra); > -if (param->bEnableLoopFilter && (param->deblockingFilterBetaOffset != > reconfiguredParam->deblockingFilterBetaOffset > +#define TOOLCMP(COND1, COND2, STR, OLD_VAL, NEW_VAL) if (COND1 != COND2) > { sprintf(tmp, STR, OLD_VAL, NEW_VAL);} > +TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences, > "[x265] Reconfigure: ref=%d to %d", param->maxNumReferences, > reconfiguredParam->maxNumReferences); > +TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, "[x265] > Reconfigure: max-tu-size=%d to %d", param->maxTUSize, > reconfiguredParam->maxTUSize); > +TOOLCMP(param->searchRange, reconfiguredParam->searchRange, "[x265] > Reconfigure: merange=%d to %d", param->searchRange, > reconfiguredParam->searchRange); > +TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "[x265] > Reconfigure: subme=%d to %d", param->subpelRefine, > reconfiguredParam->subpelRefine); > +TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "[x265] > Reconfigure: rd=%d to %d", param->rdLevel, reconfiguredParam->rdLevel); > +TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "[x265] Reconfigure: > psy-rd=%.2lf to %.2lf", param->psyRd, reconfiguredParam->psyRd); > +TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "[x265] > Reconfigure: rdoq=%d to %d", param->rdoqLevel, > reconfiguredParam->rdoqLevel); > +TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "[x265] > Reconfigure: psy-rdoq=%.2lf to %.2lf", param->psyRdoq, > reconfiguredParam->psyRdoq); 
> +TOOLCMP(param->noiseReductionIntra, > reconfiguredParam->noiseReductionIntra, "[x265] Reconfigure: nr-intra=%d to > %d", param->noiseReductionIntra, reconf
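The reworked TOOLCMP macro quoted above formats an "old to new" message only when a parameter actually changed. A minimal sketch of the same idea as a function, assuming a bounded buffer is wanted; `describeChange` is a hypothetical name and is not part of x265:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not x265 code): returns the formatted message when the
// two values differ, and an empty string when nothing changed.
// snprintf bounds the write, unlike the sprintf call in the quoted macro.
template <typename T>
std::string describeChange(const char* fmt, T oldVal, T newVal)
{
    if (oldVal == newVal)
        return std::string();
    char tmp[128];
    std::snprintf(tmp, sizeof(tmp), fmt, oldVal, newVal);
    return std::string(tmp);
}
```

A function also avoids the dangling-else hazard of a macro that expands to a bare `if (...) { ... }` statement.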
[x265] Fwd: [PATCH] asm: avx2 code for weight_sp() 16bpp
-- Forwarded message -- From: aasaipr...@multicorewareinc.com Date: Mon, Jun 29, 2015 at 4:51 PM Subject: [x265] [PATCH] asm: avx2 code for weight_sp() 16bpp To: x265-devel@videolan.org # HG changeset patch # User Aasaipriya Chandran aasaipr...@multicorewareinc.com # Date 1435562395 -19800 # Mon Jun 29 12:49:55 2015 +0530 # Node ID bebe4e496a432608cf0a9c495debd1970caa387e # Parent 9feee64efa440c25f016d15ae982789e5393a77e asm: avx2 code for weight_sp() 16bpp avx2: weight_sp 11.37x 4496.63 51139.20 sse4: weight_sp 6.48x8163.87 52870.36 diff -r 9feee64efa44 -r bebe4e496a43 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Fri Jun 26 15:29:51 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Mon Jun 29 12:49:55 2015 +0530 @@ -1517,6 +1517,7 @@ p.scale1D_128to64 = PFX(scale1D_128to64_avx2); p.scale2D_64to32 = PFX(scale2D_64to32_avx2); p.weight_pp = PFX(weight_pp_avx2); +p.weight_sp = PFX(weight_sp_avx2); p.sign = PFX(calSign_avx2); p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_avx2); diff -r 9feee64efa44 -r bebe4e496a43 source/common/x86/pixel-util8.asm --- a/source/common/x86/pixel-util8.asm Fri Jun 26 15:29:51 2015 +0530 +++ b/source/common/x86/pixel-util8.asm Mon Jun 29 12:49:55 2015 +0530 @@ -1674,8 +1674,128 @@ dec r5d jnz .loopH RET - -%if ARCH_X86_64 +%endif + + +%if HIGH_BIT_DEPTH +INIT_YMM avx2 +cglobal weight_sp, 6,7,9 +mova m1, [pw_1023] +mova m2, [pw_1] +mov r6d, r7m r7 is 8th register (0-7). so it should be cglobal weight_sp, 6, 8, 9 and ARCH_X86_64 only code. 
+shl r6d, 16 +orr6d, r6m +vpbroadcastd m3, r6d ; m3 = [round w0] +movd xm4, r8m ; m4 = [shift] +vpbroadcastd m5, r9m ; m5 = [offset] + +; correct row stride +add r3d, r3d +add r2d, r2d +mov r6d, r4d +and r6d, ~(mmsize / SIZEOF_PIXEL - 1) +sub r3d, r6d +sub r3d, r6d +sub r2d, r6d +sub r2d, r6d + +; generate partial width mask (MUST BE IN YMM0) +mov r6d, r4d +and r6d, (mmsize / SIZEOF_PIXEL - 1) +movd xm0, r6d +pshuflw m0, m0, 0 +punpcklqdqm0, m0 +vinserti128 m0, m0, xm0, 1 +pcmpgtw m0, [pw_0_15] + +.loopH: +mov r6d, r4d + +.loopW: +movu m6, [r0] +paddw m6, [pw_2000] + +punpcklwd m7, m6, m2 +pmaddwd m7, m3 ;(round w0) +psrad m7, xm4 ;(shift) +paddd m7, m5 ;(offset) + +punpckhwd m6, m2 +pmaddwd m6, m3 +psrad m6, xm4 +paddd m6, m5 + +packusdw m7, m6 +pminuwm7, m1 + +sub r6d, (mmsize / SIZEOF_PIXEL) +jl.width14 +movu [r1], m7 +lea r0, [r0 + mmsize] +lea r1, [r1 + mmsize] +je.nextH +jmp .loopW + +.width14: +add r6d, 16 +cmp r6d, 14 +jl.width12 +movu [r1], xm7 +vextracti128 xm8, m7, 1 +movq [r1 + 16], xm8 +pextrd[r1 + 24], xm8, 2 +je.nextH + +.width12: +cmp r6d, 12 +jl.width10 +movu [r1], xm7 +vextracti128 xm8, m7, 1 +movq [r1 + 16], xm8 +je.nextH + +.width10: +cmp r6d, 10 +jl.width8 +movu [r1], xm7 +vextracti128 xm8, m7, 1 +movd [r1 + 16], xm8 +je.nextH + +.width8: +cmp r6d, 8 +jl.width6 +movu [r1], xm7 +je.nextH + +.width6 +cmp r6d, 6 +jl.width4 +movq [r1], xm7 +pextrd[r1 + 8], xm7, 2 +je.nextH + +.width4: +cmp r6d, 4 +jl
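For readers following the assembly, here is a scalar sketch of the arithmetic the AVX2 loop above appears to perform per pixel. It assumes 10-bit output (the pw_1023 clamp) and an internal offset of 0x2000 (the pw_2000 constant). The function names are illustrative, and the argument order is inferred from the register usage in the patch.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed constants, read off the asm: pw_2000 is added to every input
// sample and pw_1023 is the upper clamp for 10-bit pixels.
constexpr int kInternalOffset = 0x2000;
constexpr int kPixelMax10 = 1023;

// One sample: the pmaddwd on (src, 1) pairs against (w0, round) gives
// w0 * src + round, which is then shifted, offset and clamped.
inline uint16_t weightSample(int16_t src, int w0, int round, int shift, int offset)
{
    int v = ((w0 * (src + kInternalOffset) + round) >> shift) + offset;
    return static_cast<uint16_t>(std::min(std::max(v, 0), kPixelMax10));
}

// Whole block; strides are in elements, not bytes.
void weightSpRef(const int16_t* src, uint16_t* dst, intptr_t srcStride, intptr_t dstStride,
                 int width, int height, int w0, int round, int shift, int offset)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = weightSample(src[x], w0, round, shift, offset);
        src += srcStride;
        dst += dstStride;
    }
}
```

The partial-width labels (.width14, .width12, ...) in the assembly only handle the tail of a row; they do not change this per-pixel formula.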
Re: [x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
You may want to revisit the 8bpp code as well.

Regards,
Praveen

On Mon, Jun 29, 2015 at 11:24 AM, Rajesh Paulraj raj...@multicorewareinc.com wrote:

We don't need to push this patch. I will improve the SSE version for the same sizes. We may not need AVX2 code for this (I will confirm after rewriting the SSE2 code).

On Mon, Jun 29, 2015 at 10:21 AM, Deepthi Nandakumar deep...@multicorewareinc.com wrote:

This does not build with HBD disabled.

On Fri, Jun 26, 2015 at 5:40 PM, Rajesh Paulraj raj...@multicorewareinc.com wrote:

Yes. It looks like we need to optimize the SSE2 code. I will work on this.

On Fri, Jun 26, 2015 at 5:31 PM, Praveen Tiwari prav...@multicorewareinc.com wrote:

-- Forwarded message --
From: raj...@multicorewareinc.com
Date: Fri, Jun 26, 2015 at 3:14 PM
Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
To: x265-devel@videolan.org

# HG changeset patch
# User Rajesh Paulraj raj...@multicorewareinc.com
# Date 1435311076 -19800
# Fri Jun 26 15:01:16 2015 +0530
# Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f
# Parent d64227e54233d1646c55bcb4b0b831e5340009ed
asm: pixelavg_pp[8xN] avx2 code for 10bpp

avx2:
avg_pp[ 8x4] 4.39x 145.09 636.75
avg_pp[ 8x8] 5.33x 215.27 1146.55
avg_pp[ 8x16] 6.50x 336.88 2190.68
avg_pp[ 8x32] 7.71x 579.86 4470.84

sse2:
avg_pp[ 8x4] 2.31x 287.63 663.94
avg_pp[ 8x8] 3.26x 370.21 1205.26
avg_pp[ 8x16] 3.99x 581.63 2323.25
avg_pp[ 8x32] 4.78x 995.79 4755.58

Basically, our pixel_avg_8xN macro is just SSE code (a simple syntax conversion for AVX2 that does not use the 256-bit capability), so fundamentally there should be no major improvement in speed. But the improvements (from 287.63c to 145.09c, from 370.21c to 215.27c, etc.) are quite good. Does that mean the SSE2 code is not well optimized? Can you revisit the SSE code using this algorithm?

diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Thu Jun 25 16:25:51 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Fri Jun 26 15:01:16 2015 +0530 @@ -1362,6 +1362,10 @@ p.cu[BLOCK_32x32].intra_pred[33]= PFX(intra_pred_ang32_33_avx2); p.cu[BLOCK_32x32].intra_pred[34]= PFX(intra_pred_ang32_2_avx2); +p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2); +p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2); +p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2); +p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2); p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2); p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2); p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2); diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm --- a/source/common/x86/mc-a.asmThu Jun 25 16:25:51 2015 +0530 +++ b/source/common/x86/mc-a.asmFri Jun 26 15:01:16 2015 +0530 @@ -4490,6 +4490,88 @@ RET %endif +%macro pixel_avg_W8 0 +movuxm0, [r2] +movuxm1, [r4] +pavgw xm0, xm1 +movu[r0], xm0 +movuxm2, [r2 + r3] +movuxm3, [r4 + r5] +pavgw xm2, xm3 +movu[r0 + r1], xm2 + +movuxm0, [r2 + r3 * 2] +movuxm1, [r4 + r5 * 2] +pavgw xm0, xm1 +movu[r0 + r1 * 2], xm0 +movuxm2, [r2 + r6] +movuxm3, [r4 + r7] +pavgw xm2, xm3 +movu[r0 + r8], xm2 + +lea r0, [r0 + 4 * r1] +lea r2, [r2 + 4 * r3] +lea r4, [r4 + 4 * r5] +%endmacro + +;--- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;--- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_8x4, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +pixel_avg_W8 +RET + +cglobal pixel_avg_8x8, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 2 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x16, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 
3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 4 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x32, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3
[x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
-- Forwarded message -- From: raj...@multicorewareinc.com Date: Fri, Jun 26, 2015 at 3:14 PM Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp To: x265-devel@videolan.org # HG changeset patch # User Rajesh Paulrajraj...@multicorewareinc.com # Date 1435311076 -19800 # Fri Jun 26 15:01:16 2015 +0530 # Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f # Parent d64227e54233d1646c55bcb4b0b831e5340009ed asm: pixelavg_pp[8xN] avx2 code for 10bpp avx2: avg_pp[ 8x4] 4.39x145.09 636.75 avg_pp[ 8x8] 5.33x215.27 1146.55 avg_pp[ 8x16] 6.50x336.88 2190.68 avg_pp[ 8x32] 7.71x579.86 4470.84 sse2: avg_pp[ 8x4] 2.31x287.63 663.94 avg_pp[ 8x8] 3.26x370.21 1205.26 avg_pp[ 8x16] 3.99x581.63 2323.25 avg_pp[ 8x32] 4.78x995.79 4755.58 diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Thu Jun 25 16:25:51 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Fri Jun 26 15:01:16 2015 +0530 @@ -1362,6 +1362,10 @@ p.cu[BLOCK_32x32].intra_pred[33]= PFX(intra_pred_ang32_33_avx2); p.cu[BLOCK_32x32].intra_pred[34]= PFX(intra_pred_ang32_2_avx2); +p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2); +p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2); +p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2); +p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2); p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2); p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2); p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2); diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm --- a/source/common/x86/mc-a.asmThu Jun 25 16:25:51 2015 +0530 +++ b/source/common/x86/mc-a.asmFri Jun 26 15:01:16 2015 +0530 @@ -4490,6 +4490,88 @@ RET %endif +%macro pixel_avg_W8 0 +movuxm0, [r2] +movuxm1, [r4] +pavgw xm0, xm1 +movu[r0], xm0 +movuxm2, [r2 + r3] +movuxm3, [r4 + r5] +pavgw xm2, xm3 +movu[r0 + r1], xm2 + Your macro is not using avx2 capabilities, did you check the performance of two rows combined ? 
It will reduce your pavgw and movu instruction by half. You can use vinserti128 to combine two rows at a time. +movuxm0, [r2 + r3 * 2] +movuxm1, [r4 + r5 * 2] +pavgw xm0, xm1 +movu[r0 + r1 * 2], xm0 +movuxm2, [r2 + r6] +movuxm3, [r4 + r7] +pavgw xm2, xm3 +movu[r0 + r8], xm2 + +lea r0, [r0 + 4 * r1] +lea r2, [r2 + 4 * r3] +lea r4, [r4 + 4 * r5] +%endmacro + +;--- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;--- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_8x4, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +pixel_avg_W8 +RET + +cglobal pixel_avg_8x8, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 2 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x16, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 4 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x32, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 8 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET +%endif + %macro pixel_avg_H4 0 movum0, [r2] movum1, [r4] ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel
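As a point of reference for the pixel_avg_W8 macro discussed above, this is a scalar sketch of the rounding average that pavgw computes in each 16-bit lane. The function name and the explicit width/height parameters are assumptions for illustration; the real primitive encodes the block size in its name.

```cpp
#include <cstdint>

// Rounding average of two 16-bit pixel blocks: (a + b + 1) >> 1 per pixel,
// which is what pavgw produces per lane. Strides are in elements.
void pixelAvgRef(uint16_t* dst, intptr_t dstStride,
                 const uint16_t* src0, intptr_t stride0,
                 const uint16_t* src1, intptr_t stride1,
                 int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = static_cast<uint16_t>((src0[x] + src1[x] + 1) >> 1);
        dst += dstStride;
        src0 += stride0;
        src1 += stride1;
    }
}
```

With 8 pixels of 16 bits each, one row fills exactly one 128-bit register, which is why the macro works on xmm registers even under INIT_YMM.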
Re: [x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
Ah, the row width is only 8 pixels x 16 bits = 128 bits, so processing two rows at a time would also need a vextracti128 when storing, which issues on port 5, a bottleneck port. pavgw is much cheaper than that. You could try combining rows for the 16xN sizes instead.

Regards,
Praveen

On Fri, Jun 26, 2015 at 3:40 PM, Rajesh Paulraj raj...@multicorewareinc.com wrote:

I tried using vinserti128, but it performs worse than this version, so I kept this one.

On Fri, Jun 26, 2015 at 3:37 PM, Praveen Tiwari prav...@multicorewareinc.com wrote:

-- Forwarded message --
From: raj...@multicorewareinc.com
Date: Fri, Jun 26, 2015 at 3:14 PM
Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
To: x265-devel@videolan.org

# HG changeset patch
# User Rajesh Paulraj raj...@multicorewareinc.com
# Date 1435311076 -19800
# Fri Jun 26 15:01:16 2015 +0530
# Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f
# Parent d64227e54233d1646c55bcb4b0b831e5340009ed
asm: pixelavg_pp[8xN] avx2 code for 10bpp

avx2:
avg_pp[ 8x4] 4.39x 145.09 636.75
avg_pp[ 8x8] 5.33x 215.27 1146.55
avg_pp[ 8x16] 6.50x 336.88 2190.68
avg_pp[ 8x32] 7.71x 579.86 4470.84

sse2:
avg_pp[ 8x4] 2.31x 287.63 663.94
avg_pp[ 8x8] 3.26x 370.21 1205.26
avg_pp[ 8x16] 3.99x 581.63 2323.25
avg_pp[ 8x32] 4.78x 995.79 4755.58

diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Thu Jun 25 16:25:51 2015 +0530
+++ b/source/common/x86/asm-primitives.cpp Fri Jun 26 15:01:16 2015 +0530
@@ -1362,6 +1362,10 @@
     p.cu[BLOCK_32x32].intra_pred[33]= PFX(intra_pred_ang32_33_avx2);
     p.cu[BLOCK_32x32].intra_pred[34]= PFX(intra_pred_ang32_2_avx2);
+    p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2);
+    p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2);
+    p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2);
+    p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2);
     p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2);
     p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
     p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
diff -r
d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm --- a/source/common/x86/mc-a.asmThu Jun 25 16:25:51 2015 +0530 +++ b/source/common/x86/mc-a.asmFri Jun 26 15:01:16 2015 +0530 @@ -4490,6 +4490,88 @@ RET %endif +%macro pixel_avg_W8 0 +movuxm0, [r2] +movuxm1, [r4] +pavgw xm0, xm1 +movu[r0], xm0 +movuxm2, [r2 + r3] +movuxm3, [r4 + r5] +pavgw xm2, xm3 +movu[r0 + r1], xm2 + Your macro is not using avx2 capabilities, did you check the performance of two rows combined ? It will reduce your pavgw and movu instruction by half. You can use vinserti128 to combine two rows at a time. +movuxm0, [r2 + r3 * 2] +movuxm1, [r4 + r5 * 2] +pavgw xm0, xm1 +movu[r0 + r1 * 2], xm0 +movuxm2, [r2 + r6] +movuxm3, [r4 + r7] +pavgw xm2, xm3 +movu[r0 + r8], xm2 + +lea r0, [r0 + 4 * r1] +lea r2, [r2 + 4 * r3] +lea r4, [r4 + 4 * r5] +%endmacro + +;--- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;--- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_8x4, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +pixel_avg_W8 +RET + +cglobal pixel_avg_8x8, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 2 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x16, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 4 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x32, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 8 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET +%endif + %macro pixel_avg_H4 0 movum0, [r2] movum1, [r4] ___ x265-devel mailing list x265-devel@videolan.org
[x265] Fwd: [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp
-- Forwarded message -- From: raj...@multicorewareinc.com Date: Fri, Jun 26, 2015 at 3:14 PM Subject: [x265] [PATCH] asm: pixelavg_pp[8xN] avx2 code for 10bpp To: x265-devel@videolan.org # HG changeset patch # User Rajesh Paulrajraj...@multicorewareinc.com # Date 1435311076 -19800 # Fri Jun 26 15:01:16 2015 +0530 # Node ID 956401f1a679f1e71181b704d64e4acdb6f1a93f # Parent d64227e54233d1646c55bcb4b0b831e5340009ed asm: pixelavg_pp[8xN] avx2 code for 10bpp avx2: avg_pp[ 8x4] 4.39x145.09 636.75 avg_pp[ 8x8] 5.33x215.27 1146.55 avg_pp[ 8x16] 6.50x336.88 2190.68 avg_pp[ 8x32] 7.71x579.86 4470.84 sse2: avg_pp[ 8x4] 2.31x287.63 663.94 avg_pp[ 8x8] 3.26x370.21 1205.26 avg_pp[ 8x16] 3.99x581.63 2323.25 avg_pp[ 8x32] 4.78x995.79 4755.58 Basically, our macro pixel_avg_8xN just SSE (just simple syntax conversion for avx2, not using 256 bit capability) so, fundamentally there should be no major improvement in speed. But improvements 287.63c - 145.09c, 370.21c - 215.27 etc are quite good. Does it means SSE2 codes are not optimize well ? Can you revisit SSE code using this algorithm? 
diff -r d64227e54233 -r 956401f1a679 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Thu Jun 25 16:25:51 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Fri Jun 26 15:01:16 2015 +0530 @@ -1362,6 +1362,10 @@ p.cu[BLOCK_32x32].intra_pred[33]= PFX(intra_pred_ang32_33_avx2); p.cu[BLOCK_32x32].intra_pred[34]= PFX(intra_pred_ang32_2_avx2); +p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx2); +p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_avx2); +p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_avx2); +p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_avx2); p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2); p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2); p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2); diff -r d64227e54233 -r 956401f1a679 source/common/x86/mc-a.asm --- a/source/common/x86/mc-a.asmThu Jun 25 16:25:51 2015 +0530 +++ b/source/common/x86/mc-a.asmFri Jun 26 15:01:16 2015 +0530 @@ -4490,6 +4490,88 @@ RET %endif +%macro pixel_avg_W8 0 +movuxm0, [r2] +movuxm1, [r4] +pavgw xm0, xm1 +movu[r0], xm0 +movuxm2, [r2 + r3] +movuxm3, [r4 + r5] +pavgw xm2, xm3 +movu[r0 + r1], xm2 + +movuxm0, [r2 + r3 * 2] +movuxm1, [r4 + r5 * 2] +pavgw xm0, xm1 +movu[r0 + r1 * 2], xm0 +movuxm2, [r2 + r6] +movuxm3, [r4 + r7] +pavgw xm2, xm3 +movu[r0 + r8], xm2 + +lea r0, [r0 + 4 * r1] +lea r2, [r2 + 4 * r3] +lea r4, [r4 + 4 * r5] +%endmacro + +;--- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;--- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_8x4, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +pixel_avg_W8 +RET + +cglobal pixel_avg_8x8, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 2 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x16, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 
3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 4 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET + +cglobal pixel_avg_8x32, 6,10,4 +add r1d, r1d +add r3d, r3d +add r5d, r5d +lea r6, [r3 * 3] +lea r7, [r5 * 3] +lea r8, [r1 * 3] +mov r9d, 8 +.loop +pixel_avg_W8 +dec r9d +jnz .loop +RET +%endif + %macro pixel_avg_H4 0 movum0, [r2] movum1, [r4] ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel
Re: [x265] [PATCH 1 of 3] asm: intra_pred_ang32_33 improved by ~35% over SSE4
Please ignore the duplicate (second) patch; it was sent by mistake.

Regards,
Praveen

On Fri, Mar 27, 2015 at 10:41 AM, prav...@multicorewareinc.com wrote:

# HG changeset patch
# User Praveen Tiwari prav...@multicorewareinc.com
# Date 1427356204 -19800
# Thu Mar 26 13:20:04 2015 +0530
# Branch stable
# Node ID 24bdb3e594556ca6e12ee9dae58100a6bd115d2a
# Parent 3d0f23cb0e58585e490362587022e67cfded143a
asm: intra_pred_ang32_33 improved by ~35% over SSE4

AVX2: intra_ang_32x32[33] 11.11x 2618.69 29084.27
SSE4: intra_ang_32x32[33] 7.59x 4055.42 30792.64

diff -r 3d0f23cb0e58 -r 24bdb3e59455 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Thu Mar 26 15:09:51 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp Thu Mar 26 13:20:04 2015 +0530
@@ -1642,6 +1642,7 @@
     p.cu[BLOCK_32x32].intra_pred[30] = x265_intra_pred_ang32_30_avx2;
     p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2;
     p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2;
+    p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2;
     // copy_sp primitives
     p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
diff -r 3d0f23cb0e58 -r 24bdb3e59455 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h Thu Mar 26 15:09:51 2015 -0500
+++ b/source/common/x86/intrapred.h Thu Mar 26 13:20:04 2015 +0530
@@ -212,6 +212,7 @@
 void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void
x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); diff -r 3d0f23cb0e58 -r 24bdb3e59455 source/common/x86/intrapred8.asm --- a/source/common/x86/intrapred8.asm Thu Mar 26 15:09:51 2015 -0500 +++ b/source/common/x86/intrapred8.asm Thu Mar 26 13:20:04 2015 +0530 @@ -376,6 +376,37 @@ db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 +c_ang32_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 + db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 + db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 + db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 + db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 + db 20, 12, 20, 12, 20, 12, 20, 12, 
20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 + db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 + db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 18, 14, 18, 14, 18, 14
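The byte pairs in c_ang32_mode_33 follow a regular pattern, which can be reproduced with a short sketch. It assumes the HEVC prediction angle for mode 33 is 26 (and -2 for mode 25, used in the next patch of this series); `angWeights` is a hypothetical helper, not an x265 function.

```cpp
#include <array>
#include <cstdint>

// For 1-based row `row` and prediction angle `angle`, the fractional position
// is f = (row * angle) & 31 and the table stores the pair (32 - f, f).
// pmaddubsw then computes ref[i] * (32 - f) + ref[i + 1] * f per output pixel.
std::array<uint8_t, 2> angWeights(int row, int angle)
{
    int f = (row * angle) & 31;
    return { static_cast<uint8_t>(32 - f), static_cast<uint8_t>(f) };
}
```

For example, rows 1 to 5 at angle 26 give (6, 26), (12, 20), (18, 14), (24, 8), (30, 2), matching the start of the table above.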
Re: [x265] [PATCH 2 of 3] asm: intra_pred_ang32_25 improved by ~53% over SSE4
Please ignore the duplicate (second) patch; it was sent by mistake.

Regards,
Praveen

On Fri, Mar 27, 2015 at 10:41 AM, prav...@multicorewareinc.com wrote:

# HG changeset patch
# User Praveen Tiwari prav...@multicorewareinc.com
# Date 142736 -19800
# Thu Mar 26 14:23:20 2015 +0530
# Branch stable
# Node ID 39c139322fde1f8c62545fd8bbed9cc8198e540c
# Parent 24bdb3e594556ca6e12ee9dae58100a6bd115d2a
asm: intra_pred_ang32_25 improved by ~53% over SSE4

AVX2: intra_ang_32x32[25] 23.11x 1293.83 29904.12
SSE4: intra_ang_32x32[25] 10.31x 2759.33 28451.26

diff -r 24bdb3e59455 -r 39c139322fde source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Thu Mar 26 13:20:04 2015 +0530
+++ b/source/common/x86/asm-primitives.cpp Thu Mar 26 14:23:20 2015 +0530
@@ -1643,6 +1643,7 @@
     p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2;
     p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2;
     p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2;
+    p.cu[BLOCK_32x32].intra_pred[25] = x265_intra_pred_ang32_25_avx2;
     // copy_sp primitives
     p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
diff -r 24bdb3e59455 -r 39c139322fde source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h Thu Mar 26 13:20:04 2015 +0530
+++ b/source/common/x86/intrapred.h Thu Mar 26 14:23:20 2015 +0530
@@ -213,6 +213,7 @@
 void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void
x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); diff -r 24bdb3e59455 -r 39c139322fde source/common/x86/intrapred8.asm --- a/source/common/x86/intrapred8.asm Thu Mar 26 13:20:04 2015 +0530 +++ b/source/common/x86/intrapred8.asm Thu Mar 26 14:23:20 2015 +0530 @@ -407,6 +407,26 @@ db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 +c_ang32_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 
18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + + ALIGN 32 ;; (blkSize - 1 - x) pw_planar4_0: dw 3, 2, 1
Re: [x265] [PATCH] asm: intra_pred_ang16_25
Please ignore, need to add performance data in commit message. Regards, Praveen On Thu, Mar 12, 2015 at 6:50 PM, prav...@multicorewareinc.com wrote: # HG changeset patch # User Praveen Tiwari prav...@multicorewareinc.com # Date 1426165765 -19800 # Node ID e4204ceeb011a009455cde620c346729d80ac822 # Parent d012e125bdb1299ba29b9c0680931e148981a42e asm: intra_pred_ang16_25 diff -r d012e125bdb1 -r e4204ceeb011 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Thu Mar 12 18:40:23 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Thu Mar 12 18:39:25 2015 +0530 @@ -1504,6 +1504,7 @@ p.cu[BLOCK_8x8].intra_pred[12] = x265_intra_pred_ang8_12_avx2; p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2; p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2; +p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2; // copy_sp primitives p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2; diff -r d012e125bdb1 -r e4204ceeb011 source/common/x86/intrapred.h --- a/source/common/x86/intrapred.h Thu Mar 12 18:40:23 2015 +0530 +++ b/source/common/x86/intrapred.h Thu Mar 12 18:39:25 2015 +0530 @@ -182,6 +182,7 @@ void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); diff -r d012e125bdb1 -r e4204ceeb011 source/common/x86/intrapred8.asm --- 
a/source/common/x86/intrapred8.asm Thu Mar 12 18:40:23 2015 +0530 +++ b/source/common/x86/intrapred8.asm Thu Mar 12 18:39:25 2015 +0530 @@ -113,6 +113,17 @@ db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 +ALIGN 32 +c_ang16_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 + db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 + db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 + db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 + db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 + db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 + db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 + +ALIGN 32 ;; (blkSize - 1 - x) pw_planar4_0: dw 3, 2, 1, 0, 3, 2, 1, 0 pw_planar4_1: dw 3, 3, 3, 3, 3, 3, 3, 3 @@ -10368,6 +10379,47 @@ movhps[r0 + r3], xm2 RET +%macro INTRA_PRED_ANG16_MC0 3 +pmaddubsw m3, m1, [r4 + %3 * mmsize] +pmulhrsw m3, m0 +pmaddubsw m4, m2, [r4 + %3 * mmsize] +pmulhrsw m4, m0 +packuswb m3, m4 +movu [%1], xm3 +vextracti128 xm4, m3, 1 +movu [%2], xm4 +%endmacro + +%macro INTRA_PRED_ANG16_25 1 +INTRA_PRED_ANG16_MC0 r0, r0 + r1, %1 +INTRA_PRED_ANG16_MC0 r0 + 2 * r1, r0 + r3, (%1 + 1) +%endmacro + +INIT_YMM avx2 +cglobal intra_pred_ang16_25, 3, 5, 5 +mova m0, [pw_1024] + 
+vbroadcasti128m1, [r2] +pshufbm1, [intra_pred_shuff_0_8] +vbroadcasti128m2, [r2 + 8] +pshufbm2, [intra_pred_shuff_0_8] + +lea r3, [3 * r1] +lea r4, [c_ang16_mode_25] + +INTRA_PRED_ANG16_25 0 + +lear0, [r0 + 4 * r1] +INTRA_PRED_ANG16_25 2 + +lear0, [r0 + 4 * r1] +INTRA_PRED_ANG16_25 4 + +lear0, [r0 + 4 * r1
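The INTRA_PRED_ANG16_MC0 macro above pairs pmaddubsw with a pmulhrsw against pw_1024. A scalar sketch of why that multiply acts as a rounded shift by 5, assuming the usual definition of pmulhrsw; the function names are illustrative:

```cpp
#include <cstdint>

// Scalar model of one pmulhrsw lane: (((a * b) >> 14) + 1) >> 1.
inline int16_t mulhrs(int16_t a, int16_t b)
{
    int32_t t = (static_cast<int32_t>(a) * static_cast<int32_t>(b)) >> 14;
    return static_cast<int16_t>((t + 1) >> 1);
}

// One interpolated pixel: the weighted sum produced by pmaddubsw, followed by
// pmulhrsw with 1024, which equals (sum + 16) >> 5 for these non-negative sums.
inline uint8_t interpPixel(uint8_t p0, uint8_t p1, int frac)
{
    int sum = p0 * (32 - frac) + p1 * frac; // at most 255 * 32 = 8160
    return static_cast<uint8_t>(mulhrs(static_cast<int16_t>(sum), 1024));
}
```

Because the two weights always add up to 32, the sum fits in a signed 16-bit lane and the result stays within 0..255 before packuswb.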
Re: [x265] [PATCH] asm-avx2: inra_pred, align const
Updated this patch on tip. Thanks, Praveen On Tue, Mar 10, 2015 at 10:53 AM, prav...@multicorewareinc.com wrote: # HG changeset patch # User Praveen Tiwari prav...@multicorewareinc.com # Date 1425964751 -19800 # Node ID f97dfb483647d573cbcab9a4f007ac2aa89c9066 # Parent 726fe4088f31710af174c18b1e26fdc759efb300 asm-avx2: inra_pred, align const diff -r 726fe4088f31 -r f97dfb483647 source/common/x86/intrapred8.asm --- a/source/common/x86/intrapred8.asm Mon Mar 09 19:21:25 2015 -0500 +++ b/source/common/x86/intrapred8.asm Tue Mar 10 10:49:11 2015 +0530 @@ -26,6 +26,8 @@ SECTION_RODATA 32 +intra_pred_shuff_0_8:times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 + pb_0_8times 8 db 0, 8 pb_unpackbw1 times 2 db 1, 8, 2, 8, 3, 8, 4, 8 pb_swap8: times 2 db 7, 6, 5, 4, 3, 2, 1, 0 @@ -83,7 +85,6 @@ c_ang8_7_20: db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 c_ang8_1_14: db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 c_ang8_27_8: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 -c_ang8_src1_9_1_9:db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 c_ang8_src2_10_2_10: db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9 c_ang8_src3_11_3_11: db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10 @@ -9968,7 +9969,7 @@ mova m3, [pw_1024] vbroadcasti128m0, [r2 + 17] -pshufbm1, m0, [c_ang8_src1_9_1_9] +pshufbm1, m0, [intra_pred_shuff_0_8] pshufbm2, m0, [c_ang8_src2_10_2_10] pshufbm4, m0, [c_ang8_src3_11_3_11] pshufbm0, [c_ang8_src3_11_4_12] @@ -10013,7 +10014,7 @@ mova m3, [pw_1024] vbroadcasti128m0, [r2 + 1] -pshufbm1, m0, [c_ang8_src1_9_1_9] +pshufbm1, m0, [intra_pred_shuff_0_8] pshufbm2, m0, [c_ang8_src2_10_2_10] pshufbm4, m0, 
[c_ang8_src3_11_3_11] pshufbm0, [c_ang8_src3_11_4_12] @@ -10045,12 +10046,11 @@ INIT_YMM avx2 -cglobal intra_pred_ang8_9, 3, 5, 6 +cglobal intra_pred_ang8_9, 3, 5, 5 mova m3, [pw_1024] vbroadcasti128m0, [r2 + 17] -movu m5, [c_ang8_src1_9_1_9] - -pshufbm0, m5 + +pshufbm0, [intra_pred_shuff_0_8] lea r4, [c_ang8_mode_27] pmaddubsw m1, m0, [r4] @@ -10089,12 +10089,11 @@ RET INIT_YMM avx2 -cglobal intra_pred_ang8_27, 3, 5, 6 +cglobal intra_pred_ang8_27, 3, 5, 5 mova m3, [pw_1024] vbroadcasti128m0, [r2 + 1] -movu m5, [c_ang8_src1_9_1_9] - -pshufbm0, m5 + +pshufbm0, [intra_pred_shuff_0_8] lea r4, [c_ang8_mode_27] pmaddubsw m1, m0, [r4] @@ -10123,12 +10122,11 @@ RET INIT_YMM avx2 -cglobal intra_pred_ang8_25, 3, 5, 6 +cglobal intra_pred_ang8_25, 3, 5, 5 mova m3, [pw_1024] vbroadcasti128m0, [r2] -mova m5, [c_ang8_src1_9_1_9] - -pshufbm0, m5 + +pshufbm0, [intra_pred_shuff_0_8] lea r4, [c_ang8_mode_25] pmaddubsw m1, m0, [r4] @@ -10162,7 +10160,7 @@ mova m3, [pw_1024] vbroadcasti128m0, [r2 + 17] -pshufbm1, m0, [c_ang8_src1_9_1_9] +pshufbm1, m0, [intra_pred_shuff_0_8] pshufbm2, m0, [c_ang8_src1_9_2_10] pshufbm4, m0, [c_ang8_src2_10_2_10] pshufbm0, [c_ang8_src2_10_3_11] @@ -10207,7 +10205,7 @@ mova m3, [pw_1024] vbroadcasti128m0, [r2 + 1] -pshufbm1, m0, [c_ang8_src1_9_1_9] +pshufbm1, m0, [intra_pred_shuff_0_8] pshufbm2, m0, [c_ang8_src1_9_2_10] pshufbm4, m0, [c_ang8_src2_10_2_10] pshufbm0, [c_ang8_src2_10_3_11] @@ -10242,7 +10240,7 @@ cglobal intra_pred_ang8_8, 3, 4, 6 mova m3, [pw_1024] vbroadcasti128m0, [r2 + 17] -movu m5, [c_ang8_src1_9_1_9] +mova m5, [intra_pred_shuff_0_8] pshufbm1, m0, m5 pshufbm2, m0, m5 @@ -10288,7 +10286,7 @@ cglobal intra_pred_ang8_28, 3, 4, 6 mova m3, [pw_1024] vbroadcasti128m0, [r2 + 1] -movu m5, [c_ang8_src1_9_1_9] +mova m5, [intra_pred_shuff_0_8] pshufbm1, m0, m5
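The shuffle constant introduced by the patch above can be modelled in scalar C++. This is only a sketch: pshufbLane and isAdjacentPairs are hypothetical helper names, and the model covers a single 128-bit lane (the patch repeats the 16-byte pattern with "times 2" so both lanes of a ymm register see the same mask). It shows that the mask 0, 1, 1, 2, ..., 7, 8 turns nine reference pixels into eight adjacent pairs, which is the layout pmaddubsw expects.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Lane = std::array<uint8_t, 16>;

// One 16-byte half of intra_pred_shuff_0_8, copied from the patch.
const Lane kShuff0_8 = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8};

// Scalar model of pshufb on one 128-bit lane: each output byte is selected
// from src by the low four bits of the mask byte, or zeroed if bit 7 is set.
Lane pshufbLane(const Lane& src, const Lane& mask)
{
    Lane out{};
    for (std::size_t i = 0; i < out.size(); i++)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    return out;
}

// Hypothetical check: bytes (2k, 2k + 1) of `paired` must hold reference
// pixels k and k + 1 for k = 0..7.
bool isAdjacentPairs(const Lane& ref, const Lane& paired)
{
    for (std::size_t k = 0; k < 8; k++)
        if (paired[2 * k] != ref[k] || paired[2 * k + 1] != ref[k + 1])
            return false;
    return true;
}
```

Because the constant now sits in a 32-byte-aligned read-only section, the functions can use it directly as an aligned memory operand instead of loading it with movu first.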
[x265] Fwd: [PATCH] asm: avx2 code for sad[32x32] for 8bpp
-- Forwarded message --
From: sumala...@multicorewareinc.com
Date: Wed, Mar 11, 2015 at 2:24 PM
Subject: [x265] [PATCH] asm: avx2 code for sad[32x32] for 8bpp
To: x265-devel@videolan.org

# HG changeset patch
# User Sumalatha Polureddy sumala...@multicorewareinc.com
# Date 1426064050 -19800
# Node ID 01bfd365bf5f5317874b5c0315736ca76176f3df
# Parent 800f8ecd1e7393756f4bb58e536497162dc32150
asm: avx2 code for sad[32x32] for 8bpp

SSE3 sad[32x32] 230.81x 745.76 172131.92
AVX2 sad[32x32] 330.38x 496.68 164091.02

Are you comparing debug-mode performance numbers? 230.81x??? I get

SSE3 sad[32x32] 31.96x 770.39 24623.33

on an i7-4770K CPU. Please check the issue.

diff -r 800f8ecd1e73 -r 01bfd365bf5f source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Tue Mar 10 10:49:11 2015 +0530
+++ b/source/common/x86/asm-primitives.cpp Wed Mar 11 14:24:10 2015 +0530
@@ -1442,6 +1442,8 @@
 p.pu[LUMA_8x16].satd = x265_pixel_satd_8x16_avx2;
 p.pu[LUMA_8x8].satd = x265_pixel_satd_8x8_avx2;

+p.pu[LUMA_32x32].sad = x265_pixel_sad_32x32_avx2;
+
 p.pu[LUMA_8x4].sad_x3 = x265_pixel_sad_x3_8x4_avx2;
 p.pu[LUMA_8x8].sad_x3 = x265_pixel_sad_x3_8x8_avx2;
 p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_avx2;
diff -r 800f8ecd1e73 -r 01bfd365bf5f source/common/x86/sad-a.asm
--- a/source/common/x86/sad-a.asm Tue Mar 10 10:49:11 2015 +0530
+++ b/source/common/x86/sad-a.asm Wed Mar 11 14:24:10 2015 +0530
@@ -3897,5 +3897,31 @@
     movq [r6 + 8], xm1
     RET

+INIT_YMM avx2
+cglobal pixel_sad_32x32, 4,4,5
+    xorps m0, m0
+%assign x 0
+%rep 16
+    movu m1, [r0]       ; row 0 of pix0
+    movu m2, [r2]       ; row 0 of pix1
+    movu m3, [r0 + r1]  ; row 1 of pix0
+    movu m4, [r2 + r3]  ; row 1 of pix1
+
+    psadbw m1, m2
+    psadbw m3, m4
+    paddd m0, m1
+    paddd m0, m3
+%assign x x+1
+ %if x < 16
+    lea r2, [r2 + 2 * r3]
+    lea r0, [r0 + 2 * r1]
+ %endif
+%endrep
+    vextracti128 xm1, m0, 1
+    paddd xm0, xm1
+    pshufd xm1, xm0, 2
+    paddd xm0, xm1
+    movd eax, xm0
+    RET
 %endif
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
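For comparison with the unrolled AVX2 loop in the patch above, here is a minimal scalar sketch of a 32x32 sum of absolute differences. The function name sad32x32 is an assumption for illustration, not the x265 C reference primitive; it keeps the same shape as the asm, 16 iterations that each consume two rows of both blocks.

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar sketch of a 32x32 SAD on 8-bit pixels. Strides are in bytes.
// Each outer iteration handles two rows, like one %rep body of the asm.
int sad32x32(const uint8_t* pix0, intptr_t stride0,
             const uint8_t* pix1, intptr_t stride1)
{
    int sum = 0;
    for (int y = 0; y < 32; y += 2)
    {
        const uint8_t* a0 = pix0 + y * stride0;        // row y of pix0
        const uint8_t* b0 = pix1 + y * stride1;        // row y of pix1
        const uint8_t* a1 = pix0 + (y + 1) * stride0;  // row y + 1 of pix0
        const uint8_t* b1 = pix1 + (y + 1) * stride1;  // row y + 1 of pix1
        for (int x = 0; x < 32; x++)
        {
            sum += std::abs(a0[x] - b0[x]);
            sum += std::abs(a1[x] - b1[x]);
        }
    }
    return sum;  // at most 255 * 1024, so a 32-bit int is sufficient
}
```

The worst-case result, 255 * 1024 = 261120, needs 18 bits, which is why the asm can keep its partial sums in 32-bit lanes (paddd) without overflow.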
[x265] Fwd: [PATCH] asm-avx2: intra_pred_ang8_11
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 2:33 AM
Subject: Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_11
To: Development for x265 x265-devel@videolan.org

It's right now, just one small problem: [trans8_shuf] is used only twice; buffering it in a register gives the same speed with a larger code size.

Do you mean that instead of

    mova   m0, [trans8_shuf]
    vpermd m1, m0, m1
    vpermd m4, m0, m4

we should use this?

    vpermd m1, [trans8_shuf], m1
    vpermd m4, [trans8_shuf], m4

Won't the compiler use two 'mova' instructions internally rather than just one? Can we depend on the compiler for this optimization here? Besides, the syntax of 'vpermd' does not allow this.

At 2015-03-10 13:58:50, prav...@multicorewareinc.com wrote:
> # HG changeset patch
> # User Praveen Tiwari prav...@multicorewareinc.com
> # Date 1425967049 -19800
> # Node ID 810995b991eba3f7dcd9014db3b58a6b07723be3
> # Parent f97dfb483647d573cbcab9a4f007ac2aa89c9066
> asm-avx2: intra_pred_ang8_11
>
> diff -r f97dfb483647 -r 810995b991eb source/common/x86/asm-primitives.cpp
> --- a/source/common/x86/asm-primitives.cpp Tue Mar 10 10:49:11 2015 +0530
> +++ b/source/common/x86/asm-primitives.cpp Tue Mar 10 11:27:29 2015 +0530
> @@ -1496,6 +1496,7 @@
>  p.cu[BLOCK_8x8].intra_pred[9] = x265_intra_pred_ang8_9_avx2;
>  p.cu[BLOCK_8x8].intra_pred[27] = x265_intra_pred_ang8_27_avx2;
>  p.cu[BLOCK_8x8].intra_pred[25] = x265_intra_pred_ang8_25_avx2;
> +p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
>
>  // copy_sp primitives
>  p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
> diff -r f97dfb483647 -r 810995b991eb source/common/x86/intrapred.h
> --- a/source/common/x86/intrapred.h Tue Mar 10 10:49:11 2015 +0530
> +++ b/source/common/x86/intrapred.h Tue Mar 10 11:27:29 2015 +0530
> @@ -179,6 +179,7 @@
>  void x265_intra_pred_ang8_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
>  void x265_intra_pred_ang8_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
>  void x265_intra_pred_ang8_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
> +void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
>  void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
>  void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
>  void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
> diff -r f97dfb483647 -r 810995b991eb source/common/x86/intrapred8.asm
> --- a/source/common/x86/intrapred8.asm Tue Mar 10 10:49:11 2015 +0530
> +++ b/source/common/x86/intrapred8.asm Tue Mar 10 11:27:29 2015 +0530
> @@ -10317,3 +10317,47 @@
>      movhps [r0 + 2 * r1], xm4
>      movhps [r0 + r3], xm2
>      RET
> +
> +INIT_YMM avx2
> +cglobal intra_pred_ang8_11, 3, 5, 5
> +    mova m3, [pw_1024]
> +    movu xm1, [r2 + 16]
> +    pinsrb xm1, [r2], 0
> +    pshufb xm1, [intra_pred_shuff_0_8]
> +    vinserti128 m0, m1, xm1, 1
> +
> +    lea r4, [c_ang8_mode_25]
> +    pmaddubsw m1, m0, [r4]
> +    pmulhrsw m1, m3
> +    pmaddubsw m2, m0, [r4 + mmsize]
> +    pmulhrsw m2, m3
> +    pmaddubsw m4, m0, [r4 + 2 * mmsize]
> +    pmulhrsw m4, m3
> +    pmaddubsw m0, [r4 + 3 * mmsize]
> +    pmulhrsw m0, m3
> +    packuswb m1, m2
> +    packuswb m4, m0
> +
> +    vperm2i128 m2, m1, m4, 0010b
> +    vperm2i128 m1, m1, m4, 00110001b
> +    punpcklbw m4, m2, m1
> +    punpckhbw m2, m1
> +    punpcklwd m1, m4, m2
> +    punpckhwd m4, m2
> +    mova m0, [trans8_shuf]
> +    vpermd m1, m0, m1
> +    vpermd m4, m0, m4
> +
> +    lea r3, [3 * r1]
> +    movq [r0], xm1
> +    movhps [r0 + r1], xm1
> +    vextracti128 xm2, m1, 1
> +    movq [r0 + 2 * r1], xm2
> +    movhps [r0 + r3], xm2
> +    lea r0, [r0 + 4 * r1]
> +    movq [r0], xm4
> +    movhps [r0 + r1], xm4
> +    vextracti128 xm2, m4, 1
> +    movq [r0 + 2 * r1], xm2
> +    movhps [r0 + r3], xm2
> +    RET
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
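To make the vpermd question in this thread concrete, here is a scalar sketch of what "vpermd dst, idx, src" computes on a ymm register. In the instruction's operand order the indices come from the first source, which has to be a register; only the data operand may come from memory. That matches the remark that the syntax does not allow [trans8_shuf] in the index position. The index pattern in the usage note is hypothetical and is not claimed to be the contents of trans8_shuf.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Dwords = std::array<uint32_t, 8>;

// Scalar model of "vpermd dst, idx, src" on a 256-bit register:
// output dword i is the source dword selected by the low three bits of idx[i].
Dwords vpermd(const Dwords& idx, const Dwords& src)
{
    Dwords dst{};
    for (std::size_t i = 0; i < dst.size(); i++)
        dst[i] = src[idx[i] & 7];
    return dst;
}
```

With a hypothetical index of {0, 4, 1, 5, 2, 6, 3, 7}, the dwords of the two 128-bit halves end up interleaved. Since the index must live in a register anyway, loading it once with mova and reusing it for both permutes, as the patch does, costs a single load.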
[x265] Fwd: [PATCH] asm: intra_pred_ang16_34
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 6:32 AM
Subject: Re: [x265] [PATCH] asm: intra_pred_ang16_34
To: Development for x265 x265-devel@videolan.org

same speed to old version

[Praveen] Compared with the SSSE3 version, this AVX2 version of the asm code eliminates the following instructions at the cost of one vextracti128 instruction. The impact may not be visible in TestBench, but it seems worth pushing.

    add   r2, 34
    cmp   r3m, byte 34
    cmove r2, r4
    movu  m1, [r2 + 16]

Regards,
Praveen
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
[x265] Fwd: [PATCH] asm: intra_pred_ang16_2
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 6:32 AM
Subject: Re: [x265] [PATCH] asm: intra_pred_ang16_2
To: Development for x265 x265-devel@videolan.org

same speed to old version

[Praveen] Compared with the SSSE3 version, this AVX2 version of the asm code eliminates the following instructions at the cost of one vextracti128 instruction. The impact may not be visible in TestBench, but it seems worth pushing.

    add   r2, 34
    cmp   r3m, byte 34
    cmove r2, r4
    movu  m1, [r2 + 16]

Regards,
Praveen
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
[x265] Fwd: [PATCH] asm: intra_pred_ang8_24 8bpp, improved 206.33c - 177.70c over SSE version
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Mar 11, 2015 at 6:09 AM
Subject: Re: [x265] [PATCH] asm: intra_pred_ang8_24 8bpp, improved 206.33c - 177.70c over SSE version
To: Development for x265 x265-devel@videolan.org

+c_ang8_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, \
+ 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, \

We'd better start every line with a new 'db'.

[Praveen] Please explain how that is better. What difference does it make: does it help performance, or is it just a matter of coding style?

Regards,
Praveen
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_25, (42.92x)
Updated the code with more optimizations.

Regards,
Praveen

On Sat, Mar 7, 2015 at 3:31 AM, chen chenm...@163.com wrote:
> right
>
> At 2015-03-06 14:16:23, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari prav...@multicorewareinc.com
> > # Date 1425622433 -19800
> > # Node ID b48efcbe1b196593d572dbbd4dd220f215f97321
> > # Parent fe9c058f216d4315ea995b09384aab2b1a28d1ec
> > asm-avx2: intra_pred_ang8_25, (42.92x)
> >
> > intra_ang_8x8[25] 42.92x 210.61 9039.28
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_11, (51.84x)
Updated the patch with more optimizations.

Regards,
Praveen

On Sat, Mar 7, 2015 at 3:40 AM, chen chenm...@163.com wrote:
> right
>
> At 2015-03-06 15:50:38, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari prav...@multicorewareinc.com
> > # Date 1425628229 -19800
> > # Node ID 25b01a20389e8e4297e004d500871263ca349d15
> > # Parent b48efcbe1b196593d572dbbd4dd220f215f97321
> > asm-avx2: intra_pred_ang8_11, (51.84x)
> >
> > intra_ang_8x8[11] 51.84x 295.15 15301.57
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
Re: [x265] [PATCH] asm-avx2: intra_pred_ang8_24, (40.05x)
Updated the patch as per suggestions. Regards, Praveen On Sat, Mar 7, 2015 at 3:57 AM, chen chenm...@163.com wrote: At 2015-03-06 17:24:05,prav...@multicorewareinc.com wrote: # HG changeset patch # User Praveen Tiwari prav...@multicorewareinc.com # Date 1425633836 -19800 # Node ID 2da3a6431f94e1dce3c6bc739e7c457f90b12369 # Parent 25b01a20389e8e4297e004d500871263ca349d15 asm-avx2: intra_pred_ang8_24, (40.05x) intra_ang_8x8[24] 40.05x 244.28 9782.73 diff -r 25b01a20389e -r 2da3a6431f94 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Fri Mar 06 13:20:29 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Fri Mar 06 14:53:56 2015 +0530 @@ -1514,6 +1514,7 @@ p.cu[BLOCK_8x8].intra_pred[27] = x265_intra_pred_ang8_27_avx2; p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2; p.cu[BLOCK_8x8].intra_pred[25] = x265_intra_pred_ang8_25_avx2; +p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2; // copy_sp primitives p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2; diff -r 25b01a20389e -r 2da3a6431f94 source/common/x86/intrapred.h --- a/source/common/x86/intrapred.h Fri Mar 06 13:20:29 2015 +0530 +++ b/source/common/x86/intrapred.h Fri Mar 06 14:53:56 2015 +0530 @@ -177,6 +177,7 @@ void x265_intra_pred_ang8_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_intra_pred_ang8_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); diff -r 25b01a20389e -r 
2da3a6431f94 source/common/x86/intrapred8.asm --- a/source/common/x86/intrapred8.asm Fri Mar 06 13:20:29 2015 +0530 +++ b/source/common/x86/intrapred8.asm Fri Mar 06 14:53:56 2015 +0530 @@ -105,6 +105,11 @@ 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, \ 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 +c_ang8_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, \ + 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, \ + 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, \ + 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 + ;; (blkSize - 1 - x) pw_planar4_0: dw 3, 2, 1, 0, 3, 2, 1, 0 pw_planar4_1: dw 3, 3, 3, 3, 3, 3, 3, 3 @@ -33145,3 +33150,41 @@ movhps[r0 + 2 * r1], xm4 movhps[r0 + r3], xm2 RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_24, 3, 5, 6 +mova m3, [pw_1024] +vbroadcasti128m0, [r2] +movu m5, [c_ang8_src1_9_1_9] unalgined? 
+ +pshufbm0, m5 + +lea r4, [c_ang8_mode_24] +pmaddubsw m1, m0, [r4] +pmulhrsw m1, m3 +pmaddubsw m2, m0, [r4 + mmsize] +pmulhrsw m2, m3 +pmaddubsw m4, m0, [r4 + 2 * mmsize] +pmulhrsw m4, m3 +pslldqxm0, 2 +pinsrbxm0, [r2 + 16 + 6], 0 +pinsrbxm0, [r2 + 0], 1 +vinserti128 m0, m0, xm0, 1 +pmaddubsw m0, [r4 + 3 * mmsize] +pmulhrsw m0, m3 +packuswb m1, m2 +packuswb m4, m0 + +lea r3, [3 * r1] +movq [r0], xm1 +vextracti128 xm2, m1, 1 +movq [r0 + r1], xm2 +movhps[r0 + 2 * r1], xm1 +movhps[r0 + r3], xm2 +lea r0, [r0 + 4 * r1] +movq [r0], xm4 +vextracti128 xm2, m4, 1 +movq [r0 + r1], xm2 +movhps[r0 + 2 * r1], xm4 +movhps[r0 + r3], xm2 +RET ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel
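The pmaddubsw / pmulhrsw pairing used throughout these intra prediction patches can be checked with a small scalar model. This is a sketch under stated assumptions: the helper names are hypothetical, and the two-tap formula is the usual ((32 - f) * p0 + f * p1 + 16) >> 5 form of angular interpolation, which fits the weight pairs in c_ang8_mode_24 since each pair sums to 32. It shows that multiplying by pw_1024 with pmulhrsw performs exactly the "+ 16, then >> 5" rounding step.

```cpp
#include <cstdint>

// Scalar model of pmulhrsw on one 16-bit lane: (((a * b) >> 14) + 1) >> 1.
// The products in this example are non-negative, so the right shifts do not
// depend on how the compiler shifts negative values.
int16_t pmulhrsw(int16_t a, int16_t b)
{
    int32_t product = static_cast<int32_t>(a) * b;
    return static_cast<int16_t>(((product >> 14) + 1) >> 1);
}

// Direct form of the two-tap interpolation with a 5-bit fraction (0..31).
int interpolateDirect(uint8_t p0, uint8_t p1, int fraction)
{
    return ((32 - fraction) * p0 + fraction * p1 + 16) >> 5;
}

// The route the asm takes: a pmaddubsw-style weighted sum of the pixel pair,
// then pmulhrsw with 1024 to round and shift.
int interpolateViaPmulhrsw(uint8_t p0, uint8_t p1, int fraction)
{
    // At most 32 * 255 = 8160, so the sum fits easily in int16_t.
    int16_t weighted = static_cast<int16_t>((32 - fraction) * p0 + fraction * p1);
    return pmulhrsw(weighted, 1024);
}
```

The identity holds because (w * 1024 + 2^14) >> 15 equals (w + 16) >> 5 for any non-negative w, so a single pmulhrsw replaces an add and a shift.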
[x265] Fwd: Fwd: [PATCH Review Only] asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code
-- Forwarded message -- From: chen chenm...@163.com Date: Thu, Feb 26, 2015 at 3:15 PM Subject: Re: [x265] Fwd: [PATCH Review Only] asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code To: Development for x265 x265-devel@videolan.org At 2015-02-26 14:24:54,Praveen Tiwari prav...@multicorewareinc.com wrote: -- Forwarded message -- From: chen chenm...@163.com Date: Wed, Feb 25, 2015 at 7:38 PM Subject: Re: [x265] [PATCH Review Only] asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code To: Development for x265 x265-devel@videolan.org At 2015-02-25 16:52:00,prav...@multicorewareinc.com wrote: # HG changeset patch # User Praveen Tiwari prav...@multicorewareinc.com # Date 1424854196 -19800 # Node ID 177fe9372668b4824c291e967349664766688179 # Parent 02bac78bde961d60d180e59b5260fad93b98d9b4 asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code intra_ang_8x8[33] 10.56x 185.43 1957.47 diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Wed Feb 25 13:46:58 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Wed Feb 25 14:19:56 2015 +0530 @@ -1813,6 +1813,7 @@ // intra_pred functions p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2; +p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2; } } #endif // if HIGH_BIT_DEPTH diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred.h --- a/source/common/x86/intrapred.hWed Feb 25 13:46:58 2015 +0530 +++ b/source/common/x86/intrapred.hWed Feb 25 14:19:56 2015 +0530 @@ -158,6 +158,7 @@ #undef DECL_ANG void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel 
*filtPix, int bLuma); void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred8.asm --- a/source/common/x86/intrapred8.asm Wed Feb 25 13:46:58 2015 +0530 +++ b/source/common/x86/intrapred8.asm Wed Feb 25 14:19:56 2015 +0530 @@ -32087,3 +32087,39 @@ movq [r0 + 2 * r1], xm2 movhps[r0 + r3], xm2 RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_33, 3,4,5 +movu m3, [pw_1024] Why constant are unaligned? [Praveen] Seems alignment issue here, mova crashing on avx2 machine. [MC] it is global constant, we may use ALIGN32 before pw_1024 to avoid crash and get more performance [Praveen] why it needs special care ? why not other constants needs ALIGN32. +vbroadcasti128m0, [r2 + 1] it is Exception Type 6, please check and confirm it compatible with unaligned address [Praveen] Sadly most of documents don't talk about alignment regarding this instruction including Intel® Architecture Instruction Set Extensions Programming Reference but I verified with encoder seems it works fine with unaligned address too. [MC] ok, if you try to assign unaligned address (manual in debug mode) and it work fine, we may ignore it. + +pshufbm1, m0, [c_ang8_src1_9_2_10] +pshufbm2, m0, [c_ang8_src3_11_4_12] +pshufbm4, m0, [c_ang8_src5_13_5_13] +pshufbm4, m0, [c_ang8_src5_13_5_13] Why duplicated? [Praveen] Yeah, duplicate code here, has been fixed locally. 
+pshufbm0, [c_ang8_src6_14_7_15] + +pmaddubsw m1, [c_ang8_26_20] +pmulhrsw m1, m3 +pmaddubsw m2, [c_ang8_14_8] +pmulhrsw m2, m3 +pmaddubsw m4, [c_ang8_2_28] +pmulhrsw m4, m3 +pmaddubsw m0, [c_ang8_22_16] +pmulhrsw m0, m3 +packuswb m1, m2 +packuswb m4, m0 + +lea r3, [3 * r1] +movq [r0], xm1 +vextracti128 xm2, m1, 1 +movq [r0 + r1], xm2 +movhps[r0 + 2 * r1], xm1 +movhps[r0 + r3], xm2 +lea r0, [r0 + 4 * r1] +movq [r0], xm4 +vextracti128 xm2, m4, 1 +movq [r0 + r1], xm2 +movhps[r0 + 2 * r1], xm4 +movhps[r0 + r3], xm2 +RET ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265
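The alignment question in this review (mova crashing on pw_1024, and whether an address such as r2 + 1 is acceptable) comes down to whether an address is a multiple of the load width. The following is a minimal C++ sketch of that condition; kPw1024 and isAligned are hypothetical names, and alignas(32) plays the role that an "ALIGN 32" directive in front of the constant would play in the asm.

```cpp
#include <cstddef>
#include <cstdint>

// A 32-byte constant forced onto a 32-byte boundary, so that a full-width
// aligned load of it is always legal.
alignas(32) const int16_t kPw1024[16] = {
    1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,
    1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024
};

// True when p is a multiple of `alignment` bytes. Aligned loads such as mova
// need this to hold; unaligned loads such as movu do not.
bool isAligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

An address like r2 + 1 can never satisfy this for a 16-byte or 32-byte load when r2 itself is aligned, which is why the reviewer asked for confirmation that vbroadcasti128 tolerates unaligned memory operands.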
[x265] Fwd: [PATCH Review Only] asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code
-- Forwarded message -- From: chen chenm...@163.com Date: Wed, Feb 25, 2015 at 7:38 PM Subject: Re: [x265] [PATCH Review Only] asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code To: Development for x265 x265-devel@videolan.org At 2015-02-25 16:52:00,prav...@multicorewareinc.com wrote: # HG changeset patch # User Praveen Tiwari prav...@multicorewareinc.com # Date 1424854196 -19800 # Node ID 177fe9372668b4824c291e967349664766688179 # Parent 02bac78bde961d60d180e59b5260fad93b98d9b4 asm-avx2: intra_pred_ang8_33, improved 265.79c - 185.43c over sse4 asm code intra_ang_8x8[33] 10.56x 185.43 1957.47 diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Wed Feb 25 13:46:58 2015 +0530 +++ b/source/common/x86/asm-primitives.cpp Wed Feb 25 14:19:56 2015 +0530 @@ -1813,6 +1813,7 @@ // intra_pred functions p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2; +p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2; } } #endif // if HIGH_BIT_DEPTH diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred.h --- a/source/common/x86/intrapred.hWed Feb 25 13:46:58 2015 +0530 +++ b/source/common/x86/intrapred.hWed Feb 25 14:19:56 2015 +0530 @@ -158,6 +158,7 @@ #undef DECL_ANG void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); +void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter); void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); diff -r 02bac78bde96 -r 177fe9372668 source/common/x86/intrapred8.asm --- a/source/common/x86/intrapred8.asm Wed Feb 25 13:46:58 2015 +0530 +++ b/source/common/x86/intrapred8.asm Wed Feb 25 14:19:56 2015 +0530 @@ 
-32087,3 +32087,39 @@ movq [r0 + 2 * r1], xm2 movhps[r0 + r3], xm2 RET + +INIT_YMM avx2 +cglobal intra_pred_ang8_33, 3,4,5 +movu m3, [pw_1024] Why constant are unaligned? [Praveen] Seems alignment issue here, mova crashing on avx2 machine. +vbroadcasti128m0, [r2 + 1] it is Exception Type 6, please check and confirm it compatible with unaligned address [Praveen] Sadly most of documents don't talk about alignment regarding this instruction including Intel® Architecture Instruction Set Extensions Programming Reference but I verified with encoder seems it works fine with unaligned address too. + +pshufbm1, m0, [c_ang8_src1_9_2_10] +pshufbm2, m0, [c_ang8_src3_11_4_12] +pshufbm4, m0, [c_ang8_src5_13_5_13] +pshufbm4, m0, [c_ang8_src5_13_5_13] Why duplicated? [Praveen] Yeah, duplicate code here, has been fixed locally. +pshufbm0, [c_ang8_src6_14_7_15] + +pmaddubsw m1, [c_ang8_26_20] +pmulhrsw m1, m3 +pmaddubsw m2, [c_ang8_14_8] +pmulhrsw m2, m3 +pmaddubsw m4, [c_ang8_2_28] +pmulhrsw m4, m3 +pmaddubsw m0, [c_ang8_22_16] +pmulhrsw m0, m3 +packuswb m1, m2 +packuswb m4, m0 + +lea r3, [3 * r1] +movq [r0], xm1 +vextracti128 xm2, m1, 1 +movq [r0 + r1], xm2 +movhps[r0 + 2 * r1], xm1 +movhps[r0 + r3], xm2 +lea r0, [r0 + 4 * r1] +movq [r0], xm4 +vextracti128 xm2, m4, 1 +movq [r0 + r1], xm2 +movhps[r0 + 2 * r1], xm4 +movhps[r0 + r3], xm2 +RET ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel ___ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel
[x265] Fwd: [PATCH] blockcopy_pp_12x32: SSE2 asm code optimization
-- Forwarded message --
From: chen chenm...@163.com
Date: Thu, Feb 5, 2015 at 5:55 PM
Subject: Re: [x265] [PATCH] blockcopy_pp_12x32: SSE2 asm code optimization
To: Development for x265 x265-devel@videolan.org

This code is right, but could you try using general register moves (rN, rNd) in x64 mode?

I applied your idea of using general registers as the buffer in x64 mode to 4x8 (easy to test with), but surprisingly the version using SIMD registers is faster. Here are the code and the performance numbers:

copy_pp[ 4x8] 2.67x 139.98 374.18 [using general register move (rN, rNd)]
copy_pp[ 4x8] 3.34x 109.60 366.35 [SIMD registers as buffer]

Code [using general register move (rN, rNd)]:

;-----------------------------------------------------------------------------
; void blockcopy_pp_4x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
;-----------------------------------------------------------------------------
INIT_XMM sse2
cglobal blockcopy_pp_4x8, 4, 10, 0
    lea r4, [3 * r1]
    lea r5, [3 * r3]
    mov r6d, [r2]
    mov r7d, [r2 + r3]
    mov r8d, [r2 + 2 * r3]
    mov r9d, [r2 + r5]
    mov [r0], r6d
    mov [r0 + r1], r7d
    mov [r0 + 2 * r1], r8d
    mov [r0 + r4], r9d
    lea r2, [r2 + 4 * r3]
    mov r6d, [r2]
    mov r7d, [r2 + r3]
    mov r8d, [r2 + 2 * r3]
    mov r9d, [r2 + r5]
    lea r0, [r0 + 4 * r1]
    mov [r0], r6d
    mov [r0 + r1], r7d
    mov [r0 + 2 * r1], r8d
    mov [r0 + r4], r9d
    RET

Code [SIMD registers as buffer]:

INIT_XMM sse2
cglobal blockcopy_pp_4x8, 4, 6, 4
    lea r4, [3 * r1]
    lea r5, [3 * r3]
    movd m0, [r2]
    movd m1, [r2 + r3]
    movd m2, [r2 + 2 * r3]
    movd m3, [r2 + r5]
    movd [r0], m0
    movd [r0 + r1], m1
    movd [r0 + 2 * r1], m2
    movd [r0 + r4], m3
    lea r2, [r2 + 4 * r3]
    movd m0, [r2]
    movd m1, [r2 + r3]
    movd m2, [r2 + 2 * r3]
    movd m3, [r2 + r5]
    lea r0, [r0 + 4 * r1]
    movd [r0], m0
    movd [r0 + r1], m1
    movd [r0 + 2 * r1], m2
    movd [r0 + r4], m3
    RET
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
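Both asm variants above do the same thing: move one 4-byte row at a time through a 32-bit temporary. A minimal scalar sketch, with the hypothetical name blockcopy4x8 and 8-bit pixels assumed, looks like this:

```cpp
#include <cstdint>
#include <cstring>

// Scalar sketch of a 4x8 block copy for 8-bit pixels. Strides are in bytes.
void blockcopy4x8(uint8_t* dst, intptr_t dstStride,
                  const uint8_t* src, intptr_t srcStride)
{
    for (int y = 0; y < 8; y++)
    {
        // `row` plays the role of r6d..r9d in the general-register variant,
        // or of the low dword of m0..m3 in the SIMD variant.
        uint32_t row;
        std::memcpy(&row, src + y * srcStride, sizeof(row));
        std::memcpy(dst + y * dstStride, &row, sizeof(row));
    }
}
```

Since the data movement is identical, the difference measured in the thread comes from how the two register files are used, not from the amount of memory traffic. Note that the general-register variant declares ten registers to cglobal (4, 10, 0) while the SIMD variant declares six.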
Re: [x265] [PATCH] blockfill_s_8x8 sse2 asm code optimization
Sent updated patch. Thanks.

Regards,
Praveen

On Mon, Feb 2, 2015 at 4:39 PM, chen chenm...@163.com wrote:
> At 2015-02-02 16:55:16, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari
> > # Date 1422867249 -19800
> > # Branch stable
> > # Node ID 2618352a21d5917ee8c1f79bcc159e858dd19daa
> > # Parent e2c958ff874e2bf8992ba22605e993530e8a2d8c
> > blockfill_s_8x8 sse2 asm code optimization
> >
> > improved, 100.04c - 90.05c
> >
> > diff -r e2c958ff874e -r 2618352a21d5 source/common/x86/blockcopy8.asm
> > --- a/source/common/x86/blockcopy8.asm Sat Jan 31 13:48:34 2015 -0600
> > +++ b/source/common/x86/blockcopy8.asm Mon Feb 02 14:24:09 2015 +0530
> > @@ -1748,9 +1748,10 @@
> >  ; void blockfill_s_8x8(int16_t* dst, intptr_t dstride, int16_t val)
> >  ;-----------------------------------------------------------------------------
> >  INIT_XMM sse2
> > -cglobal blockfill_s_8x8, 3, 3, 1, dst, dstStride, val
> > +cglobal blockfill_s_8x8, 3, 4, 1, dst, dstStride, val
> >      add r1, r1
> > +    lea r3, [3 * r1]
> >      movd m0, r2d
> >      pshuflw m0, m0, 0
> > @@ -1760,17 +1761,13 @@
> >      movu [r0 + r1], m0
> >      movu [r0 + 2 * r1], m0
> > -    lea r0, [r0 + 2 * r1]
> > +    movu [r0 + r3], m0
> > +    movu [r0 + 4 * r1], m0
> > +
> > +    lea r0, [r0 + 4 * r1]
>
> Swap the LEA and the movu above it; you will get fewer bytes of binary code.
>
> >      movu [r0 + r1], m0
> >      movu [r0 + 2 * r1], m0
> > -
> > -    lea r0, [r0 + 2 * r1]
> > -    movu [r0 + r1], m0
> > -    movu [r0 + 2 * r1], m0
> > -
> > -    lea r0, [r0 + 2 * r1]
> > -    movu [r0 + r1], m0
> > -
> > +    movu [r0 + r3], m0
> >      RET
> >  ;-----------------------------------------------------------------------------
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
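The restructuring in this patch, precomputing three times the stride so that four rows can be addressed from one base pointer, can be sketched in scalar C++. The names fillRow and blockfillS8x8 are hypothetical. The stride here is in int16_t elements, whereas the asm works in bytes and therefore doubles it first with "add r1, r1".

```cpp
#include <cstdint>

// Stores one row of eight int16_t values, the work of one 16-byte movu.
static void fillRow(int16_t* row, int16_t val)
{
    for (int x = 0; x < 8; x++)
        row[x] = val;
}

// Scalar sketch of an 8x8 fill, organised like the optimized asm:
// two groups of four rows, each group addressed from a single base.
void blockfillS8x8(int16_t* dst, intptr_t dstStride, int16_t val)
{
    const intptr_t stride3 = 3 * dstStride;  // counterpart of "lea r3, [3 * r1]"
    for (int group = 0; group < 2; group++)
    {
        // The asm advances r0 by 4 * r1 once between the two groups.
        int16_t* base = dst + group * 4 * dstStride;
        fillRow(base, val);
        fillRow(base + dstStride, val);
        fillRow(base + 2 * dstStride, val);
        fillRow(base + stride3, val);
    }
}
```

The old code advanced the pointer after every two rows (three lea instructions); with the extra 3 * stride register only one advance is needed.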
Re: [x265] [PATCH] add testbench for psyCost_ss and asm for psyCost_ss_4x4: improve 1989c-515c
If it fails only for 64x64, then it is definitely a range issue in the final accumulation of the sums of all the SAD calculations. It is more obvious with 64x64 because there are more accumulations. An algorithm issue would have shown up in the other partitions as well.

Regards,
Praveen

On Fri, Jan 9, 2015 at 4:05 PM, Steve Borho st...@borho.org wrote:
> On 01/09, Divya Manivannan wrote:
> > # HG changeset patch
> > # User Divya Manivannan di...@multicorewareinc.com
> > # Date 1420790181 -19800
> > # Fri Jan 09 13:26:21 2015 +0530
> > # Node ID 0f4b677cea64254d0b8f77ccc84c785bf832698d
> > # Parent c99e1a309bd1690be9a0a407050d97d95ccab05a
> > add testbench for psyCost_ss and asm for psyCost_ss_4x4: improve 1989c-515c
>
> I get an error with a 10bit build:
>
> steve@zeppelin ./test/TestBench
> Using random seed 54AFAEC9 16bpp
> Testing primitives: SSE2
> Testing primitives: SSE3
> Testing primitives: SSSE3
> Testing primitives: SSE4
> psy_cost_ss[64x64] failed!
>
> > diff -r c99e1a309bd1 -r 0f4b677cea64 source/common/x86/asm-primitives.cpp
> > --- a/source/common/x86/asm-primitives.cpp Fri Jan 09 13:09:39 2015 +0530
> > +++ b/source/common/x86/asm-primitives.cpp Fri Jan 09 13:26:21 2015 +0530
> > @@ -1430,6 +1430,7 @@
> >  p.psy_cost_pp[BLOCK_32x32] = x265_psyCost_pp_32x32_sse4;
> >  p.psy_cost_pp[BLOCK_64x64] = x265_psyCost_pp_64x64_sse4;
> >  #endif
> > +p.psy_cost_ss[BLOCK_4x4] = x265_psyCost_ss_4x4_sse4;
> >  }
> >  if (cpuMask & X265_CPU_XOP)
> >  {
> > @@ -1716,6 +1717,7 @@
> >  p.psy_cost_pp[BLOCK_32x32] = x265_psyCost_pp_32x32_sse4;
> >  p.psy_cost_pp[BLOCK_64x64] = x265_psyCost_pp_64x64_sse4;
> >  #endif
> > +p.psy_cost_ss[BLOCK_4x4] = x265_psyCost_ss_4x4_sse4;
> >  }
> >  if (cpuMask & X265_CPU_AVX)
> >  {
> > diff -r c99e1a309bd1 -r 0f4b677cea64 source/common/x86/pixel-a.asm
> > --- a/source/common/x86/pixel-a.asm Fri Jan 09 13:09:39 2015 +0530
> > +++ b/source/common/x86/pixel-a.asm Fri Jan 09 13:26:21 2015 +0530
> > @@ -7569,3 +7569,157 @@
> >  RET
> >  %endif ; HIGH_BIT_DEPTH
> >  %endif
> > +
> > +;-----------------------------------------------------------------------------
> > +;int psyCost_ss(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride)
> > +;-----------------------------------------------------------------------------
> > +INIT_XMM sse4
> > +cglobal psyCost_ss_4x4, 4, 5, 8
> > +
+add r1, r1 +lea r4, [3 * r1] +movddup m0, [r0] +movddup m1, [r0 + r1] +movddup m2, [r0 + r1 * 2] +movddup m3, [r0 + r4] + +pabsw m4, m0 +pabsw m5, m1 +paddw m5, m4 +pabsw m4, m2 +paddw m5, m4 +pabsw m4, m3 +paddw m5, m4 +pmaddwd m5, [pw_1] +psrldq m4, m5, 4 +paddd m5, m4 +psrld m6, m5, 2 + +movam4, [hmul_8w] +pmaddwd m0, m4 +pmaddwd m1, m4 +pmaddwd m2, m4 +pmaddwd m3, m4 + +psrldq m4, m0, 4 +psubd m5, m0, m4 +paddd m0, m4 +shufps m0, m5, 10001000b + +psrldq m4, m1, 4 +psubd m5, m1, m4 +paddd m1, m4 +shufps m1, m5, 10001000b + +psrldq m4, m2, 4 +psubd m5, m2, m4 +paddd m2, m4 +shufps m2, m5, 10001000b + +psrldq m4, m3, 4 +psubd m5, m3, m4 +paddd m3, m4 +shufps m3, m5, 10001000b + +movam4, m0 +paddd m0, m1 +psubd m1, m4 +movam4, m2 +paddd m2, m3 +psubd m3, m4 +movam4, m0 +paddd m0, m2 +psubd m2, m4 +movam4, m1 +paddd m1, m3 +psubd m3, m4 + +pabsd m0, m0 +pabsd m2, m2 +pabsd m1, m1 +pabsd m3, m3 +paddd m0, m2 +paddd m1, m3 +paddd m0, m1 +movhlps m1, m0 +paddd m0, m1 +psrldq m1, m0, 4 +paddd m0, m1 +psrld m0, 1 +psubd m7, m0, m6 + +add r3, r3 +lea r4, [3 * r3] +movddup m0, [r2] +movddup m1, [r2 + r3] +movddup m2, [r2 + r3 * 2] +movddup m3, [r2 + r4] + +pabsw m4, m0 +pabsw m5, m1 +paddw m5, m4 +pabsw m4, m2 +paddw m5, m4 +pabsw m4, m3 +paddw m5, m4 +pmaddwd m5, [pw_1] +psrldq m4, m5, 4 +
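The range argument made in the reply above, that larger partitions accumulate more terms and so need a wider accumulator, can be illustrated with a short sketch. This is not a diagnosis of the reported psy_cost_ss[64x64] failure. The function names are hypothetical, and the model simply sums size * size magnitudes of bitDepth bits each.

```cpp
#include <cstddef>
#include <cstdint>

// Bits needed to hold the sum of (2^log2Size)^2 magnitudes of `bitDepth`
// bits each, without overflow.
int bitsForBlockSum(int log2Size, int bitDepth)
{
    return bitDepth + 2 * log2Size;
}

// Accumulates in a 16-bit lane, as paddw does: the sum wraps modulo 65536.
uint16_t accumulate16(const uint16_t* values, std::size_t count)
{
    uint16_t acc = 0;
    for (std::size_t i = 0; i < count; i++)
        acc = static_cast<uint16_t>(acc + values[i]);
    return acc;
}

// Accumulates in a 32-bit lane, as paddd does.
uint32_t accumulate32(const uint16_t* values, std::size_t count)
{
    uint32_t acc = 0;
    for (std::size_t i = 0; i < count; i++)
        acc += values[i];
    return acc;
}
```

For 10-bit input, a 4x4 block needs at most 14 bits and is safe in 16-bit lanes, while a 64x64 block needs 22 bits, so any 16-bit step in the accumulation chain can wrap for the largest partition while leaving the small ones correct.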
Re: [x265] [PATCH] asm: luma_vpp[16x32, 16x64] in avx2: improve 3875c-2488c, 7499c-4915c
A table named tab_LumaCoeffVer_32 is already defined in this file; redefining it here will cause a build error. Please verify and update the patch.

On Thu, Nov 20, 2014 at 2:49 PM, Divya Manivannan di...@multicorewareinc.com wrote:
> # HG changeset patch
> # User Divya Manivannan di...@multicorewareinc.com
> # Date 1416475133 -19800
> #      Thu Nov 20 14:48:53 2014 +0530
> # Node ID 49c99a85531358e1b0624edd8082b6945d4e187e
> # Parent  3649fabf90d348c51d7e155989d1bf629ec27f6e
> asm: luma_vpp[16x32, 16x64] in avx2: improve 3875c-2488c, 7499c-4915c
>
> diff -r 3649fabf90d3 -r 49c99a855313 source/common/x86/asm-primitives.cpp
> --- a/source/common/x86/asm-primitives.cpp Thu Nov 20 14:27:53 2014 +0530
> +++ b/source/common/x86/asm-primitives.cpp Thu Nov 20 14:48:53 2014 +0530
> @@ -1798,6 +1798,8 @@
>      p.transpose[BLOCK_16x16] = x265_transpose16_avx2;
>      p.transpose[BLOCK_32x32] = x265_transpose32_avx2;
>      p.transpose[BLOCK_64x64] = x265_transpose64_avx2;
> +    p.luma_vpp[LUMA_16x32] = x265_interp_8tap_vert_pp_16x32_avx2;
> +    p.luma_vpp[LUMA_16x64] = x265_interp_8tap_vert_pp_16x64_avx2;
>  #endif
>      p.luma_hpp[LUMA_4x4] = x265_interp_8tap_horiz_pp_4x4_avx2;
>      p.luma_vpp[LUMA_4x4] = x265_interp_8tap_vert_pp_4x4_avx2;
> diff -r 3649fabf90d3 -r 49c99a855313 source/common/x86/ipfilter8.asm
> --- a/source/common/x86/ipfilter8.asm Thu Nov 20 14:27:53 2014 +0530
> +++ b/source/common/x86/ipfilter8.asm Thu Nov 20 14:48:53 2014 +0530
> @@ -122,6 +122,27 @@
>                      times 8 db 58, -10
>                      times 8 db 4, -1
>
> +ALIGN 32
> +tab_LumaCoeffVer_32: times 16 db 0, 0
> +                     times 16 db 0, 64
> +                     times 16 db 0, 0
> +                     times 16 db 0, 0
> +
> +                     times 16 db -1, 4
> +                     times 16 db -10, 58
> +                     times 16 db 17, -5
> +                     times 16 db 1, 0
> +
> +                     times 16 db -1, 4
> +                     times 16 db -11, 40
> +                     times 16 db 40, -11
> +                     times 16 db 4, -1
> +
> +                     times 16 db 0, 1
> +                     times 16 db -5, 17
> +                     times 16 db 58, -10
> +                     times 16 db 4, -1
> +
>  tab_c_64_n64: times 8 db 64, -64
>
>  const interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15
> @@ -3755,6 +3776,312 @@
>  ;-
>  FILTER_VER_LUMA_12xN 12, 16, ps
>
> +%macro FILTER_VER_LUMA_AVX2_16xN 2
> +INIT_YMM avx2
> +%if ARCH_X86_64 == 1
> +cglobal interp_8tap_vert_pp_%1x%2, 4, 7, 15
> +    mov r4d, r4m
> +    shl r4d, 7
> +
> +%ifdef PIC
> +    lea r5, [tab_LumaCoeffVer_32]
> +    add r5, r4
> +%else
> +    lea r5, [tab_LumaCoeffVer_32 + r4]
> +%endif
> +
> +    lea r4, [r1 * 3]
> +    sub r0, r4
> +    lea r6, [r1 * 4]
> +    mova m14, [pw_512]
> +    mov word [rsp], %2 / 16
> +
> +.loop:
> +    movu xm0, [r0]                          ; m0 = row 0
> +    movu xm1, [r0 + r1]                     ; m1 = row 1
> +    punpckhbw xm2, xm0, xm1
> +    punpcklbw xm0, xm1
> +    vinserti128 m0, m0, xm2, 1
> +    pmaddubsw m0, [r5]
> +    movu xm2, [r0 + r1 * 2]                 ; m2 = row 2
> +    punpckhbw xm3, xm1, xm2
> +    punpcklbw xm1, xm2
> +    vinserti128 m1, m1, xm3, 1
> +    pmaddubsw m1, [r5]
> +    movu xm3, [r0 + r4]                     ; m3 = row 3
> +    punpckhbw xm4, xm2, xm3
> +    punpcklbw xm2, xm3
> +    vinserti128 m2, m2, xm4, 1
> +    pmaddubsw m4, m2, [r5 + 1 * mmsize]
> +    paddw m0, m4
> +    pmaddubsw m2, [r5]
> +    lea r0, [r0 + r1 * 4]
> +    movu xm4, [r0]                          ; m4 = row 4
> +    punpckhbw xm5, xm3, xm4
> +    punpcklbw xm3, xm4
> +    vinserti128 m3, m3, xm5, 1
> +    pmaddubsw m5, m3, [r5 + 1 * mmsize]
> +    paddw m1, m5
> +    pmaddubsw m3, [r5]
> +    movu xm5, [r0 + r1]                     ; m5 = row 5
> +    punpckhbw xm6, xm4, xm5
> +    punpcklbw xm4, xm5
> +    vinserti128 m4, m4, xm6, 1
> +    pmaddubsw m6, m4, [r5 + 2 * mmsize]
> +    paddw m0, m6
> +    pmaddubsw m6, m4, [r5 + 1 * mmsize]
> +    paddw m2, m6
> +    pmaddubsw m4, [r5]
> +    movu xm6, [r0 + r1 * 2]                 ; m6 = row 6
> +    punpckhbw xm7, xm5, xm6
> +    punpcklbw xm5, xm6
> +    vinserti128 m5, m5, xm7, 1
> +    pmaddubsw m7, m5, [r5 + 2 * mmsize]
> +    paddw m1, m7
> +    pmaddubsw m7, m5, [r5 + 1 * mmsize]
> +    paddw m3, m7
> +    pmaddubsw m5, [r5]
> +    movu xm7, [r0 + r4]                     ; m7 = row 7
> +    punpckhbw xm8, xm6, xm7
> +    punpcklbw
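For readers following the patch above, here is a minimal scalar sketch of what a vertical 8-tap luma filter of this kind computes. It is not x265's own C primitive: the name interpVertPP and its parameter list are hypothetical, and the rounding step assumes that pw_512 is used with pmulhrsw, which the truncated listing does not show. The tap values are the ones that appear pairwise in tab_LumaCoeffVer_32.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 8-tap luma filter taps, one row per fractional position. The same values
// appear as byte pairs in tab_LumaCoeffVer_32 (so pmaddubsw can consume two
// source rows at a time).
static const int8_t kLumaTaps[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

// Hypothetical scalar reference: filters a width x height block of 8-bit
// pixels vertically. src must have 3 valid rows above and 4 below the block.
void interpVertPP(const uint8_t* src, ptrdiff_t srcStride,
                  uint8_t* dst, ptrdiff_t dstStride,
                  int width, int height, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 4);
    const int8_t* c = kLumaTaps[coeffIdx];

    src -= 3 * srcStride; // taps cover rows -3..+4, like "sub r0, r4" in the asm

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int t = 0; t < 8; t++)
                sum += c[t] * src[x + t * srcStride];

            // Assumed rounding: pmulhrsw by 512 equals (sum + 32) >> 6.
            int val = (sum + 32) >> 6;
            dst[x] = static_cast<uint8_t>(std::min(255, std::max(0, val)));
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

Because every tap row sums to 64, a flat input block comes out unchanged, which makes a convenient sanity check when comparing against an asm version.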
[x265] Fwd: [PATCH] refactorizaton of the transform/quant path
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Tue, Nov 18, 2014 at 11:31 PM
Subject: Re: [x265] [PATCH] refactorizaton of the transform/quant path
To: Development for x265 x265-devel@videolan.org

On 11/18, prav...@multicorewareinc.com wrote:
> # HG changeset patch
> # User Praveen Tiwari
> # Date 1416299427 -19800
> # Node ID 706fa4af912bc1610478de8f09a651ae3e58624c
> # Parent 2f0062f0791b822fa932712a56e6b0a14e976d91
> refactorizaton of the transform/quant path.
> This patch scales the DCT/IDCT coefficients down from int32_t to int16_t, as they fit in int16_t without introducing any encode error. This lets us clean up many intermediate DCT/IDCT buffers and improve encode efficiency for different CLI options, including noise reduction, by reducing data movement and fitting more coefficients into a single register for SIMD operations. This patch includes all necessary changes for the transform/quant path, including unit test code.

snip

>      for (int pass = 0; pass < 2; pass++)
> @@ -1564,7 +1418,7 @@
>       * still somewhat rare on end-user PCs we still compile and link these SSE3
>       * intrinsic SIMD functions */
>  #if !HIGH_BIT_DEPTH
> -    p.idct[IDCT_8x8] = idct8;
> +//    p.idct[IDCT_8x8] = idct8;
>      p.idct[IDCT_16x16] = idct16;
>      p.idct[IDCT_32x32] = idct32;
>  #endif

Getting the intrinsic idct8 re-enabled or coded in assembly should be a priority.

[MC] We don't have any SSE assembly for IDCT_16x16 and IDCT_32x32, only AVX2 assembly, which is why the intrinsic versions are enabled. (We have AVX2 assembly for these two functions, but since AVX2 is still somewhat rare on end-user PCs we still compile and link these SSE3 intrinsic SIMD functions.) I will also remove the disabled idct8 intrinsic code: we already have SSE and AVX2 assembly for it, so I think it is no longer useful.
-- Steve Borho
[x265] Fwd: [PATCH] refactorizaton of the transform/quant path
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Tue, Nov 18, 2014 at 11:35 PM
Subject: Re: [x265] [PATCH] refactorizaton of the transform/quant path
To: Development for x265 x265-devel@videolan.org

On 11/18, prav...@multicorewareinc.com wrote:
> # HG changeset patch
> # User Praveen Tiwari
> # Date 1416299427 -19800
> # Node ID 706fa4af912bc1610478de8f09a651ae3e58624c
> # Parent 2f0062f0791b822fa932712a56e6b0a14e976d91
> refactorizaton of the transform/quant path.
> This patch scales the DCT/IDCT coefficients down from int32_t to int16_t, as they fit in int16_t without introducing any encode error. This lets us clean up many intermediate DCT/IDCT buffers and improve encode efficiency for different CLI options, including noise reduction, by reducing data movement and fitting more coefficients into a single register for SIMD operations. This patch includes all necessary changes for the transform/quant path, including unit test code.

Testbench failure with this patch applied:

$ ./test/TestBench
Using random seed 546B89D8 8bpp
Testing primitives: SSE2
Testing primitives: SSE3
Testing primitives: SSSE3
Testing primitives: SSE4
denoiseDct: Failed!

Mac OS X x86_64 8bpp

I'm going to hold this patch until you can send a new patch to resolve this issue.

[MC] Can we disable just this one assembly primitive and push the patch, so that this and the other patches don't have to wait? Once this issue is fixed we can re-enable the denoise asm code.

-- Steve Borho
Re: [x265] [PATCH] disable denoiseDct asm code until fixed for Mac OS
My code does not modify any filter function, so this is surprising. I remember that a few weeks back there was a typo in the AVX2 filter code; I think this is the same issue.

On Wed, Nov 19, 2014 at 11:37 PM, Steve Borho st...@borho.org wrote:
> On 11/19, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari
> > # Date 1416402744 -19800
> > # Node ID 0ef14321fb144362b609d51f2d7c58f7db757ceb
> > # Parent 706fa4af912bc1610478de8f09a651ae3e58624c
> > disable denoiseDct asm code until fixed for Mac OS
>
> With denoise disabled, it finds the next failing primitive:
>
> $ ./test/TestBench
> Using random seed 546CDBE7 8bpp
> Testing primitives: SSE2
> Testing primitives: SSE3
> Testing primitives: SSSE3
> Testing primitives: SSE4
> Testing primitives: AVX
> Testing primitives: AVX2
> x265: asm primitive has failed. Go and fix that Right Now!
> luma_hpp[ 4x4]
>
> -- Steve Borho
Re: [x265] [PATCH 3 of 3] asm: AVX2 version luma_vpp[4x4], improve 391c - 302c
This crashes on vc11-x86-8bpp in Release mode. Min, can you check your code?

Regards,
Praveen

On Fri, Oct 31, 2014 at 4:16 AM, Min Chen chenm...@163.com wrote:
> # HG changeset patch
> # User Min Chen chenm...@163.com
> # Date 1414709200 25200
> # Node ID 5d0b20f6e4de0b59b8c3306793c7267e01b9a41b
> # Parent 529ff7eca135838dc50c227d52db97725a79f0db
> asm: AVX2 version luma_vpp[4x4], improve 391c - 302c
>
> diff -r 529ff7eca135 -r 5d0b20f6e4de source/common/x86/asm-primitives.cpp
> --- a/source/common/x86/asm-primitives.cpp Thu Oct 30 15:46:23 2014 -0700
> +++ b/source/common/x86/asm-primitives.cpp Thu Oct 30 15:46:40 2014 -0700
> @@ -1799,6 +1799,7 @@
>      p.transpose[BLOCK_64x64] = x265_transpose64_avx2;
>  #endif
>      p.luma_hpp[LUMA_4x4] = x265_interp_8tap_horiz_pp_4x4_avx2;
> +    p.luma_vpp[LUMA_4x4] = x265_interp_8tap_vert_pp_4x4_avx2;
>      }
>  #endif // if HIGH_BIT_DEPTH
>  }
> diff -r 529ff7eca135 -r 5d0b20f6e4de source/common/x86/ipfilter8.asm
> --- a/source/common/x86/ipfilter8.asm Thu Oct 30 15:46:23 2014 -0700
> +++ b/source/common/x86/ipfilter8.asm Thu Oct 30 15:46:40 2014 -0700
> @@ -3420,6 +3420,88 @@
>      RET
>  %endmacro
> +
> +INIT_YMM avx2
> +cglobal interp_8tap_vert_pp_4x4, 4,6,8
> +    mov r4d, r4m
> +    lea r5, [r1 * 3]
> +    sub r0, r5
> +
> +    ; TODO: VPGATHERDD
> +    movd xm1, [r0]                  ; m1 = row0
> +    movd xm2, [r0 + r1]             ; m2 = row1
> +    punpcklbw xm1, xm2              ; m1 = [13 03 12 02 11 01 10 00]
> +
> +    movd xm3, [r0 + r1 * 2]         ; m3 = row2
> +    punpcklbw xm2, xm3              ; m2 = [23 13 22 12 21 11 20 10]
> +    movd xm4, [r0 + r5]
> +    punpcklbw xm3, xm4              ; m3 = [33 23 32 22 31 21 30 20]
> +    punpcklwd xm1, xm3              ; m1 = [33 23 13 03 32 22 12 02 31 21 11 01 30 20 10 00]
> +
> +    lea r0, [r0 + r1 * 4]
> +    movd xm5, [r0]                  ; m5 = row4
> +    punpcklbw xm4, xm5              ; m4 = [43 33 42 32 41 31 40 30]
> +    punpcklwd xm2, xm4              ; m2 = [43 33 21 13 42 32 22 12 41 31 21 11 40 30 20 10]
> +    vinserti128 m1, m1, xm2, 1      ; m1 = [43 33 21 13 42 32 22 12 41 31 21 11 40 30 20 10] - [33 23 13 03 32 22 12 02 31 21 11 01 30 20 10 00]
> +    movd xm2, [r0 + r1]             ; m2 = row5
> +    punpcklbw xm5, xm2              ; m5 = [53 43 52 42 51 41 50 40]
> +    punpcklwd xm3, xm5              ; m3 = [53 43 44 23 52 42 32 22 51 41 31 21 50 40 30 20]
> +    movd xm6, [r0 + r1 * 2]         ; m6 = row6
> +    punpcklbw xm2, xm6              ; m2 = [63 53 62 52 61 51 60 50]
> +    punpcklwd xm4, xm2              ; m4 = [63 53 43 33 62 52 42 32 61 51 41 31 60 50 40 30]
> +    vinserti128 m3, m3, xm4, 1      ; m3 = [63 53 43 33 62 52 42 32 61 51 41 31 60 50 40 30] - [53 43 44 23 52 42 32 22 51 41 31 21 50 40 30 20]
> +    movd xm4, [r0 + r5]             ; m4 = row7
> +    punpcklbw xm6, xm4              ; m6 = [73 63 72 62 71 61 70 60]
> +    punpcklwd xm5, xm6              ; m5 = [73 63 53 43 72 62 52 42 71 61 51 41 70 60 50 40]
> +
> +    lea r0, [r0 + r1 * 4]
> +    movd xm7, [r0]                  ; m7 = row8
> +    punpcklbw xm4, xm7              ; m4 = [83 73 82 72 81 71 80 70]
> +    punpcklwd xm2, xm4              ; m2 = [83 73 63 53 82 72 62 52 81 71 61 51 80 70 60 50]
> +    vinserti128 m5, m5, xm2, 1      ; m5 = [83 73 63 53 82 72 62 52 81 71 61 51 80 70 60 50] - [73 63 53 43 72 62 52 42 71 61 51 41 70 60 50 40]
> +    movd xm2, [r0 + r1]             ; m2 = row9
> +    punpcklbw xm7, xm2              ; m7 = [93 83 92 82 91 81 90 80]
> +    punpcklwd xm6, xm7              ; m6 = [93 83 73 63 92 82 72 62 91 81 71 61 90 80 70 60]
> +    movd xm7, [r0 + r1 * 2]         ; m7 = rowA
> +    punpcklbw xm2, xm7              ; m2 = [A3 93 A2 92 A1 91 A0 90]
> +    punpcklwd xm4, xm2              ; m4 = [A3 93 83 73 A2 92 82 72 A1 91 81 71 A0 90 80 70]
> +    vinserti128 m6, m6, xm4, 1      ; m6 = [A3 93 83 73 A2 92 82 72 A1 91 81 71 A0 90 80 70] - [93 83 73 63 92 82 72 62 91 81 71 61 90 80 70 60]
> +
> +    ; load filter coeff
> +%ifdef PIC
> +    lea r5, [tab_LumaCoeff]
> +    vpbroadcastd m0, [r5 + r4 * 8 + 0]
> +    vpbroadcastd m2, [r5 + r4 * 8 + 4]
> +%else
> +    vpbroadcastq m0, [tab_LumaCoeff + r4 * 8 + 0]
> +    vpbroadcastd m2, [tab_LumaCoeff + r4 * 8 + 4]
> +%endif
> +
> +    pmaddubsw m1, m0
> +    pmaddubsw m3, m0
> +    pmaddubsw m5, m2
> +    pmaddubsw m6, m2
> +
[x265] Fwd: [PATCH] weight_pp avx2 asm code, improved from 8608.65 cycles to 5138.09 cycles over sse version of asm code
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Oct 17, 2014 at 3:11 AM
Subject: Re: [x265] [PATCH] weight_pp avx2 asm code, improved from 8608.65 cycles to 5138.09 cycles over sse version of asm code
To: Development for x265 x265-devel@videolan.org

At 2014-10-16 17:20:13, prav...@multicorewareinc.com wrote:
> # HG changeset patch
> # User Praveen Tiwari
> # Date 1413451199 -19800
> # Node ID 858be8d7d7176ab6c6d01cf92d00c8478fe99b34
> # Parent 79702581ec824a2a375aebe228d69c3930aeea96
> weight_pp avx2 asm code, improved from 8608.65 cycles to 5138.09 cycles over sse version of asm code
>
> diff -r 79702581ec82 -r 858be8d7d717 source/common/x86/pixel-util8.asm
> --- a/source/common/x86/pixel-util8.asm Wed Oct 15 17:49:35 2014 -0500
> +++ b/source/common/x86/pixel-util8.asm Thu Oct 16 14:49:59 2014 +0530
> @@ -1375,6 +1375,60 @@
>      RET
>
> +INIT_YMM avx2
> +cglobal weight_pp, 6, 7, 6
> +
> +    mov r6d, r6m
> +    shl r6d, 6                      ; m0 = [w0<<6]
> +    movd xm0, r6d
> +
> +    movd xm1, r7m                   ; m1 = [round]
> +    punpcklwd xm0, xm1
> +    pshufd xm0, xm0, 0
> +    vinserti128 m0, m0, xm0, 1      ; assuming both (w0<<6) and round are using maximum of 16 bits each, m0 = [w0<<6 round]

vpbroadcastd is better

Yeah, exactly. I tried to replace (pshufd xm0, xm0, 0) + (vinserti128 m0, m0, xm0, 1) with vpbroadcastd m0, xm0 (following the documented syntax, __m256i _mm256_broadcastd_epi32 (__m128i a)), but it throws a build error: invalid combination of opcode and operands.

Also, weight_pp is used in only four places, and all of them have the same stride in r2 and r3, so we can simplify the interface and free up more registers here. You can also combine W0 and Round in a general-purpose register to improve performance.

> +
> +    movd xm1, r8m
> +    vpbroadcastd m2, r9m
> +    mova m5, [pw_1]
> +    sub r2d, r4d
> +    sub r3d, r4d
> +
> +.loopH:
> +    mov r6d, r4d
> +    shr r6d, 4

Why do the shr every time?

> +.loopW:
> +    movu xm4, [r0]
> +    pmovzxbw m4, xm4

pmovzxbw doesn't need an aligned address

> +    punpcklwd m3, m4, m5
> +    pmaddwd m3, m0
> +    psrad m3, xm1
> +    paddd m3, m2
> +
> +    punpckhwd m4, m5
> +    pmaddwd m4, m0
> +    psrad m4, xm1
> +    paddd m4, m2
> +
> +    packssdw m3, m4
> +    vextracti128 xm4, m3, 1
> +    packuswb m3, m4

How about vpermq + packuswb (xm3)?

> +    movu [r1], xm3
> +
> +    add r0, 16
> +    add r1, 16
> +
> +    dec r6d
> +    jnz .loopW
> +
> +    lea r0, [r0 + r2]
> +    lea r1, [r1 + r3]
> +
> +    dec r5d
> +    jnz .loopH
> +
> +    RET
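As a reading aid for the asm under review, the per-pixel arithmetic it performs can be sketched in scalar C++. This is only a model derived from the instructions quoted above (pmaddwd of [pixel, 1] with [w0<<6, round], then psrad, paddd, and the two saturating packs); the function name weightPP and the parameter names are hypothetical and not taken from x265's headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the weighted-prediction loop in the patch:
//   dst = clip8(((src * (w0 << 6) + round) >> shift) + offset)
void weightPP(const uint8_t* src, uint8_t* dst,
              ptrdiff_t srcStride, ptrdiff_t dstStride,
              int width, int height,
              int w0, int round, int shift, int offset)
{
    assert(shift >= 0 && shift < 31);
    const int w = w0 * 64; // "shl r6d, 6"

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            // pmaddwd on the pair [pixel, 1] x [w, round] gives pixel * w + round.
            int v = ((src[x] * w + round) >> shift) + offset;
            // packssdw followed by packuswb ends up clamping to 0..255.
            dst[x] = static_cast<uint8_t>(std::min(255, std::max(0, v)));
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

With w0 = 1, round = 32, shift = 6 and offset = 0 the model reduces to a plain copy, which is an easy first check before trying real weights.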
Re: [x265] [PATCH] noiseReduction: make noiseReduction deterministic for a given number of frameEncoders
It seems we missed something here. I tested this patch at my end: the outputs are deterministic with --pmode, but still non-deterministic without the --pmode option. Steve/Deepthi, please verify at your end before pushing it. I used the following command lines:

y4mInputs\park_joy_1280x720p50.y4m --tune=ssim --psnr --asm=false --nr=1000 --hash 1 --input-depth 8 --preset ultrafast -o outputFiles\park_joy-c2_nr.out [Non-deterministic]

y4mInputs\park_joy_1280x720p50.y4m --tune=ssim --psnr --asm=false --nr=1000 --hash 1 --input-depth 8 --preset ultrafast --pmode -o outputFiles\park_joy-c1_nr.out [deterministic]

Regards,
Praveen

On Tue, Oct 14, 2014 at 4:54 PM, deep...@multicorewareinc.com wrote:
# HG changeset patch
# User Deepthi Nandakumar deep...@multicorewareinc.com
# Date 1413278604 -19800
#      Tue Oct 14 14:53:24 2014 +0530
# Node ID c6e786dbbfaa39822799d17e6c32d49c6141a7fb
# Parent 38b5733cc629dd16db770e6a93b4f994e13336f3
noiseReduction: make noiseReduction deterministic for a given number of frameEncoders.
diff -r 38b5733cc629 -r c6e786dbbfaa source/common/frame.cpp --- a/source/common/frame.cpp Tue Oct 14 14:35:30 2014 +0530 +++ b/source/common/frame.cpp Tue Oct 14 14:53:24 2014 +0530 @@ -43,6 +43,7 @@ m_picSym = NULL; m_reconRowCount.set(0); m_countRefEncoders = 0; +m_frameEncoderID = 0; memset(m_lowres, 0, sizeof(m_lowres)); m_next = NULL; m_prev = NULL; diff -r 38b5733cc629 -r c6e786dbbfaa source/common/frame.h --- a/source/common/frame.h Tue Oct 14 14:35:30 2014 +0530 +++ b/source/common/frame.h Tue Oct 14 14:53:24 2014 +0530 @@ -50,6 +50,7 @@ TComPicSym* m_picSym; TComPicYuv* m_reconPicYuv; int m_POC; +int m_frameEncoderID; // To identify the ID of the frameEncoder processing this frame //** Frame Parallelism - notification between FrameEncoders of available motion reference rows ** ThreadSafeInteger m_reconRowCount; // count of CTU rows completely reconstructed and extended for motion reference diff -r 38b5733cc629 -r c6e786dbbfaa source/common/quant.cpp --- a/source/common/quant.cpp Tue Oct 14 14:35:30 2014 +0530 +++ b/source/common/quant.cpp Tue Oct 14 14:53:24 2014 +0530 @@ -156,6 +156,7 @@ m_resiDctCoeff = NULL; m_fencDctCoeff = NULL; m_fencShortBuf = NULL; +m_nr = NULL; } bool Quant::init(bool useRDOQ, double psyScale, const ScalingList scalingList, Entropy entropy) diff -r 38b5733cc629 -r c6e786dbbfaa source/encoder/analysis.cpp --- a/source/encoder/analysis.cpp Tue Oct 14 14:35:30 2014 +0530 +++ b/source/encoder/analysis.cpp Tue Oct 14 14:53:24 2014 +0530 @@ -292,7 +292,11 @@ if (!jobId || m_param-rdLevel 4) { slave-m_quant.setQPforQuant(cu); -slave-m_quant.m_nr = m_quant.m_nr; +if(m_param-noiseReduction) +{ +int frameEncoderID = cu-m_pic-m_frameEncoderID; +slave-m_quant.m_nr = m_tld[threadId].m_nr[frameEncoderID]; +} slave-m_rdContexts[depth].cur.load(m_rdContexts[depth].cur); } } diff -r 38b5733cc629 -r c6e786dbbfaa source/encoder/analysis.h --- a/source/encoder/analysis.h Tue Oct 14 14:35:30 2014 +0530 +++ b/source/encoder/analysis.h Tue Oct 14 
14:53:24 2014 +0530 @@ -172,7 +172,9 @@ struct ThreadLocalData { Analysis analysis; - +NoiseReduction *m_nr; + +ThreadLocalData() { m_nr = NULL; } ~ThreadLocalData() { analysis.destroy(); } }; diff -r 38b5733cc629 -r c6e786dbbfaa source/encoder/encoder.cpp --- a/source/encoder/encoder.cppTue Oct 14 14:35:30 2014 +0530 +++ b/source/encoder/encoder.cppTue Oct 14 14:53:24 2014 +0530 @@ -74,6 +74,7 @@ m_csvfpt = NULL; m_param = NULL; m_threadPool = 0; +m_numThreadLocalData = 0; } void Encoder::create() @@ -162,15 +163,17 @@ /* Allocate thread local data, one for each thread pool worker and * if --no-wpp, one for each frame encoder */ -int numLocalData = poolThreadCount; +m_numThreadLocalData = poolThreadCount; if (!m_param-bEnableWavefront) -numLocalData += m_param-frameNumThreads; -m_threadLocalData = new ThreadLocalData[numLocalData]; -for (int i = 0; i numLocalData; i++) +m_numThreadLocalData += m_param-frameNumThreads; +m_threadLocalData = new ThreadLocalData[m_numThreadLocalData]; +for (int i = 0; i m_numThreadLocalData; i++) { m_threadLocalData[i].analysis.setThreadPool(m_threadPool); m_threadLocalData[i].analysis.initSearch(m_param, m_scalingList); m_threadLocalData[i].analysis.create(g_maxCUDepth + 1, g_maxCUSize, m_threadLocalData); +if(m_param-noiseReduction) +m_threadLocalData[i].m_nr = new NoiseReduction[m_param-frameNumThreads]; } if
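As far as the truncated diff shows, the patch gives every worker's thread-local data one noise-reduction slot per frame encoder and selects the slot through the frame's m_frameEncoderID. A minimal sketch of that indexing scheme follows. NoiseStats, WorkerData and statsFor are hypothetical stand-ins, not x265's NoiseReduction or ThreadLocalData types, and the sketch says nothing about whether this alone is sufficient for deterministic output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-frame-encoder noise reduction state.
struct NoiseStats
{
    long long residualSum = 0;
    int count = 0;
};

// Hypothetical stand-in for per-worker thread-local data: one NoiseStats
// slot for each frame encoder, mirroring "new NoiseReduction[frameNumThreads]".
struct WorkerData
{
    std::vector<NoiseStats> nr;
    explicit WorkerData(int numFrameEncoders)
        : nr(static_cast<std::size_t>(numFrameEncoders)) {}
};

// Select the slot for (worker thread, frame encoder), in the spirit of
// m_tld[threadId].m_nr[frameEncoderID] from the patch.
inline NoiseStats& statsFor(std::vector<WorkerData>& workers, int threadId, int frameEncoderId)
{
    assert(threadId >= 0 && static_cast<std::size_t>(threadId) < workers.size());
    WorkerData& w = workers[static_cast<std::size_t>(threadId)];
    assert(frameEncoderId >= 0 && static_cast<std::size_t>(frameEncoderId) < w.nr.size());
    return w.nr[static_cast<std::size_t>(frameEncoderId)];
}
```

The point of the two-level index is that statistics gathered while a worker helps one frame encoder are never mixed into the state used for a frame owned by another frame encoder.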
[x265] Fwd: [PATCH] denoiseDct: unit test code
-- Forwarded message -- From: Steve Borho st...@borho.org Date: Mon, Sep 15, 2014 at 4:28 PM Subject: Re: [x265] [PATCH] denoiseDct: unit test code To: Development for x265 x265-devel@videolan.org On 09/15, prav...@multicorewareinc.com wrote: # HG changeset patch # User Praveen Tiwari # Date 1410775657 -19800 # Node ID 36f5477f54ba8047f9abc1b42c5b56c6d223dc5a # Parent 184e56afa951815f4e295b4fcce094ee03361a2e denoiseDct: unit test code a few nits and questions diff -r 184e56afa951 -r 36f5477f54ba source/test/mbdstharness.cpp --- a/source/test/mbdstharness.cppFri Sep 12 12:02:46 2014 +0530 +++ b/source/test/mbdstharness.cppMon Sep 15 15:37:37 2014 +0530 @@ -66,14 +66,17 @@ short_test_buff[0][i]= (rand() PIXEL_MAX) - (rand() PIXEL_MAX); int_test_buff[0][i] = rand() % PIXEL_MAX; int_idct_test_buff[0][i] = (rand() % (SHORT_MAX - SHORT_MIN)) - SHORT_MAX; +int_denoise_test_buff1[0][i] = int_denoise_test_buff2[0][i] = (rand() UNSIGNED_SHORT_MAX) - (rand() UNSIGNED_SHORT_MAX); short_test_buff[1][i]= -PIXEL_MAX; int_test_buff[1][i] = -PIXEL_MAX; int_idct_test_buff[1][i] = SHORT_MIN; +int_denoise_test_buff1[1][i] = int_denoise_test_buff2[1][i] = -UNSIGNED_SHORT_MAX; short_test_buff[2][i]= PIXEL_MAX; int_test_buff[2][i] = PIXEL_MAX; int_idct_test_buff[2][i] = SHORT_MAX; +int_denoise_test_buff1[2][i] = int_denoise_test_buff2[1][i] = UNSIGNED_SHORT_MAX; mbuf1[i] = rand() PIXEL_MAX; mbufdct[i] = (rand() PIXEL_MAX) - (rand() PIXEL_MAX); @@ -313,6 +316,46 @@ return true; } +bool MBDstHarness::check_denoise_dct_primitive(denoiseDct_t ref, denoiseDct_t opt) +{ +int j = 0; + +for (int i = 0; i 4; i++) +{ +int log2TrSize = i + 2; +int num = 1 (log2TrSize * 2); This loop second confuses me? what's the point of it? 
+for (int n = 0; n = num; n++) +{ +memset(mubuf1, 0, num * sizeof(uint32_t)); +memset(mubuf2, 0, num * sizeof(uint32_t)); +memset(mushortbuf1, 0, num * sizeof(uint16_t)); + +for (int k = 0; k n; j++) +{ +mushortbuf1[k] = rand() % UNSIGNED_SHORT_MAX; +} we don't use braces for single-line expressions +int index = rand() % TEST_CASES; +int cmp_size = sizeof(int) * num; + +ref(int_denoise_test_buff1[index] + j, mubuf1, mushortbuf1, num); +checked(opt, int_denoise_test_buff2[index] + j, mubuf2, mushortbuf1, num); + +if (memcmp(int_denoise_test_buff1[index] + j, int_denoise_test_buff2[index] + j, cmp_size)) +return false; white-space +if (memcmp(mubuf1, mubuf2, cmp_size)) +return false; + +reportfail(); +j += INCR; is this bounds safe? TEST_BUF_SIZE is allocated for a max of ITERS iterations (128). It seems like num can be 32*32. +} +} + +return true; +} + bool MBDstHarness::testCorrectness(const EncoderPrimitives ref, const EncoderPrimitives opt) { for (int i = 0; i NUM_DCTS; i++) @@ -393,6 +436,15 @@ } } +if (opt.denoiseDct) +{ +if (!check_denoise_dct_primitive(ref.denoiseDct, opt.denoiseDct)) +{ +printf(denoiseDct: Failed!\n); +return false; +} +} + return true; } @@ -448,4 +500,10 @@ REPORT_SPEEDUP(opt.count_nonzero, ref.count_nonzero, mbuf1, i * i) } } + +if (opt.denoiseDct) +{ +printf(denoiseDct\t\t); +REPORT_SPEEDUP(opt.denoiseDct, ref.denoiseDct, int_denoise_test_buff1[0], mubuf1, mushortbuf1, 32 * 32); +} } diff -r 184e56afa951 -r 36f5477f54ba source/test/mbdstharness.h --- a/source/test/mbdstharness.h Fri Sep 12 12:02:46 2014 +0530 +++ b/source/test/mbdstharness.h Mon Sep 15 15:37:37 2014 +0530 @@ -44,6 +44,10 @@ int16_t mbufdct[TEST_BUF_SIZE]; int mbufidct[TEST_BUF_SIZE]; +ALIGN_VAR_32(uint32_t, mubuf1[MAX_TU_SIZE]); +ALIGN_VAR_32(uint32_t, mubuf2[MAX_TU_SIZE]); +ALIGN_VAR_32(uint16_t, mushortbuf1[MAX_TU_SIZE]); does denoise need all new buffers? can it reuse existing buffers? 
I need unsigned buffers, so I preferred to allocate new ones rather than reinterpret the signed buffers as unsigned through type casts. I have addressed the rest of the comments in my updated patch.

There's no need to declare them aligned here. The first array is declared aligned, and since all the arrays below it are also aligned in size, every array is implicitly aligned.

int16_t mshortbuf2[MAX_TU_SIZE];
int16_t mshortbuf3[MAX_TU_SIZE];
@@ -56,6 +60,9 @@
int int_test_buff[TEST_CASES][TEST_BUF_SIZE];
int int_idct_test_buff[TEST_CASES][TEST_BUF_SIZE];
+int int_denoise_test_buff1[TEST_CASES
Re: [x265] [PATCH] copy_cnt: enable avx2 version of asm code
You can push 16x16 and 32x32 as well. Their performance is good, though they need a bit more improvement; I will send an improvement patch soon.

Regards,
Praveen Tiwari

On Thu, Sep 11, 2014 at 11:29 AM, Deepthi Nandakumar deep...@multicorewareinc.com wrote:
> Would be better to combine this asm enable with the corresponding asm patch itself. I have pushed copy_cnt8, and enabled only that for now.
>
> On Wed, Sep 10, 2014 at 3:28 PM, prav...@multicorewareinc.com wrote:
> > # HG changeset patch
> > # User Praveen Tiwari
> > # Date 1410343073 -19800
> > # Node ID 2cd4a13086740728559fde3a176953e9aa4c0782
> > # Parent 7bc4db02ccc728f6e2ddedd036c96e3d37b90f22
> > copy_cnt: enable avx2 version of asm code
> >
> > diff -r 7bc4db02ccc7 -r 2cd4a1308674 source/common/x86/asm-primitives.cpp
> > --- a/source/common/x86/asm-primitives.cpp Wed Sep 10 14:45:33 2014 +0530
> > +++ b/source/common/x86/asm-primitives.cpp Wed Sep 10 15:27:53 2014 +0530
> > @@ -1724,14 +1724,10 @@
> >      p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_avx2;
> >      p.ssd_s[BLOCK_32x32] = x265_pixel_ssd_s_32_avx2;
> >
> > -    /* Need to update assembly code as per changed interface of the copy_cnt primitive, once
> > -     * code is updated, avx2 version will be enabled */
> > -/*
> >      p.copy_cnt[BLOCK_4x4] = x265_copy_cnt_4_avx2;
> >      p.copy_cnt[BLOCK_8x8] = x265_copy_cnt_8_avx2;
> >      p.copy_cnt[BLOCK_16x16] = x265_copy_cnt_16_avx2;
> >      p.copy_cnt[BLOCK_32x32] = x265_copy_cnt_32_avx2;
> > -*/
> >
> >      p.cvt32to16_shl[BLOCK_4x4] = x265_cvt32to16_shl_4_avx2;
> >      p.cvt32to16_shl[BLOCK_8x8] = x265_cvt32to16_shl_8_avx2;
Re: [x265] [PATCH] removed copy_cnt_4 avx2 asm code: SSE version is eualy faster
Ignore it; I need to correct the commit message.

Regards,
Praveen Tiwari

On Thu, Sep 11, 2014 at 4:41 PM, prav...@multicorewareinc.com wrote:
> # HG changeset patch
> # User Praveen Tiwari
> # Date 1410433904 -19800
> # Node ID 5740ec22db67267bfca97fbba07ef9239802d2b0
> # Parent 012f315d3eda8044f5a49865e15ba2943fbab094
> removed copy_cnt_4 avx2 asm code: SSE version is eualy faster
>
> diff -r 012f315d3eda -r 5740ec22db67 source/common/x86/asm-primitives.cpp
> --- a/source/common/x86/asm-primitives.cpp Wed Sep 10 17:27:20 2014 +0200
> +++ b/source/common/x86/asm-primitives.cpp Thu Sep 11 16:41:44 2014 +0530
> @@ -1730,7 +1730,6 @@
>      /* Need to update assembly code as per changed interface of the copy_cnt primitive, once
>       * code is updated, avx2 version will be enabled */
>
> -//    p.copy_cnt[BLOCK_4x4] = x265_copy_cnt_4_avx2;
>      p.copy_cnt[BLOCK_8x8] = x265_copy_cnt_8_avx2;
>  //    p.copy_cnt[BLOCK_16x16] = x265_copy_cnt_16_avx2;
>  //    p.copy_cnt[BLOCK_32x32] = x265_copy_cnt_32_avx2;
> diff -r 012f315d3eda -r 5740ec22db67 source/common/x86/blockcopy8.asm
> --- a/source/common/x86/blockcopy8.asm Wed Sep 10 17:27:20 2014 +0200
> +++ b/source/common/x86/blockcopy8.asm Thu Sep 11 16:41:44 2014 +0530
> @@ -3987,35 +3987,6 @@
>  %endif
>      RET
> -
> -INIT_YMM avx2
> -cglobal copy_cnt_4, 3,3,3
> -    add r2d, r2d
> -    xorpd xm2, xm2
> -
> -    ; row 0 1
> -    movq xm0, [r1]
> -    movhps xm0, [r1 + r2]
> -
> -    ; row 2 3
> -    movq xm1, [r1 + r2 * 2]
> -    lea r2, [r2 * 3]
> -    movhps xm1, [r1 + r2]
> -
> -    vinserti128 m0, m0, xm1, 1
> -    movu [r0], m0
> -
> -    vextractf128 xm1, m0, 1
> -    packsswb xm0, xm1
> -    pcmpeqb xm0, xm2
> -
> -    ; get count
> -    pmovmskb eax, xm0
> -    not ax
> -    popcnt ax, ax
> -    RET
> -
> -
>  ;--
>  ; uint32_t copy_cnt(int16_t *dst, int16_t *src, intptr_t stride);
>  ;--
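For context on what the removed routine does, here is a small scalar sketch that follows the quoted asm and the signature comment at the end of the hunk: it copies a strided 4x4 block of int16_t coefficients into a contiguous destination and returns the number of non-zero entries. The name copyCnt4 is made up for this sketch; it is not the C reference from the x265 tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of copy_cnt for a 4x4 block.
// stride is counted in int16_t elements (the asm doubles it to get bytes);
// dst is written as 16 contiguous values, like the single 32-byte movu.
uint32_t copyCnt4(int16_t* dst, const int16_t* src, ptrdiff_t stride)
{
    uint32_t nonZero = 0;
    for (int y = 0; y < 4; y++)
    {
        for (int x = 0; x < 4; x++)
        {
            int16_t v = src[y * stride + x];
            dst[y * 4 + x] = v;
            // The asm narrows with packsswb first; signed saturation never
            // turns a non-zero value into zero, so the count is unaffected.
            nonZero += (v != 0) ? 1u : 0u;
        }
    }
    return nonZero;
}
```

A scalar model like this is handy when checking an SSE or AVX2 variant against values such as 256 or -256, where a plain truncation to 8 bits would wrongly count the coefficient as zero.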
[x265] Fwd: Fwd: [PATCH] copy_cnt_4: faster AVX2 code
-- Forwarded message --
From: chen chenm...@163.com
Date: Wed, Sep 10, 2014 at 12:14 PM
Subject: Re: [x265] Fwd: [PATCH] copy_cnt_4: faster AVX2 code
To: Development for x265 x265-devel@videolan.org

At 2014-09-10 09:34:31, Praveen Tiwari prav...@multicorewareinc.com wrote:
> -- Forwarded message --
> From: chen chenm...@163.com
> Date: Tue, Sep 9, 2014 at 10:17 AM
> Subject: Re: [x265] [PATCH] copy_cnt_4: faster AVX2 code
> To: Development for x265 x265-devel@videolan.org
>
> Most of the operations are SSE2, with just one movu; why do we need an AVX2 version for 4x4? What about vinserti128?

You want to use vinserti128 to combine two 128-bit halves into 256 bits; does that cost more than two movu?

I tested both the SSE and AVX2 code on a Haswell i5 machine. The AVX2 code seems a bit faster, so I think we should keep both versions. Here are the results of 3 runs:

SSE version:
copy_cnt[4x4]  4.21x  110.16  463.86
copy_cnt[4x4]  4.18x  104.64  437.08
copy_cnt[4x4]  4.17x  110.23  460.02

AVX2 version:
copy_cnt[4x4]  4.71x   99.23  467.63
copy_cnt[4x4]  4.39x  104.46  458.58
copy_cnt[4x4]  4.71x   99.27  467.91
[x265] Fwd: [PATCH] copy_cnt_4: faster AVX2 code
-- Forwarded message --
From: chen chenm...@163.com
Date: Tue, Sep 9, 2014 at 10:17 AM
Subject: Re: [x265] [PATCH] copy_cnt_4: faster AVX2 code
To: Development for x265 x265-devel@videolan.org

Most of the operations are SSE2, with just one movu; why do we need an AVX2 version for 4x4? What about vinserti128?

At 2014-09-09 16:37:23, prav...@multicorewareinc.com wrote:
> # HG changeset patch
> # User Praveen Tiwari
> # Date 1410251834 -19800
> # Node ID d011073f35258cb2f0ad95db6038c2d9fb840b27
> # Parent ebb84e9dbb0fa0e8c4c9304b2efd57f8ac3d0c05
> copy_cnt_4: faster AVX2 code
>
> diff -r ebb84e9dbb0f -r d011073f3525 source/common/x86/blockcopy8.asm
> --- a/source/common/x86/blockcopy8.asm Tue Sep 09 11:36:58 2014 +0530
> +++ b/source/common/x86/blockcopy8.asm Tue Sep 09 14:07:14 2014 +0530
> @@ -3990,7 +3990,7 @@
>  INIT_YMM avx2
>  cglobal copy_cnt_4, 3,3,3
>      add r2d, r2d
> -    xorpd xm2, xm2
> +    xorpd m2, m2
>
>      ; row 0 1
>      movq xm0, [r1]
> @@ -4004,11 +4004,9 @@
>      vinserti128 m0, m0, xm1, 1
>      movu [r0], m0
>
> -    vextractf128 xm1, m0, 1
> -    packsswb xm0, xm1
> -    pcmpeqb xm0, xm2
> -
>      ; get count
> +    packsswb xm0, xm1
> +    pcmpeqb xm0, xm2
>      pmovmskb eax, xm0
>      not ax
>      popcnt ax, ax
Re: [x265] [PATCH] count_nonzero primitive, downscaling quantCoeff from int32_t* to int16_t*
Thanks, just sent a fix for it.

Regards,
Praveen

On Tue, Aug 12, 2014 at 7:18 PM, chen chenm...@163.com wrote:
> > -X265_CHECK((int)numSig == primitives.count_nonzero(coeff, 1 << log2TrSize * 2), "numSig differ\n");
> > +/* This section of code is to safely convert int32_t coefficients to int16_t, once the caller function is
> > + * optimize to take coefficients as int16_t*, it will be cleanse.*/
> > +int numCoeff = (1 << (log2TrSize * 2));
> > +assert(numCoeff <= 1024);
> > +ALIGN_VAR_16(int16_t, qCoeff[32 * 32]);
> > +for (int i = 0; i < numCoeff; i++)
> > +{
> > +qCoeff[i] = ( coeff[i] & 0x);
> > +}
>
> I suggest clipping the value instead, to avoid value problems (e.g. 0x1 becomes zero); clipping also matches what the asm instructions do.
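The difference between the two narrowing strategies discussed in this message can be shown with a short sketch. Both helper names are hypothetical, and the mask 0xFFFF and the sample value 0x10000 are assumptions chosen for illustration, since the constants in the quoted patch and comment are truncated in the archive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Narrow by masking (assumed mask 0xFFFF): keeps only the low 16 bits, so
// a large coefficient can wrap around and even become zero.
inline int16_t narrowByMask(int32_t v)
{
    return static_cast<int16_t>(static_cast<uint16_t>(v & 0xFFFF));
}

// Narrow by clipping: saturates to the int16_t range, so a non-zero
// coefficient stays non-zero. This matches saturating pack instructions.
inline int16_t narrowByClip(int32_t v)
{
    return static_cast<int16_t>(std::min<int32_t>(32767, std::max<int32_t>(-32768, v)));
}
```

For a primitive such as count_nonzero the distinction matters: under masking, an out-of-range coefficient like 0x10000 would be counted as zero, while under clipping it is still counted.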
Re: [x265] x265: uncommon behavior by changing the 8-point DCT matrix
I think you are testing with asm code enabled. The assembly code has its own tables; it has nothing to do with the constant 'g_t8' at source/Lib/TLibCommon/TComRom.cpp, which is used only by the C code. Check the dct8.asm file for the asm tables.

Regards,
Praveen Tiwari

On Wed, May 28, 2014 at 5:15 AM, Paulo André Oliveira oliveirapa...@globo.com wrote:
> Dear x265 development team,
>
> I am trying to conduct the following experiment: assess the change in the compressed video's quality by changing only the 8-point DCT matrix, which I suppose is the constant 'g_t8' at source/Lib/TLibCommon/TComRom.cpp
>
> However, the video's quality, which I am monitoring with the PSNR and SSIM metrics, stays the same with any random matrix that I define in 'g_t8'. I am using the latest version of x265 as of today.
>
> Sincerely,
> Paulo Oliveira
Re: [x265] Fwd: [PATCH] noise reduction feature, ported from x264
Yes, that is true; thanks for your suggestions. I scanned through a few papers to find where the following constants (used to generate the weight table) come from.

#define W(i) (i==0 ? FIX8(1.) :\
              i==1 ? FIX8(0.8859) :\
              i==2 ? FIX8(1.6000) :\
              i==3 ? FIX8(0.9415) :\
              i==4 ? FIX8(1.2651) :\
              i==5 ? FIX8(1.1910) :0)

It seems these values depend on the DCT coefficients too, so we need a new weight table for x265. I found that they are generated through the formula:

Qstep ≈ Vi8 / (Si8 * 2^8) (for an 8x8 block)

where the rescaling matrix Vi8 is (32, 28, 51, 30, 40, 38), i.e. the qp = 4 row of the following table:

QP  vm0  vm1  vm2  vm3  vm4  vm5
0    20   18   32   19   25   24
1    22   19   35   21   28   26
2    26   23   42   24   33   31
3    28   25   45   26   35   33
4    32   28   51   30   40   38
5    36   32   58   34   46   43

Si8 = 1/8 (0.125) (Si is really a matrix, but it seems the first element is chosen for transform normalization).

So, if we apply the above formula:

W(0) = 32 / (0.125 * 256) = 1      ≈ 1.
W(1) = 28 / (0.125 * 256) = 0.875  ≈ 0.8859
W(2) = 51 / (0.125 * 256) = 1.59   ≈ 1.6000
W(3) = 30 / (0.125 * 256) = 0.9375 ≈ 0.9415
W(4) = 40 / (0.125 * 256) = 1.25   ≈ 1.265
W(5) = 38 / (0.125 * 256) = 1.1875 ≈ 1.1910

Is my analysis going in the right direction? If it is, why is Vi8 chosen for qp = 4 and not for any other qp?

Finally, the weight table is arranged as:

W(0), W(3), W(4), W(3), W(0), W(3), W(4), W(3),
W(3), W(1), W(5), W(1), W(3), W(1), W(5), W(1),
W(4), W(5), W(2), W(5), W(4), W(5), W(2), W(5),
W(3), W(1), W(5), W(1), W(3), W(1), W(5), W(1),
W(0), W(3), W(4), W(3), W(0), W(3), W(4), W(3),
W(3), W(1), W(5), W(1), W(3), W(1), W(5), W(1),
W(4), W(5), W(2), W(5), W(4), W(5), W(2), W(5),
W(3), W(1), W(5), W(1), W(3), W(1), W(5), W(1)

What is the logic behind this arrangement?

Regards,
Praveen Tiwari

On Sat, May 10, 2014 at 8:12 AM, Jason Garrett-Glaser ja...@x264.com wrote:
> That isn't correct at all; the weights depend on the transforms, which depend on the video format. You can't just build a 16x16 out of 8x8s or 4x4s; you need to match the way the format works.
>
> Jason
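The arithmetic in the message above can be reproduced with a few lines of C++. This only re-computes the estimate V / (S * 2^8) from the quoted table under the message's own assumption S = 1/8; it does not show how the x264 constants were actually derived, and Jason's reply in this thread disputes the approach. The names kRescale8 and weightEstimate are hypothetical.

```cpp
#include <cassert>

// Rescaling values quoted in the message: rows are QP 0..5,
// columns are the coefficient classes vm0..vm5.
static const int kRescale8[6][6] = {
    { 20, 18, 32, 19, 25, 24 },
    { 22, 19, 35, 21, 28, 26 },
    { 26, 23, 42, 24, 33, 31 },
    { 28, 25, 45, 26, 35, 33 },
    { 32, 28, 51, 30, 40, 38 },
    { 36, 32, 58, 34, 46, 43 },
};

// Estimate from the message: W = V / (S * 2^8), with S = 1/8 assumed.
inline double weightEstimate(int qp, int cls)
{
    assert(qp >= 0 && qp < 6);
    assert(cls >= 0 && cls < 6);
    return kRescale8[qp][cls] / (0.125 * 256.0);
}
```

Evaluating weightEstimate(4, i) for i = 0..5 gives 1, 0.875, 1.59375, 0.9375, 1.25 and 1.1875, which are close to, but not equal to, the FIX8 constants in the macro.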
[x265] Fwd: [PATCH] noise reduction feature, ported from x264
-- Forwarded message --
From: Jason Garrett-Glaser ja...@x264.com
Date: Thu, May 8, 2014 at 5:08 PM
Subject: Re: [x265] [PATCH] noise reduction feature, ported from x264
To: Development for x265 x265-devel@videolan.org

This only seems to have 4x4 and 8x8 transform sizes; how does this work given that H.265 has many other transform sizes? What does it do for other transform sizes?

The 4x4 and 8x8 transform sizes are used as basic blocks to generate the bigger sizes (16x16, 32x32), as we have weight tables only for 4x4 and 8x8 (taken from x264).

Jason
Re: [x265] [PATCH] all_angs_pred_32x32, asm code improvement
This is a new patch with the same changes applied to other modes, but I gave it the same commit message; perhaps that is why it seems confusing. Do I need to send it as an attachment?

On Thu, Feb 27, 2014 at 4:28 PM, Deepthi Nandakumar deep...@multicorewareinc.com wrote:
The earlier patch was pushed, Praveen. Can you send a new patch which just removes the unused statements?
Re: [x265] [PATCH] all_angs_pred_32x32, asm code improvement
Oh, those lines were left in by mistake. I commented out the old code to test the correctness of the new code; I will update the patch.

On Thu, Feb 27, 2014 at 3:33 AM, chen chenm...@163.com wrote:
At 2014-02-26 20:28:52, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1393417704 -19800
# Node ID 7de2875c614058648475618d2b9faa5a9611225b
# Parent 53c7e3e789435a3e7b51f1ad61e9425f59ea6cf7
all_angs_pred_32x32, asm code improvement

@@ -23679,8 +23563,9 @@
     pmaddubsw m3, m1, m6
     pmulhrsw m3, m7
     pslldq m4, 2
-    pinsrb m4, [r4 + 8], 1
-    pinsrb m4, [r4 + 7], 0
+    ;pinsrb m4, [r4 + 8], 1
+    ;pinsrb m4, [r4 + 7], 0
+    pinsrw m4, [r4 + 7], 0

Please remove the unused commented-out lines.
Re: [x265] [PATCH] all_angs_pred_4x4, mova replace with pxor
Min,

I have sent the updated full patch.

Regards,
Praveen Tiwari

On Wed, Dec 4, 2013 at 8:58 PM, chen chenm...@163.com wrote:
Can you send a full patch, not a patch on top of a patch?

At 2013-12-04 22:50:05, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1386168592 -19800
# Node ID 52d604b17f7b6c7dedee4a5defcb8f089221b02b
# Parent c31e28cd26aa8a3f07ba0023a5923931cc687a2d
all_angs_pred_4x4, mova replace with pxor

diff -r c31e28cd26aa -r 52d604b17f7b source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm Wed Dec 04 20:05:57 2013 +0530
+++ b/source/common/x86/intrapred8.asm Wed Dec 04 20:19:52 2013 +0530
@@ -34,8 +34,6 @@
 c_trans_4x4 db 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15
-tab_Zero: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-
 const ang_table
 %assign x 0
 %rep 32
@@ -945,7 +943,7 @@
     pshufd m3, m2, 0
     movu [r0 + 128], m3
-    mova m3, [tab_Zero]
+    pxor m3, m3
     pshufb m4, m2, m3
     punpcklbw m4, m3
@@ -1347,7 +1345,7 @@
     pshufd m2, m1, 0
     movu [r0 + 384], m2
-    mova m2, [tab_Zero]
+    pxor m2, m2
     pshufb m3, m1, m2
     punpcklbw m3, m2
Re: [x265] [PATCH] asm-primitives.cpp, removed temporary function pointer initialization, generated through macro calls
Sorry, I removed the wrong pointer initialization. I will fix it in the next patch; please don't merge this one.

On Fri, Nov 22, 2013 at 4:34 PM, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1385118266 -19800
# Node ID f2b8bcaf435c00d835cd4389063ed09d22e7be28
# Parent 87a797d1c03afaea0b3cf9a2dfcac2c7e2950efc
asm-primitives.cpp, removed temporary function pointer initialization, generated through macro calls

diff -r 87a797d1c03a -r f2b8bcaf435c source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Fri Nov 22 15:47:02 2013 +0530
+++ b/source/common/x86/asm-primitives.cpp Fri Nov 22 16:34:26 2013 +0530
@@ -145,7 +145,8 @@
     p.chroma[X265_CSP_I420].filter_hpp[CHROMA_ ## W ## x ## H] = x265_interp_4tap_horiz_pp_ ## W ## x ## H ## cpu; \
     p.chroma[X265_CSP_I420].filter_hps[CHROMA_ ## W ## x ## H] = x265_interp_4tap_horiz_ps_ ## W ## x ## H ## cpu; \
     p.chroma[X265_CSP_I420].filter_vpp[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu; \
-    p.chroma[X265_CSP_I420].filter_vps[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu;
+    p.chroma[X265_CSP_I420].filter_vps[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu; \
+    p.chroma[X265_CSP_I420].add_ps[CHROMA_ ## W ## x ## H] = x265_pixel_add_ps_ ## W ## x ## H ## cpu;

 #define SETUP_CHROMA_SP_FUNC_DEF(W, H, cpu) \
     p.chroma[X265_CSP_I420].filter_vsp[CHROMA_ ## W ## x ## H] = x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu;
@@ -234,7 +235,8 @@
     p.luma_vpp[LUMA_ ## W ## x ## H] = x265_interp_8tap_vert_pp_ ## W ## x ## H ## cpu; \
     p.luma_vps[LUMA_ ## W ## x ## H] = x265_interp_8tap_vert_ps_ ## W ## x ## H ## cpu; \
     p.luma_copy_ps[LUMA_ ## W ## x ## H] = x265_blockcopy_ps_ ## W ## x ## H ## cpu; \
-    p.luma_sub_ps[LUMA_ ## W ## x ## H] = x265_pixel_sub_ps_ ## W ## x ## H ## cpu;
+    p.luma_sub_ps[LUMA_ ## W ## x ## H] = x265_pixel_sub_ps_ ## W ## x ## H ## cpu; \
+    p.luma_add_ps[LUMA_ ## W ## x ## H] = x265_pixel_add_ps_ ## W ## x ## H ## cpu;

 #define SETUP_LUMA_SP_FUNC_DEF(W, H, cpu) \
     p.luma_vsp[LUMA_ ## W ## x ## H] = x265_interp_8tap_vert_sp_ ## W ## x ## H ## cpu;
@@ -477,40 +479,6 @@
     CHROMA_SS_FILTERS(_sse2);
     LUMA_SS_FILTERS(_sse2);

-    // This function pointer initialization is temporary will be removed
-    // later with macro definitions. It is used to avoid linker errors
-    // until all partitions are coded and commit smaller patches, easier to
-    // review.
-
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x2] = x265_blockcopy_sp_4x2_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x4] = x265_blockcopy_sp_4x4_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x8] = x265_blockcopy_sp_4x8_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_4x16] = x265_blockcopy_sp_4x16_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x2] = x265_blockcopy_sp_8x2_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x4] = x265_blockcopy_sp_8x4_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x6] = x265_blockcopy_sp_8x6_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x8] = x265_blockcopy_sp_8x8_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_8x16] = x265_blockcopy_sp_8x16_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_12x16] = x265_blockcopy_sp_12x16_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x4] = x265_blockcopy_sp_16x4_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x8] = x265_blockcopy_sp_16x8_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x12] = x265_blockcopy_sp_16x12_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x16] = x265_blockcopy_sp_16x16_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_16x32] = x265_blockcopy_sp_16x32_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_24x32] = x265_blockcopy_sp_24x32_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x8] = x265_blockcopy_sp_32x8_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x16] = x265_blockcopy_sp_32x16_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x24] = x265_blockcopy_sp_32x24_sse2;
-    p.chroma[X265_CSP_I420].copy_sp[CHROMA_32x32] = x265_blockcopy_sp_32x32_sse2;
-
-    p.luma_copy_sp[LUMA_32x64] = x265_blockcopy_sp_32x64_sse2;
-    p.luma_copy_sp[LUMA_16x64] = x265_blockcopy_sp_16x64_sse2;
-    p.luma_copy_sp[LUMA_48x64] = x265_blockcopy_sp_48x64_sse2;
-    p.luma_copy_sp[LUMA_64x16] = x265_blockcopy_sp_64x16_sse2;
-    p.luma_copy_sp[LUMA_64x32] = x265_blockcopy_sp_64x32_sse2;
-    p.luma_copy_sp[LUMA_64x48] = x265_blockcopy_sp_64x48_sse2;
-    p.luma_copy_sp[LUMA_64x64] = x265_blockcopy_sp_64x64_sse2;
-
     p.blockfill_s[BLOCK_4x4] = x265_blockfill_s_4x4_sse2;
     p.blockfill_s[BLOCK_8x8] = x265_blockfill_s_8x8_sse2
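The patch above replaces a hand-written list of assignments with macros that build both the table index and the function name by token pasting. A minimal, self-contained sketch of that pattern follows. Every name in it (`PART_4x4`, `Table`, `area_4x4_demo`, `SETUP_AREA`, `setupTable`) is hypothetical and only illustrates the `##` mechanism; none of it is x265 code.

```cpp
// Hypothetical partition indices, standing in for LUMA_4x4, CHROMA_8x8 and so on.
enum { PART_4x4, PART_8x8, NUM_PARTS };

typedef int (*area_t)();

// Hypothetical function-pointer table, standing in for the primitives struct.
struct Table
{
    area_t area[NUM_PARTS];
};

// Two stand-in "primitives" whose names follow the <name>_<W>x<H><suffix> pattern.
static int area_4x4_demo() { return 4 * 4; }
static int area_8x8_demo() { return 8 * 8; }

// Token pasting builds both the enum name and the function name from W, H and the suffix.
#define SETUP_AREA(W, H, suffix) \
    t.area[PART_ ## W ## x ## H] = area_ ## W ## x ## H ## suffix;

static void setupTable(Table &t)
{
    SETUP_AREA(4, 4, _demo)
    SETUP_AREA(8, 8, _demo)
}
```

One consequence of this design is that a missing function shows up as a compile or link error for the exact partition name, which may be why the temporary per-partition list could be dropped once every partition had an implementation.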
Re: [x265] [PATCH] asm code for pixeladd_ps_4x4 and testbench integration
Merged, sent implementation. Regards, Praveen Tiwari On Wed, Nov 20, 2013 at 6:08 PM, chen chenm...@163.com wrote: At 2013-11-20 19:45:24,prav...@multicorewareinc.com wrote: # HG changeset patch # User Praveen Tiwari # Date 1384947915 -19800 # Node ID c1e556f54d61422d153ff67f4830dc62ddd9 # Parent a7fb47a7eddf18634449a5ac898f7c2d029048e9 asm code for pixeladd_ps_4x4 and testbench integration diff -r a7fb47a7eddf -r c1e556f54d61 source/common/CMakeLists.txt --- a/source/common/CMakeLists.txt Wed Nov 20 12:57:57 2013 +0530 +++ b/source/common/CMakeLists.txt Wed Nov 20 17:15:15 2013 +0530 @@ -113,7 +113,7 @@ if(ENABLE_PRIMITIVES_ASM) set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h) -set(A_SRCS pixel-a.asm const-a.asm cpu-a.asm sad-a.asm mc-a.asm mc-a2.asm ipfilter8.asm pixel-util.asm blockcopy8.asm) +set(A_SRCS pixel-a.asm const-a.asm cpu-a.asm sad-a.asm mc-a.asm mc-a2.asm ipfilter8.asm pixel-util.asm blockcopy8.asm pixeladd8.asm) if (NOT X64) set(A_SRCS ${A_SRCS} pixel-32.asm) endif() diff -r a7fb47a7eddf -r c1e556f54d61 source/common/x86/asm-primitives.cpp --- a/source/common/x86/asm-primitives.cpp Wed Nov 20 12:57:57 2013 +0530 +++ b/source/common/x86/asm-primitives.cpp Wed Nov 20 17:15:15 2013 +0530 @@ -633,6 +633,13 @@ p.calcrecon[BLOCK_32x32] = x265_calcRecons32_sse4; p.calcresidual[BLOCK_16x16] = x265_getResidual16_sse4; p.calcresidual[BLOCK_32x32] = x265_getResidual32_sse4; + +// This function pointer initialization is temporary will be removed +// later with macro definitions. It is used to avoid linker errors +// until all partitions are coded and commit smaller patches, easier to +// review. 
+ +p.chroma_add_ps[X265_CSP_I420][CHROMA_4x4] = x265_pixel_add_ps_4x4_sse4; } if (cpuMask X265_CPU_AVX) { diff -r a7fb47a7eddf -r c1e556f54d61 source/common/x86/pixel.h --- a/source/common/x86/pixel.h Wed Nov 20 12:57:57 2013 +0530 +++ b/source/common/x86/pixel.h Wed Nov 20 17:15:15 2013 +0530 @@ -313,7 +313,8 @@ SETUP_CHROMA_PIXELSUB_PS_FUNC(8, 32, cpu); #define SETUP_LUMA_PIXELSUB_PS_FUNC(W, H, cpu) \ -void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); +void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);\ +void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel *dest, int destride, pixel *src0, int16_t *scr1, int srcStride0, int srcStride1); #define LUMA_PIXELSUB_DEF(cpu) \ SETUP_LUMA_PIXELSUB_PS_FUNC(4, 4, cpu); \ @@ -342,6 +343,8 @@ SETUP_LUMA_PIXELSUB_PS_FUNC(64, 16, cpu); \ SETUP_LUMA_PIXELSUB_PS_FUNC(16, 64, cpu); +//void x265_pixeladd_ps_4x4_sse4(pixel *dest, int destride, pixel *src0, int16_t *scr1, int srcStride0, int srcStride1); + remove unused line CHROMA_PIXELSUB_DEF(_sse4); LUMA_PIXELSUB_DEF(_sse4); diff -r a7fb47a7eddf -r c1e556f54d61 source/common/x86/pixeladd8.asm --- /dev/nullThu Jan 01 00:00:00 1970 + +++ b/source/common/x86/pixeladd8.asmWed Nov 20 17:15:15 2013 +0530 @@ -0,0 +1,79 @@ +;* +;* Copyright (C) 2013 x265 project +;* +;* Authors: Praveen Kumar Tiwari prav...@multicorewareinc.com +;* +;* This program is free software; you can redistribute it and/or modify +;* it under the terms of the GNU General Public License as published by +;* the Free Software Foundation; either version 2 of the License, or +;* (at your option) any later version. +;* +;* This program is distributed in the hope that it will be useful, +;* but WITHOUT ANY WARRANTY; without even the implied warranty of +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the +;* GNU General Public License for more details. +;* +;* You should have received a copy of the GNU General Public License +;* along with this program; if not, write to the Free Software +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. +;* +;* This program is also available under a commercial proprietary license. +;* For more information, contact us at licens...@multicorewareinc.com. +;*/ + +%include x86inc.asm +%include x86util.asm + +SECTION_RODATA 32 + +SECTION .text + +;- +; void pixel_add_ps_4x4(pixel *dest, int destride, pixel *src0, int16_t *scr1, int srcStride0, int srcStride1) +;- +INIT_XMM sse4 +cglobal pixel_add_ps_4x4, 6, 6, 2, dest, destride, src0, scr1, srcStride0
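For readers who want to check the assembly against plain code, here is a hedged C++ sketch of what an add_ps primitive with the signature quoted above is expected to compute: the 16-bit residual is added to the prediction and the sum is clamped to the pixel range. It assumes an 8-bit build (`pixel` as `uint8_t`) and clamping to 0..255; the name `pixel_add_ps_ref` is hypothetical and this is not the project's C reference.

```cpp
#include <algorithm>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// Reference sketch: dest = clamp(src0 + src1, 0, 255) for a bx-by-by block.
template<int bx, int by>
void pixel_add_ps_ref(pixel *dest, int destStride, const pixel *src0, const int16_t *src1,
                      int srcStride0, int srcStride1)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            int sum = src0[x] + src1[x];
            dest[x] = static_cast<pixel>(std::min(255, std::max(0, sum)));
        }

        dest += destStride;
        src0 += srcStride0;
        src1 += srcStride1;
    }
}
```

A testbench comparison would call `pixel_add_ps_ref<4, 4>` and the assembly routine on the same random inputs and compare the outputs byte for byte.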
Re: [x265] [PATCH] bug fix in blockcopy_pp_4x4
Please ignore this patch; the old code is also fine. The bug is somewhere else.

Regards,
Praveen Tiwari

On Tue, Nov 12, 2013 at 3:09 PM, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1384249182 -19800
# Node ID 40695de368b6c890fa27a08c8e5a277c9682149c
# Parent d5e30ab8c8b756dd5de2a6e8f455210cb517e28b
bug fix in blockcopy_pp_4x4

diff -r d5e30ab8c8b7 -r 40695de368b6 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Tue Nov 12 14:14:04 2013 +0530
+++ b/source/common/x86/blockcopy8.asm Tue Nov 12 15:09:42 2013 +0530
@@ -113,13 +113,13 @@
     movd m0, [r2]
     movd m1, [r2 + r3]
     movd m2, [r2 + 2 * r3]
-    lea r3, [r3 + r3 * 2]
+    lea r2, [r2 + 2 * r3]
     movd m3, [r2 + r3]

     movd [r0], m0
     movd [r0 + r1], m1
     movd [r0 + 2 * r1], m2
-    lea r1, [r1 + 2 * r1]
+    lea r0, [r0 + 2 * r1]
     movd [r0 + r1], m3

     RET
Re: [x265] [PATCH] asm code for blockcopy_ps, 8x6, 8x16 and 8x32
I mistyped one partition size in the commit message: instead of 8x6 it should be 8x8; the rest are correct.

Regards,
Praveen Tiwari

On Mon, Nov 11, 2013 at 2:58 PM, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1384162089 -19800
# Node ID 6da0a0291ed8d10dc3dfdb3df396cd1a8c74ceeb
# Parent da0b44e67fe07caa7ed113ec4946a371d96801be
asm code for blockcopy_ps, 8x6, 8x16 and 8x32

diff -r da0b44e67fe0 -r 6da0a0291ed8 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Mon Nov 11 14:36:21 2013 +0530
+++ b/source/common/x86/asm-primitives.cpp Mon Nov 11 14:58:09 2013 +0530
@@ -459,6 +459,9 @@
     p.chroma_copy_ps[CHROMA_8x2] = x265_blockcopy_ps_8x2_sse4;
     p.chroma_copy_ps[CHROMA_8x4] = x265_blockcopy_ps_8x4_sse4;
     p.chroma_copy_ps[CHROMA_8x6] = x265_blockcopy_ps_8x6_sse4;
+    p.chroma_copy_ps[CHROMA_8x8] = x265_blockcopy_ps_8x8_sse4;
+    p.chroma_copy_ps[CHROMA_8x16] = x265_blockcopy_ps_8x16_sse4;
+    p.chroma_copy_ps[CHROMA_8x32] = x265_blockcopy_ps_8x32_sse4;
     }
     if (cpuMask & X265_CPU_AVX)
     {
diff -r da0b44e67fe0 -r 6da0a0291ed8 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Mon Nov 11 14:36:21 2013 +0530
+++ b/source/common/x86/blockcopy8.asm Mon Nov 11 14:58:09 2013 +0530
@@ -1743,3 +1743,46 @@
     movu [r0 + r1], m0
     RET
+
+;-
+; void blockcopy_ps_%1x%2(int16_t *dest, intptr_t destStride, pixel *src, intptr_t srcStride);
+;-
+%macro BLOCKCOPY_PS_W8_H4 2
+INIT_XMM sse4
+cglobal blockcopy_ps_%1x%2, 4, 5, 1, dest, destStride, src, srcStride
+
+    add r1, r1
+    mov r4d, %2/4
+
+.loop
+    movh m0, [r2]
+    pmovzxbw m0, m0
+    movu [r0], m0
+
+    movh m0, [r2 + r3]
+    pmovzxbw m0, m0
+    movu [r0 + r1], m0
+
+    movh m0, [r2 + 2 * r3]
+    pmovzxbw m0, m0
+    movu [r0 + 2 * r1], m0
+
+    lea r2, [r2 + 2 * r3]
+    lea r0, [r0 + 2 * r1]
+
+    movh m0, [r2 + r3]
+    pmovzxbw m0, m0
+    movu [r0 + r1], m0
+
+    lea r0, [r0 + 2 * r1]
+    lea r2, [r2 + 2 * r3]
+
+    dec r4d
+    jnz .loop
+
+RET
+%endmacro
+
+BLOCKCOPY_PS_W8_H4 8, 8
+BLOCKCOPY_PS_W8_H4 8, 16
+BLOCKCOPY_PS_W8_H4 8, 32
diff -r da0b44e67fe0 -r 6da0a0291ed8 source/common/x86/blockcopy8.h
--- a/source/common/x86/blockcopy8.h Mon Nov 11 14:36:21 2013 +0530
+++ b/source/common/x86/blockcopy8.h Mon Nov 11 14:58:09 2013 +0530
@@ -96,7 +96,10 @@
 #define CHROMA_BLOCKCOPY_DEF_SSE4(cpu) \
     SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 2, cpu); \
     SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 4, cpu); \
-    SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 6, cpu);
+    SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 6, cpu); \
+    SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 8, cpu); \
+    SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 16, cpu); \
+    SETUP_CHROMA_BLOCKCOPY_FUNC_SSE4(8, 32, cpu);

 CHROMA_BLOCKCOPY_DEF_SSE4(_sse4);
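As a reading aid for the macro above, here is a hedged scalar sketch of the same operation: each 8-bit pixel is widened to a 16-bit value, which is what `pmovzxbw` does for eight pixels at a time. It assumes an 8-bit build (`pixel` as `uint8_t`); the name `blockcopy_ps_ref` is hypothetical.

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// Reference sketch: copy a bx-by-by block of pixels into int16_t storage.
// Strides are counted in elements here; the assembly doubles the destination
// stride (add r1, r1) because it works in bytes and each element is 2 bytes.
template<int bx, int by>
void blockcopy_ps_ref(int16_t *dest, intptr_t destStride, const pixel *src, intptr_t srcStride)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            dest[x] = static_cast<int16_t>(src[x]); // zero extension, values stay in 0..255

        dest += destStride;
        src += srcStride;
    }
}
```

The assembly handles four rows per loop iteration, which is why the loop counter is initialised to `%2/4` and the macro is only instantiated for heights that are multiples of four.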
Re: [x265] [PATCH] asm code for blockcopy_ps_16x4
Fixed.

Regards,
Praveen Tiwari

On Mon, Nov 11, 2013 at 4:06 PM, chen chenm...@163.com wrote:
+    movu m1, [r2]
+    punpcklbw m2, m1, m0
This has a hidden register copy; try to avoid it by using the SSE4.1 instruction: pmovzxbw m2, m1
+    movu [r0], m2
+    punpckhbw m1, m0
+    movu [r0 + 16], m1
Re: [x265] [PATCH] asm code for blockcopy_ps_2x4
Replaced.

Regards,
Praveen Tiwari

On Mon, Nov 11, 2013 at 7:02 PM, chen chenm...@163.com wrote:
+    movd m0, [r2]
+    pmovzxbw m0, m0
+    pextrd [r0], m0, 0
same as movd
Re: [x265] [PATCH] asm code for blockcopy_ps_24x32
Sent patch.

Regards,
Praveen Tiwari

On Mon, Nov 11, 2013 at 6:54 PM, chen chenm...@163.com wrote:
+;-
+; void blockcopy_ps_%1x%2(int16_t *dest, intptr_t destStride, pixel *src, intptr_t srcStride);
+;-
+%macro BLOCKCOPY_PS_W24_H2 2
+INIT_XMM sse4
+cglobal blockcopy_ps_%1x%2, 4, 5, 3, dest, destStride, src, srcStride
+
+    add r1, r1
+    mov r4d, %2/2
+    pxor m0, m0
+
+.loop
+    movu m1, [r2]
+    pmovzxbw m2, m1
+    movu [r0], m2
+    punpckhbw m1, m0
+    movu [r0 + 16], m1
+
+    movu m1, [r2 + 16]
movh
+    pmovzxbw m1, m1
+    movu [r0 + 32], m1
[x265] Fwd: [PATCH] blockcopy_sp_4x8, optimized asm code
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Nov 8, 2013 at 3:29 PM
Subject: Re: [x265] [PATCH] blockcopy_sp_4x8, optimized asm code
To: Development for x265 x265-devel@videolan.org

At 2013-11-08 17:34:19, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1383903250 -19800
# Node ID 1e6bf52b6e3471b81e636569daa667f6dec9838a
# Parent 44ac213169c906eab5cba6b4aba876391b81da99
blockcopy_sp_4x8, optimized asm code

diff -r 44ac213169c9 -r 1e6bf52b6e34 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Fri Nov 08 14:46:07 2013 +0530
+++ b/source/common/x86/blockcopy8.asm Fri Nov 08 15:04:10 2013 +0530
@@ -948,45 +948,42 @@
 ; void blockcopy_sp_4x8(pixel *dest, intptr_t destStride, int16_t *src, intptr_t srcStride)
 ;-
 INIT_XMM sse2
-cglobal blockcopy_sp_4x8, 4, 6, 8, dest, destStride, src, srcStride
+cglobal blockcopy_sp_4x8, 4, 4, 8, dest, destStride, src, srcStride

you have used r5

Min, r5 was only in the old code, and I have removed it. I think you are referring to the removed line [ -lea r5, [r4 + 2 * r3] ]. The new code uses just 4 registers.
[x265] Fwd: [PATCH] blockcopy_sp_8x2, optimized asm code
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Nov 8, 2013 at 4:30 PM
Subject: Re: [x265] [PATCH] blockcopy_sp_8x2, optimized asm code
To: Development for x265 x265-devel@videolan.org

+    movh [r0], m0
+    movhps [r0 + r1], m0

Changing movh to movlps is better; movh + movhps mixes the float and integer paths.

Will movh + movhps cause any problem? I thought movh would be faster.
[x265] Fwd: [PATCH] blockcopy_sp_16xN, optimized asm code
-- Forwarded message --
From: chen chenm...@163.com
Date: Fri, Nov 8, 2013 at 7:10 PM
Subject: Re: [x265] [PATCH] blockcopy_sp_16xN, optimized asm code
To: Development for x265 x265-devel@videolan.org

The code is right, but it needs to be uncrustified, e.g.: add r3, r3

Does uncrustify work for .asm files?

At 2013-11-08 21:32:05, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1383917516 -19800
# Node ID 662664f0863b38b838a15867745c5564f574fb09
# Parent 227a5666e08869d36e07a75f3db95dd94c774715
blockcopy_sp_16xN, optimized asm code

diff -r 227a5666e088 -r 662664f0863b source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Fri Nov 08 17:38:24 2013 +0530
+++ b/source/common/x86/blockcopy8.asm Fri Nov 08 19:01:56 2013 +0530
@@ -1325,51 +1325,38 @@
 ;-
 %macro BLOCKCOPY_SP_W16_H4 2
 INIT_XMM sse2
-cglobal blockcopy_sp_%1x%2, 4, 7, 7, dest, destStride, src, srcStride
+cglobal blockcopy_sp_%1x%2, 4, 5, 8, dest, destStride, src, srcStride
-    mov r6d, %2
+    mov r4d, %2/4
-    add r3, r3
-
-    mova m0, [tab_Vm]
+    add r3, r3
 .loop
-    movu m1, [r2]
-    movu m2, [r2 + 16]
-    movu m3, [r2 + r3]
-    movu m4, [r2 + r3 + 16]
-    movu m5, [r2 + 2 * r3]
-    movu m6, [r2 + 2 * r3 + 16]
+    movu m0, [r2]
+    movu m1, [r2 + 16]
+    movu m2, [r2 + r3]
+    movu m3, [r2 + r3 + 16]
+    movu m4, [r2 + 2 * r3]
+    movu m5, [r2 + 2 * r3 + 16]
+    lea r2, [r2 + 2 * r3]
+    movu m6, [r2 + r3]
+    movu m7, [r2 + r3 + 16]
-    pshufb m1, m0
-    pshufb m2, m0
-    pshufb m3, m0
-    pshufb m4, m0
-    pshufb m5, m0
-    pshufb m6, m0
+    packuswb m0, m1
+    packuswb m2, m3
+    packuswb m4, m5
+    packuswb m6, m7
-    movh [r0], m1
-    movh [r0 + 8], m2
-    movh [r0 + r1], m3
-    movh [r0 + r1 + 8], m4
-    movh [r0 + 2 * r1], m5
-    movh [r0 + 2 * r1 + 8], m6
+    movu [r0], m0
+    movu [r0 + r1], m2
+    movu [r0 + 2 * r1], m4
+    lea r0, [r0 + 2 * r1]
+    movu [r0 + r1], m6
-    lea r4, [r2 + 2 * r3]
-    movu m1, [r4 + r3]
-    movu m2, [r4 + r3 + 16]
+    lea r0, [r0 + 2 * r1]
+    lea r2, [r2 + 2 * r3]
-    pshufb m1, m0
-    pshufb m2, m0
-
-    lea r5, [r0 + 2 * r1]
-    movh [r5 + r1], m1
-    movh [r5 + r1 + 8], m2
-
-    lea r0, [r5 + 2 * r1]
-    lea r2, [r4 + 2 * r3]
-
-    sub r6d, 4
+    dec r4d
     jnz .loop
     RET
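The patch above switches from a `pshufb` byte shuffle to `packuswb`, which narrows signed 16-bit values to unsigned 8-bit values with saturation. A hedged scalar sketch of that behaviour follows. It assumes an 8-bit build (`pixel` as `uint8_t`); the names `saturateToPixel` and `blockcopy_sp_ref` are hypothetical and this is not the project's C reference.

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// Per-element behaviour of packuswb: signed 16-bit to unsigned 8-bit with saturation.
inline pixel saturateToPixel(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return static_cast<pixel>(v);
}

// Reference sketch: copy a bx-by-by block of int16_t values into pixel storage.
template<int bx, int by>
void blockcopy_sp_ref(pixel *dest, intptr_t destStride, const int16_t *src, intptr_t srcStride)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            dest[x] = saturateToPixel(src[x]);

        dest += destStride;
        src += srcStride;
    }
}
```

For inputs already in 0..255 the saturating pack and a plain low-byte extraction give the same bytes; the two only differ on out-of-range inputs, so a testbench restricted to in-range values would not distinguish them.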
[x265] Fwd: [PATCH] added pixelsub_ps C primitive and function pointer creation
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Thu, Nov 7, 2013 at 1:51 PM
Subject: Re: [x265] [PATCH] added pixelsub_ps C primitive and function pointer creation
To: Development for x265 x265-devel@videolan.org

On Thu, Nov 7, 2013 at 1:01 AM, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1383807695 -19800
# Node ID 34ba8955747b66dcf3471fa216d15b97a3b07e0c
# Parent 93cccbe49a93dd4c054ef06aca76974948793613
added pixelsub_ps C primitive and function pointer creation

diff -r 93cccbe49a93 -r 34ba8955747b source/common/pixel.cpp
--- a/source/common/pixel.cpp Wed Nov 06 19:49:38 2013 -0600
+++ b/source/common/pixel.cpp Thu Nov 07 12:31:35 2013 +0530
@@ -790,6 +790,22 @@
         b += strideb;
     }
 }
+
+template<int bx, int by>
+void pixelsub_ps_c(int16_t *a, intptr_t dstride, pixel *b0, pixel *b1, intptr_t sstride0, intptr_t sstride1)
+{
+    for (int y = 0; y < by; y++)
+    {
+        for (int x = 0; x < bx; x++)
+        {
+            a[x] = (int16_t)(b0[x] - b1[x]);
+        }
+
+        b0 += sstride0;
+        b1 += sstride1;
+        a += dstride;
+    }
+}
 } // end anonymous namespace
 namespace x265 {
@@ -832,10 +848,12 @@
 #define CHROMA(W, H) \
     p.chroma_copy_pp[CHROMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>; \
-    p.chroma_copy_sp[CHROMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;
+    p.chroma_copy_sp[CHROMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;\
+    p.chroma_pixelsub_ps[CHROMA_ ## W ## x ## H] = pixelsub_ps_c<W, H>;
 #define LUMA(W, H) \
     p.luma_copy_pp[LUMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>; \
-    p.luma_copy_sp[LUMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;
+    p.luma_copy_sp[LUMA_ ## W ## x ## H] = blockcopy_sp_c<W, H>;\
+    p.luma_pixelsub_ps[LUMA_ ## W ## x ## H] = pixelsub_ps_c<W, H>;
     LUMA(4, 4);
     LUMA(8, 8);
diff -r 93cccbe49a93 -r 34ba8955747b source/common/primitives.h
--- a/source/common/primitives.h Wed Nov 06 19:49:38 2013 -0600
+++ b/source/common/primitives.h Thu Nov 07 12:31:35 2013 +0530
@@ -216,6 +216,8 @@
 typedef void (*copy_pp_t)(pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride); // dst is aligned
 typedef void (*copy_sp_t)(pixel *dst, intptr_t dstStride, int16_t *src, intptr_t srcStride);
+typedef void (*pixelsub_ps_t)(int16_t *dst, intptr_t dstStride, pixel *src0, pixel *src1, intptr_t srcStride0, intptr_t srcStride1);

there's already a function typedef with the same name, that one needs to be removed or this one needs to be renamed

I can only see pixelsub_sp_t among the old function typedefs; the one I have created is pixelsub_ps_t (pixel to short).

+
 /* Define a structure containing function pointers to optimized encoder
  * primitives. Each pointer can reference either an assembly routine,
  * a vectorized primitive, or a C function. */
@@ -283,6 +285,9 @@
     pixeladd_pp_t pixeladd_pp;
     pixelavg_pp_t pixelavg_pp[NUM_LUMA_PARTITIONS];
+    pixelsub_ps_t chroma_pixelsub_ps[NUM_CHROMA_PARTITIONS];
+    pixelsub_ps_t luma_pixelsub_ps[NUM_LUMA_PARTITIONS];
+
     scale_t scale1D_128to64;
     scale_t scale2D_64to32;
     downscale_t frame_init_lowres_core;

--
Steve Borho
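The patch above pairs a templated C primitive with a function-pointer typedef. The sketch below shows how the two fit together in a self-contained form. It assumes an 8-bit build (`pixel` as `uint8_t`), and the name `pixelsub_ps_ref` is hypothetical, chosen to avoid suggesting that this is the committed code.

```cpp
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit build

// Pointer type with the same parameter list as the typedef in the patch.
typedef void (*pixelsub_ps_t)(int16_t *dst, intptr_t dstStride, pixel *src0, pixel *src1,
                              intptr_t srcStride0, intptr_t srcStride1);

// Residual sketch: a = b0 - b1 for a bx-by-by block, pixel inputs, int16_t output.
template<int bx, int by>
void pixelsub_ps_ref(int16_t *a, intptr_t dstride, pixel *b0, pixel *b1,
                     intptr_t sstride0, intptr_t sstride1)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            a[x] = static_cast<int16_t>(b0[x] - b1[x]);

        b0 += sstride0;
        b1 += sstride1;
        a += dstride;
    }
}

// Each template instantiation is an ordinary function, so it can be stored in a table slot.
static const pixelsub_ps_t kSub4x4 = pixelsub_ps_ref<4, 4>;
```

Because the block size is a template parameter, the compiler can unroll the loops for each partition, while callers still go through a uniform pointer type that an assembly routine can later replace.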
Re: [x265] [PATCH] asm code for blockfil_s, 16x16
Applied to code.

Regards,
Praveen Tiwari

On Thu, Nov 7, 2013 at 8:09 PM, chen chenm...@163.com wrote:
+    mov r3d, %2
%2/8
+
+    sub r3d, 8
+    jnz .loop
dec r3d
[x265] Fwd: [PATCH] asm code for blockfil_s, 4x4
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: 2013/11/8
Subject: Re: [x265] [PATCH] asm code for blockfil_s, 4x4
To: Development for x265 x265-devel@videolan.org

On Thu, Nov 7, 2013 at 6:56 AM, prav...@multicorewareinc.com wrote:
# HG changeset patch
# User Praveen Tiwari
# Date 1383828996 -19800
# Node ID f2af7af43dfcb08135a08e755f654314a89efae7
# Parent d71f86b1c58b4fc9f8a3ffeaaef45c60f8bcc468
asm code for blockfil_s, 4x4

blockfill has two l

Actually, I named all the pointers blockfill (two l's) and the functions blockfil (one l), perhaps to match the naming convention of the old code, but it does look odd. I will take care of it.

diff -r d71f86b1c58b -r f2af7af43dfc source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Thu Nov 07 18:16:22 2013 +0530
+++ b/source/common/x86/asm-primitives.cpp Thu Nov 07 18:26:36 2013 +0530
@@ -361,6 +361,8 @@
     p.luma_copy_sp[LUMA_64x32] = x265_blockcopy_sp_64x32_sse2;
     p.luma_copy_sp[LUMA_64x48] = x265_blockcopy_sp_64x48_sse2;
     p.luma_copy_sp[LUMA_64x64] = x265_blockcopy_sp_64x64_sse2;
+
+    p.blockfill_s[BLOCK_4x4] = x265_blockfil_s_4x4_sse2;
 #if X86_64
     p.satd[LUMA_8x32] = x265_pixel_satd_8x32_sse2;
     p.satd[LUMA_16x4] = x265_pixel_satd_16x4_sse2;
diff -r d71f86b1c58b -r f2af7af43dfc source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm Thu Nov 07 18:16:22 2013 +0530
+++ b/source/common/x86/blockcopy8.asm Thu Nov 07 18:26:36 2013 +0530
@@ -1646,3 +1646,22 @@
 BLOCKCOPY_SP_W64_H1 64, 32
 BLOCKCOPY_SP_W64_H1 64, 48
 BLOCKCOPY_SP_W64_H1 64, 64
+
+;-
+; void blockfil_s_4x4(int16_t *dest, intptr_t destride, int16_t val)
+;-
+INIT_XMM sse2
+cglobal blockfil_s_4x4, 3, 3, 1, dest, destStride, val
+
+    add r1, r1
+
+    movd m0, r2d
+    pshuflw m0, m0, 0
+
+    movh [r0], m0
+    movh [r0 + r1], m0
+    movh [r0 + 2 * r1], m0
+    lea r0, [r0 + 2 * r1]
+    movh [r0 + r1], m0
+
+RET
diff -r d71f86b1c58b -r f2af7af43dfc source/common/x86/pixel.h
--- a/source/common/x86/pixel.h Thu Nov 07 18:16:22 2013 +0530
+++ b/source/common/x86/pixel.h Thu Nov 07 18:26:36 2013 +0530
@@ -266,6 +266,8 @@
 DECL_ADS(2, avx2)
 DECL_ADS(1, avx2)
+void x265_blockfil_s_4x4_sse2(int16_t *dst, intptr_t dstride, int16_t val);
+

this belongs in blockcopy8.h

It will be moved to blockcopy8.h.

 #undef DECL_PIXELS
 #undef DECL_SUF
 #undef DECL_HEVC_SSD

--
Steve Borho
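A hedged scalar equivalent of the 4x4 fill routine in the patch above can serve as a cross-check. It follows the signature in the patch comment, with the stride counted in int16_t elements; the name `blockfill_s_ref` is hypothetical.

```cpp
#include <cstdint>

// Reference sketch: write val into every element of a size-by-size block of int16_t.
// Elements between the block width and the stride are left untouched.
template<int size>
void blockfill_s_ref(int16_t *dst, intptr_t dstride, int16_t val)
{
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
            dst[x] = val;

        dst += dstride;
    }
}
```

In the assembly, `pshuflw m0, m0, 0` broadcasts the value into the low four words of the register, so one `movh` store writes a whole 4-element row.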
Re: [x265] [PATCH] asm code for blockcopy_sp, 6x8
Fixed.

Regards,
Praveen Tiwari

On Wed, Nov 6, 2013 at 8:09 PM, chen chenm...@163.com wrote:
+    movd [r0 + 2 * r1], m3
+    pextrw r6, m3, 2
+    mov [r0 + 2 * r1 + 4], r6w
SSE4.1 supports the form below:
pextrw [r0 + 2 * r1 + 4], m3, 2
Re: [x265] [PATCH] asm: assembly code for pixel_sad_12x16
-- Forwarded message --
From: dnyanesh...@multicorewareinc.com
Date: Wed, Oct 30, 2013 at 7:47 PM
Subject: [x265] [PATCH] asm: assembly code for pixel_sad_12x16
To: x265-devel@videolan.org

# HG changeset patch
# User Dnyaneshwar Gorade dnyanesh...@multicorewareinc.com
# Date 1383142575 -19800
# Wed Oct 30 19:46:15 2013 +0530
# Node ID 5037cc891114619e32ceeff332884d0abfd138fd
# Parent 62a51fe2fcbfd76fc8476a6f714f961b3f3f23ef
asm: assembly code for pixel_sad_12x16

diff -r 62a51fe2fcbf -r 5037cc891114 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Wed Oct 30 18:11:01 2013 +0530
+++ b/source/common/x86/asm-primitives.cpp Wed Oct 30 19:46:15 2013 +0530
@@ -253,6 +253,7 @@
     p.sad[LUMA_48x64] = x265_pixel_sad_48x64_sse2;
     p.sad[LUMA_24x32] = x265_pixel_sad_24x32_sse2;
+    p.sad[LUMA_12x16] = x265_pixel_sad_12x16_sse2;
     ASSGN_SSE(sse2);
     INIT2(sad, _sse2);
diff -r 62a51fe2fcbf -r 5037cc891114 source/common/x86/pixel.h
--- a/source/common/x86/pixel.h Wed Oct 30 18:11:01 2013 +0530
+++ b/source/common/x86/pixel.h Wed Oct 30 19:46:15 2013 +0530
@@ -53,6 +53,7 @@
     ret x265_pixel_ ## name ## _64x64_ ## suffix args; \
     ret x265_pixel_ ## name ## _48x64_ ## suffix args; \
     ret x265_pixel_ ## name ## _24x32_ ## suffix args; \
+    ret x265_pixel_ ## name ## _12x16_ ## suffix args; \
 #define DECL_X1(name, suffix) \
     DECL_PIXELS(int, name, suffix, (pixel *, intptr_t, pixel *, intptr_t))
diff -r 62a51fe2fcbf -r 5037cc891114 source/common/x86/sad-a.asm
--- a/source/common/x86/sad-a.asm Wed Oct 30 18:11:01 2013 +0530
+++ b/source/common/x86/sad-a.asm Wed Oct 30 19:46:15 2013 +0530
@@ -31,8 +31,9 @@
 SECTION_RODATA 32
+MSK: db 255,255,255,255,255,255,255,255,255,255,255,255,0,0,0,0
 pb_shuf8x8c2: times 2 db 0,0,0,0,8,8,8,8,-1,-1,-1,-1,-1,-1,-1,-1
-hpred_shuf: db 0,0,2,2,8,8,10,10,1,1,3,3,9,9,11,11
+hpred_shuf: db 0,0,2,2,8,8,10,10,1,1,3,3,9,9,11,11
 SECTION .text
@@ -119,6 +120,39 @@
     RET
 %endmacro
+%macro PROCESS_SAD_12x4 0
+    movu m1, [r2]
+    movu m2, [r0]
+    pand m1, m4
+    pand m2, m4
+    psadbw m1, m2
+    paddd m0, m1
+    lea r2, [r2 + r3]
+    lea r0, [r0 + r1]
+    movu m1, [r2]
+    movu m2, [r0]
+    pand m1, m4
+    pand m2, m4
+    psadbw m1, m2
+    paddd m0, m1
+    lea r2, [r2 + r3]
+    lea r0, [r0 + r1]
+    movu m1, [r2]
+    movu m2, [r0]

We don't need to recompute the address every time we add the stride to it. We should first try to form the address with a scale factor of 1, 2, 4 or 8, and update the pointer only when that is not possible. For example, the four instructions above can be replaced with just these two:

    movu m1, [r2 + 2 * r3]
    movu m2, [r0 + 2 * r1]

+    pand m1, m4
+    pand m2, m4
+    psadbw m1, m2
+    paddd m0, m1
+    lea r2, [r2 + r3]
+    lea r0, [r0 + r1]
+    movu m1, [r2]
+    movu m2, [r0]
+    pand m1, m4
+    pand m2, m4
+    psadbw m1, m2
+    paddd m0, m1
+%endmacro
+
 %macro PROCESS_SAD_16x4 0
     movu m1, [r2]
     movu m2, [r2 + r3]
@@ -1007,6 +1041,29 @@
     movd eax, m0
     RET
+;-
+; int pixel_sad_12x16( uint8_t *, intptr_t, uint8_t *, intptr_t )
+;-
+cglobal pixel_sad_12x16, 4,4,4
+    mova m4, [MSK]
+    pxor m0, m0
+
+    PROCESS_SAD_12x4
+    lea r2, [r2 + r3]
+    lea r0, [r0 + r1]
+    PROCESS_SAD_12x4
+    lea r2, [r2 + r3]
+    lea r0, [r0 + r1]
+    PROCESS_SAD_12x4
+    lea r2, [r2 + r3]
+    lea r0, [r0 + r1]
+    PROCESS_SAD_12x4
+
+    movhlps m1, m0
+    paddd m0, m1
+    movd eax, m0
+    RET
+
 %endmacro

The lea instruction is overused; please eliminate these, and use the available registers to save load operations.
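The masked 16-byte loads in the patch above compute a sum of absolute differences over only the first 12 columns of each row. A hedged scalar sketch of the same result is shown here for comparison. It assumes an 8-bit build (`pixel` as `uint8_t`); the name `sad_ref` is hypothetical.

```cpp
#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel; // assumption: 8-bit build

// Reference sketch: sum of absolute differences over an lx-by-ly block.
// Columns at index lx and beyond are never read, which is the effect the
// assembly gets by masking bytes 12..15 of each 16-byte load with MSK.
template<int lx, int ly>
int sad_ref(const pixel *pix1, intptr_t stride1, const pixel *pix2, intptr_t stride2)
{
    int sum = 0;

    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
            sum += std::abs(pix1[x] - pix2[x]);

        pix1 += stride1;
        pix2 += stride2;
    }

    return sum;
}
```

Instantiating `sad_ref<12, 16>` on buffers with a stride of 16 gives a value that a testbench could compare against the SSE2 routine, including cases where the four ignored columns differ.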
[x265] Fwd: [PATCH] assembly code for pixel_sad_x3_24x32
-- Forwarded message --
From: yuva...@multicorewareinc.com
Date: Wed, Oct 30, 2013 at 2:38 PM
Subject: [x265] [PATCH] assembly code for pixel_sad_x3_24x32
To: x265-devel@videolan.org

# HG changeset patch
# User Yuvaraj Venkatesh yuva...@multicorewareinc.com
# Date 1383124045 -19800
#      Wed Oct 30 14:37:25 2013 +0530
# Node ID eca1142d1cec9303afad71108494f9076586ce05
# Parent  65462024832b4498cd9f05a5a81cb6b559bf378b
assembly code for pixel_sad_x3_24x32

diff -r 65462024832b -r eca1142d1cec source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/x86/asm-primitives.cpp Wed Oct 30 14:37:25 2013 +0530
@@ -292,6 +292,7 @@
         p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_ssse3;
         p.sad_x3[LUMA_16x64] = x265_pixel_sad_x3_16x64_ssse3;
         p.sad_x4[LUMA_16x64] = x265_pixel_sad_x4_16x64_ssse3;
+        p.sad_x3[LUMA_24x32] = x265_pixel_sad_x3_24x32_ssse3;
 
         p.luma_hvpp[LUMA_8x8] = x265_interp_8tap_hv_pp_8x8_ssse3;
         p.ipfilter_sp[FILTER_V_S_P_8] = x265_interp_8tap_v_sp_ssse3;
@@ -325,6 +326,7 @@
         p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_avx;
         p.sad_x3[LUMA_16x64] = x265_pixel_sad_x3_16x64_avx;
         p.sad_x4[LUMA_16x64] = x265_pixel_sad_x4_16x64_avx;
+        p.sad_x3[LUMA_24x32] = x265_pixel_sad_x3_24x32_avx;
     }
     if (cpuMask & X265_CPU_XOP)
     {
diff -r 65462024832b -r eca1142d1cec source/common/x86/pixel.h
--- a/source/common/x86/pixel.h Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/x86/pixel.h Wed Oct 30 14:37:25 2013 +0530
@@ -47,6 +47,7 @@
     ret x265_pixel_ ## name ## _32x24_ ## suffix args; \
     ret x265_pixel_ ## name ## _32x32_ ## suffix args; \
     ret x265_pixel_ ## name ## _32x64_ ## suffix args; \
+    ret x265_pixel_ ## name ## _24x32_ ## suffix args; \
 
 #define DECL_X1(name, suffix) \
     DECL_PIXELS(int, name, suffix, (pixel *, intptr_t, pixel *, intptr_t))
diff -r 65462024832b -r eca1142d1cec source/common/x86/sad-a.asm
--- a/source/common/x86/sad-a.asm Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/x86/sad-a.asm Wed Oct 30 14:37:25 2013 +0530
@@ -1988,6 +1988,117 @@
     RET
 %endmacro
 
+%macro SAD_X3_24x4 0
+    mova    m3, [r0]
+    mova    m4, [r0 + 16]
+    movu    m5, [r1]
+    movu    m6, [r1 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m0, m5
+    movu    m5, [r2]
+    movu    m6, [r2 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m1, m5
+    movu    m5, [r3]
+    movu    m6, [r3 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m2, m5
+    lea     r0, [r0 + FENC_STRIDE]
+    lea     r1, [r1 + r4]
+    lea     r2, [r2 + r4]
+    lea     r3, [r3 + r4]
+    mova    m3, [r0]
+    mova    m4, [r0 + 16]
+    movu    m5, [r1]
+    movu    m6, [r1 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m0, m5
+    movu    m5, [r2]
+    movu    m6, [r2 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m1, m5
+    movu    m5, [r3]
+    movu    m6, [r3 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m2, m5
+    lea     r0, [r0 + FENC_STRIDE]
+    lea     r1, [r1 + r4]
+    lea     r2, [r2 + r4]
+    lea     r3, [r3 + r4]
+    mova    m3, [r0]
+    mova    m4, [r0 + 16]
+    movu    m5, [r1]
+    movu    m6, [r1 + 16]

You don't need to update the address every time. You can compute it in the addressing mode, like

    mova    m4, [r0 + 2 * r4]
    mova    m4, [r0 + 4 * r4]
    mova    m4, [r0 + 8 * r4]

or even like

    mova    m4, [r0 + 2 * r4 + constant]

Use this concept to eliminate the lea instructions. Scale factors of 1, 2, 4 and 8 are allowed.

+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m0, m5
+    movu    m5, [r2]
+    movu    m6, [r2 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m1, m5
+    movu    m5, [r3]
+    movu    m6, [r3 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m2, m5
+    lea     r0, [r0 + FENC_STRIDE]
+    lea     r1, [r1 + r4]
+    lea     r2, [r2 + r4]
+    lea     r3, [r3 + r4]
+    mova    m3, [r0]
+    mova    m4, [r0 + 16]
+    movu    m5, [r1]
+    movu    m6, [r1 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m0, m5
+    movu    m5, [r2]
+    movu    m6, [r2 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6, 84
+    paddd   m5, m6
+    paddd   m1, m5
+    movu    m5, [r3]
+    movu    m6, [r3 + 16]
+    psadbw  m5, m3
+    psadbw  m6, m4
+    pshufd  m6, m6,
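A scalar model of the three-reference SAD may help when checking a kernel like the one above. It is a sketch under stated assumptions: the function name and argument order are hypothetical, 8-bit pixels are assumed, and the encoder block is assumed to use a fixed stride of 64 (the role FENC_STRIDE plays in the assembly). As far as I can tell, `pshufd m6, m6, 84` keeps only the low psadbw sum of the second 16-byte load, so columns 24 to 31 drop out; the inner loop bound of 24 mirrors that.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Assumption: encoder blocks use a fixed stride of 64 bytes.
static const intptr_t kFencStride = 64;

// Hypothetical scalar reference: SAD of one 24x32 encoder block against
// three candidate reference blocks that share a single stride.
void sad_x3_24x32_ref(const uint8_t* fenc, const uint8_t* ref0, const uint8_t* ref1,
                      const uint8_t* ref2, intptr_t refStride, int32_t res[3])
{
    assert(fenc && ref0 && ref1 && ref2 && res);
    const uint8_t* refs[3] = { ref0, ref1, ref2 };
    for (int i = 0; i < 3; i++)
    {
        int32_t sum = 0;
        for (int y = 0; y < 32; y++)
            for (int x = 0; x < 24; x++)   // columns 24..31 must not contribute
                sum += std::abs(fenc[y * kFencStride + x] - refs[i][y * refStride + x]);
        res[i] = sum;
    }
}
```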
[x265] Fwd: [PATCH 4 of 4] asm: interp_8tap_v_sp for ipfilter_sp[FILTER_V_S_P_8]
-- Forwarded message --
From: Steve Borho st...@borho.org
Date: Mon, Oct 28, 2013 at 11:55 PM
Subject: Re: [x265] [PATCH 4 of 4] asm: interp_8tap_v_sp for ipfilter_sp[FILTER_V_S_P_8]
To: Development for x265 x265-devel@videolan.org

On Mon, Oct 28, 2013 at 9:24 AM, Min Chen chenm...@163.com wrote:

# HG changeset patch
# User Min Chen chenm...@163.com
# Date 1382970234 -28800
# Node ID 41425f18efe14be468715bfa68fdebbb9a49145f
# Parent  5f7b3d06d94c6aec44bfd4a7bfb6f6751182b4ed
asm: interp_8tap_v_sp for ipfilter_sp[FILTER_V_S_P_8]

I'm getting link errors on x86_64 from this series:

error LNK2017: 'ADDR32' relocation to 'tab_LumaCoeffV' invalid without /LARGEADDRESSAWARE:NO

This error is caused by the [register + global_constant] addressing form, which 64-bit builds do not support. I generally protect it with the PIC macro, like:

%ifdef PIC
    lea     r5, [tab_ChromaCoeff]
    movd    m0, [r5 + r4 * 4]
%else
    movd    m0, [tab_ChromaCoeff + r4 * 4]
%endif

In general, I think we should drop all of the interpolation merging while we get all the assembly completed for motion compensation. When the assembly is all together, we can experiment and figure out if it makes sense to re-merge some of them back together.

diff -r 5f7b3d06d94c -r 41425f18efe1 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp Mon Oct 28 22:23:29 2013 +0800
+++ b/source/common/x86/asm-primitives.cpp Mon Oct 28 22:23:54 2013 +0800
@@ -280,6 +280,7 @@
         p.sad_x4[LUMA_16x32] = x265_pixel_sad_x4_16x32_ssse3;
 
         p.luma_hvpp[LUMA_8x8] = x265_interp_8tap_hv_pp_8x8_ssse3;
+        p.ipfilter_sp[FILTER_V_S_P_8] = x265_interp_8tap_v_sp_ssse3;
     }
     if (cpuMask & X265_CPU_SSE4)
     {
diff -r 5f7b3d06d94c -r 41425f18efe1 source/common/x86/ipfilter8.asm
--- a/source/common/x86/ipfilter8.asm Mon Oct 28 22:23:29 2013 +0800
+++ b/source/common/x86/ipfilter8.asm Mon Oct 28 22:23:54 2013 +0800
@@ -774,3 +774,114 @@
     jnz     .loopV
     RET
+
+
+;-----------------------------------------------------------------------------
+; void interp_8tap_v_sp(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, const int coeffIdx);
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal interp_8tap_v_sp, 4, 7, 8, 0-(2*4 + 3*gprsize)
+%define old_r0      (rsp + 2 * 4 + 0 * gprsize)
+%define old_r2      (rsp + 2 * 4 + 1 * gprsize)
+%define old_r3      (rsp + 2 * 4 + 2 * gprsize)
+%define old_r4d     (rsp + 0 * 4)
+%define old_6rows   (rsp + 1 * 4)
+
+    mov         r4d, r4m
+    mov         r5d, r5m
+
+    ; load coeff table
+    mov         r6d, r6m
+    shl         r6, 6
+    lea         r6, [tab_LumaCoeffV + r6]
+
+    mov         [old_r4d], r4d
+    mov         [old_r2], r2
+
+    ; move to -3
+    lea         r1, [r1 * 2]
+    lea         r4, [r1 + r1 * 2]
+    sub         r0, r4
+    lea         r4, [r4 * 2]
+    mov         [old_6rows], r4
+
+.loopH:
+
+    ; load width
+    mov         r4d, [old_r4d]
+
+    ; save old src
+    mov         [old_r0], r0
+
+.loopW:
+
+    movu        m0, [r0]
+    movu        m1, [r0 + r1]
+    lea         r0, [r0 + r1 * 2]
+    punpcklwd   m2, m0, m1
+    pmaddwd     m2, [r6 + 0 * 16]
+    punpckhwd   m0, m1
+    pmaddwd     m0, [r6 + 0 * 16]
+
+    movu        m3, [r0]
+    movu        m4, [r0 + r1]
+    lea         r0, [r0 + r1 * 2]
+    punpcklwd   m1, m3, m4
+    pmaddwd     m1, [r6 + 1 * 16]
+    paddd       m2, m1
+    punpckhwd   m3, m4
+    pmaddwd     m3, [r6 + 1 * 16]
+    paddd       m0, m3
+
+    movu        m3, [r0]
+    movu        m4, [r0 + r1]
+    lea         r0, [r0 + r1 * 2]
+    punpcklwd   m1, m3, m4
+    pmaddwd     m1, [r6 + 2 * 16]
+    paddd       m2, m1
+    punpckhwd   m3, m4
+    pmaddwd     m3, [r6 + 2 * 16]
+    paddd       m0, m3
+
+    movu        m3, [r0]
+    movu        m4, [r0 + r1]
+    punpcklwd   m1, m3, m4
+    pmaddwd     m1, [r6 + 3 * 16]
+    paddd       m2, m1
+    punpckhwd   m3, m4
+    pmaddwd     m3, [r6 + 3 * 16]
+    paddd       m0, m3
+
+    paddd       m2, [tab_c_526336]
+    paddd       m0, [tab_c_526336]
+    psrad       m2, 12
+    psrad       m0, 12
+    packssdw    m2, m0
+    packuswb    m2, m2
+
+    ; move to next 8 col
+    sub         r0, [old_6rows]
+
+    sub         r4, 8
+    jl          .width4
+    movq        [r2], m2
+    je          .nextH
+    lea         r0, [r0 + 16]
+    lea         r2, [r2 + 8]
+    jmp         .loopW
+
+.width4:
+    movd        [r2], m2
+    lea         r0, [r0 + 4]
+
+.nextH:
+    ; move to next row
+    mov         r0, [old_r0]
+    lea         r0, [r0 + r1]
+    add         [old_r2], r3d
+    mov         r2, [old_r2]
+
+    dec         r5d
+    jnz         .loopH
+
+    RET
diff -r
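The rounding constant 526336 and the shift of 12 used in the patch can be reproduced from the usual interpolation constants. The sketch below assumes 8-bit output, IF_INTERNAL_PREC = 14 and IF_FILTER_PREC = 6 (these values are assumptions, not stated in the thread), and `interp_8tap_v_sp_pixel` is a hypothetical one-pixel model, not the x265 primitive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed constants (not stated in the thread): 14-bit intermediates,
// 6-bit filter precision, 8-bit output pixels.
constexpr int IF_INTERNAL_PREC = 14;
constexpr int IF_FILTER_PREC   = 6;
constexpr int IF_INTERNAL_OFFS = 1 << (IF_INTERNAL_PREC - 1);   // 8192
constexpr int X265_DEPTH       = 8;

constexpr int kShift  = IF_FILTER_PREC + (IF_INTERNAL_PREC - X265_DEPTH);           // 12
constexpr int kOffset = (1 << (kShift - 1)) + (IF_INTERNAL_OFFS << IF_FILTER_PREC); // 2048 + 524288
static_assert(kShift == 12, "matches psrad by 12 in the patch");
static_assert(kOffset == 526336, "matches tab_c_526336 in the patch");

// One output pixel of an 8-tap vertical short-to-pixel filter.
// src points at the centre row; rows -3..+4 are read.
uint8_t interp_8tap_v_sp_pixel(const int16_t* src, intptr_t srcStride, const int16_t coeff[8])
{
    assert(src != nullptr && coeff != nullptr);
    int sum = 0;
    for (int t = 0; t < 8; t++)
        sum += src[(t - 3) * srcStride] * coeff[t];
    // Relies on arithmetic right shift for negative values (true on mainstream compilers).
    int val = (sum + kOffset) >> kShift;
    return static_cast<uint8_t>(std::min(255, std::max(0, val)));   // same clamp as packuswb
}
```

The second term of the offset undoes the bias of 8192 carried by the 16-bit intermediates, which is why the constant looks so large compared with a plain rounding term.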
[x265] Fwd: [PATCH] check_IPFilterChroma_primitive, stride made equal to min width 2, fix for 2XN block
I tried using stride 64 for both the source and dest buffers, which is perfectly reasonable, and the 2xN primitives failed their unit test, which tells me they need to be fixed prior to using them in the encoder.

Sent a patch with the fix.
[x265] Fwd: [PATCH] Added C primitive and unit test code for chroma filter
+template<int N, int width>
+void interp_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int height, int coeffIdx)
+{
+    int cStride = 1;
+    short const * coeff = g_chromaFilter[coeffIdx];
+    src -= (N / 2 - 1) * cStride;
+    coeffIdx;
+    int offset;
+    short maxVal;
+    int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
+    offset = (1 << (headRoom - 1));
+    maxVal = (1 << X265_DEPTH) - 1;
+
+    int row, col;
+    for (row = 0; row < height; row++)
+    {
+        for (col = 0; col < width; col++)
+        {
+            int sum;
+
+            sum = src[col + 0 * cStride] * coeff[0];
+            sum += src[col + 1 * cStride] * coeff[1];
+            if (N >= 4)
+            {
+                sum += src[col + 2 * cStride] * coeff[2];
+                sum += src[col + 3 * cStride] * coeff[3];
+            }

the N >= 6 check seems out of place, unless we're going to instantiate a 7tap filter

Actually, I wanted to add a single C primitive for both chroma and luma. That is why I did not change the check conditions: they will be required in the luma functions.

+            if (N >= 6)
+            {
+                sum += src[col + 4 * cStride] * coeff[4];
+                sum += src[col + 5 * cStride] * coeff[5];
+            }
+            if (N == 8)
+            {
+                sum += src[col + 6 * cStride] * coeff[6];
+                sum += src[col + 7 * cStride] * coeff[7];
+            }
+            short val = (short)(sum + offset) >> headRoom;
+
+            if (val < 0) val = 0;
+            if (val > maxVal) val = maxVal;
+            dst[col] = (pixel)val;
+        }
+
+        src += srcStride;
+        dst += dstStride;
+    }
+}
 }
 
 namespace x265 {
diff -r 1087f1f3bf5a -r 39fc3c36e1b1 source/test/ipfilterharness.cpp
--- a/source/test/ipfilterharness.cpp Tue Oct 15 20:57:54 2013 +0530
+++ b/source/test/ipfilterharness.cpp Tue Oct 15 21:22:03 2013 +0530
@@ -3,6 +3,7 @@
  *
  * Authors: Deepthi Devaki deepthidev...@multicorewareinc.com,
  *          Rajesh Paulraj raj...@multicorewareinc.com
+ *          Praveen Kumar Tiwari prav...@multicorewareinc.com
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -39,6 +40,18 @@
     "ipfilterV_pp4"
 };
 
+const char* ChromaFilterPPNames[] =
+{
+    "interp_4tap_horiz_pp_w2",
+    "interp_4tap_horiz_pp_w4",
+    "interp_4tap_horiz_pp_w6",
+    "interp_4tap_horiz_pp_w8",
+    "interp_4tap_horiz_pp_w12",
+    "interp_4tap_horiz_pp_w16",
+    "interp_4tap_horiz_pp_w24",
+    "interp_4tap_horiz_pp_w32"
+};

the names should correspond with the chroma size enums, which only specify a width. This string table should be re-usable for more than just 4tap horizontal pixel to pixel interpolation. Each element should just be "W2" or something similar so it can be used as:

    printf("chroma_hpp[%s]: ", ChromaFilterName[w]);

+
 IPFilterHarness::IPFilterHarness()
 {
     ipf_t_size = 200 * 200;
@@ -262,6 +275,47 @@
     return true;
 }
 
+bool IPFilterHarness::check_IPFilter_primitive(filter_pp_t ref, filter_pp_t opt)

there needs to be chroma and luma versions of this function for the two filter lengths, or pass filter length as an argument

+{
+    int rand_height = rand() % 100;                 // Randomly generated Height

I don't see a point to testing any sizes not used by the encoder; this just prevents possible optimizations in the primitive. Primitives that have fixed dimensions should be tested with those fixed dimensions used by the encoder.

+    int rand_val, rand_srcStride, rand_dstStride, rand_coeffIdx;
+
+    for (int i = 0; i <= 100; i++)
+    {
+        memset(IPF_vec_output_p, 0, ipf_t_size);    // Initialize output buffer to zero
+        memset(IPF_C_output_p, 0, ipf_t_size);      // Initialize output buffer to zero

is memzero really necessary here?

I don't think so.

+
+        rand_coeffIdx = rand() % 8;                 // Random coeffIdex in the filter

chroma coeff index should be 1, 2, or 3 I think

The chroma table is

    const short g_chromaFilter[8][NTAPS_CHROMA] =
    {
        { 0, 64, 0, 0 },
        { -2, 58, 10, -2 },
        { -4, 54, 16, -2 },
        { -6, 46, 28, -4 },
        { -4, 36, 36, -4 },
        { -4, 28, 46, -6 },
        { -2, 16, 54, -4 },
        { -2, 10, 58, -2 }
    };

and the assembly coeff table is laid out in the same fashion, so I need coeffIdx values 0 to 7.

+        rand_val = rand() % 4;                      // Random offset in the filter

rand_val is unused

+        rand_srcStride = rand() % 100;              // Randomly generated srcStride
+        rand_dstStride = rand() % 100;              // Randomly generated dstStride
+
+        if (rand_srcStride < 32)
+            rand_srcStride = 32;
+
+        if (rand_dstStride < 32)
+            rand_dstStride = 32;
+
+        opt(pixel_buff + 3 * rand_srcStride,
+            rand_srcStride,
+
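The review points above (fixed block dimensions, a plain stride of 64, coeffIdx covering 0 to 7) can be put together in a small self-contained check. Everything here is a sketch: `check_chroma_hpp` and the `filter_pp_t` signature with const source pointers are hypothetical, the filter is specialised to 4 taps and 8-bit pixels, and the shift of 6 is an assumption for that bit depth.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef uint8_t pixel;
typedef void (*filter_pp_t)(const pixel* src, intptr_t srcStride, pixel* dst,
                            intptr_t dstStride, int height, int coeffIdx);

// Chroma coefficient table as quoted in the thread.
static const int16_t g_chromaFilter[8][4] =
{
    { 0, 64, 0, 0 },   { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

// 4-tap horizontal pixel-to-pixel filter, 8-bit pixels (shift 6, rounding 32 assumed).
template<int width>
void interp_4tap_horiz_pp(const pixel* src, intptr_t srcStride, pixel* dst,
                          intptr_t dstStride, int height, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    const int16_t* c = g_chromaFilter[coeffIdx];
    src -= 1;                                   // N / 2 - 1 pixels to the left for N = 4
    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = src[col] * c[0] + src[col + 1] * c[1] +
                      src[col + 2] * c[2] + src[col + 3] * c[3];
            int val = (sum + 32) >> 6;          // assumes arithmetic shift for negative sums
            dst[col] = (pixel)(val < 0 ? 0 : (val > 255 ? 255 : val));
        }
        src += srcStride;
        dst += dstStride;
    }
}

// Hypothetical harness: fixed block size, stride 64, every coeffIdx from 0 to 7.
bool check_chroma_hpp(filter_pp_t ref, filter_pp_t opt, int width, int height)
{
    const intptr_t stride = 64;
    static pixel src[64 * 66], outRef[64 * 64], outOpt[64 * 64];
    assert(width <= 32 && height <= 64);
    for (std::size_t i = 0; i < sizeof(src); i++)
        src[i] = (pixel)((i * 37 + 11) & 255);  // deterministic test pattern

    for (int coeffIdx = 0; coeffIdx < 8; coeffIdx++)
    {
        std::memset(outRef, 0xCD, sizeof(outRef));
        std::memset(outOpt, 0xCD, sizeof(outOpt));
        ref(src + 8, stride, outRef, stride, height, coeffIdx);
        opt(src + 8, stride, outOpt, stride, height, coeffIdx);
        // Comparing the whole buffers also catches writes outside the block.
        if (std::memcmp(outRef, outOpt, sizeof(outRef)) != 0)
            return false;
    }
    return true;
}
```

Pre-filling both output buffers with a marker byte is a deliberate choice: a primitive that writes past its block width fails the comparison instead of going unnoticed.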
Re: [x265] [PATCH REVIEW Only ] chroma 4XN block, coeffIdex insted of coeff pointer
I just missed changing the line

    mova        coef2, [tab_coeff + 16]

(I was only testing coeffIdx 1). I will make it work for a random index, like mova coef2, [tab_coeff + height * 16]. Please ignore this.

Regards,
Praveen

On Fri, Oct 11, 2013 at 10:20 PM, prav...@multicorewareinc.com wrote:

> # HG changeset patch
> # User Praveen Tiwari
> # Date 1381510220 -19800
> # Node ID 5a9160e8b0bdc3117c2417bc29453077488efd8e
> # Parent  c6d89dc62e191f56f63dbcb1781a6494da50a70d
> chroma 4XN block, coeffIdex insted of coeff pointer
>
> diff -r c6d89dc62e19 -r 5a9160e8b0bd source/common/x86/ipfilter8.asm
> --- a/source/common/x86/ipfilter8.asm Fri Oct 11 01:47:53 2013 -0500
> +++ b/source/common/x86/ipfilter8.asm Fri Oct 11 22:20:20 2013 +0530
> @@ -26,107 +26,58 @@
>  %include "x86inc.asm"
>  %include "x86util.asm"
>
> -%if ARCH_X86_64 == 0
> -
>  SECTION_RODATA 32
> -tab_leftmask:   db -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0
> -
>  tab_Tm:         db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
> -                db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
>
>  tab_c_512:      times 8 dw 512
>
> +tab_coeff:      db  0, 64,  0,  0,  0, 64,  0,  0,  0, 64,  0,  0,  0, 64,  0,  0
> +                db -2, 58, 10, -2, -2, 58, 10, -2, -2, 58, 10, -2, -2, 58, 10, -2
> +                db -4, 54, 16, -2, -4, 54, 16, -2, -4, 54, 16, -2, -4, 54, 16, -2
> +                db -6, 46, 28, -4, -6, 46, 28, -4, -6, 46, 28, -4, -6, 46, 28, -4
> +                db -4, 36, 36, -4, -4, 36, 36, -4, -4, 36, 36, -4, -4, 36, 36, -4
> +                db -4, 28, 46, -6, -4, 28, 46, -6, -4, 28, 46, -6, -4, 28, 46, -6
> +                db -2, 16, 54, -4, -2, 16, 54, -4, -2, 16, 54, -4, -2, 16, 54, -4
> +                db -2, 10, 58, -2, -2, 10, 58, -2, -2, 10, 58, -2, -2, 10, 58, -2
> +
>  SECTION .text
>
> -%macro FILTER_H4 3
> -    movu        %1, [src + col - 1]
> -    pshufb      %2, %1, Tm4
> +%macro FILTER_H4_w4 3
> +    movu        %1, [srcq - 1]
> +    pshufb      %2, %1, Tm0
>      pmaddubsw   %2, coef2
> -    pshufb      %1, %1, Tm5
> -    pmaddubsw   %1, coef2
>      phaddw      %2, %1
>      pmulhrsw    %2, %3
>      packuswb    %2, %2
>  %endmacro
>
> +%macro FILTER_H4_w4_CALL 0
> +    FILTER_H4_w4 x0, x1, x2
> +
> +    movd        [dstq], x1
> +
> +    add         srcq, srcstrideq
> +    add         dstq, dststrideq
> +%endmacro
> +
>  ;-----------------------------------------------------------------------------
> -; void filterHorizontal_p_p_4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, short const *coeff)
> +; void interp_4tap_horiz_pp_w4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int height, int coeffIdx)
>  ;-----------------------------------------------------------------------------
>  INIT_XMM sse4
> -cglobal filterHorizontal_p_p_4, 0, 7, 8
> -%define src         r0
> -%define dst         r1
> -%define row         r2
> -%define col         r3
> -%define width       r4
> -%define widthleft   r5
> -%define mask_offset r6
> -%define coef2       m7
> -%define x3          m6
> -%define Tm5         m5
> -%define Tm4         m4
> -%define x2          m3
> -%define x1          m2
> -%define x0          m1
> -%define leftmask    m0
> -%define tmp         r0
> -%define tmp1        r1
> -
> -    mov         tmp, r6m
> -    movu        coef2, [tmp]
> -    packsswb    coef2, coef2
> -    pshufd      coef2, coef2, 0
> +cglobal interp_4tap_horiz_pp_w4, 6, 6, 5, src, srcstride, dst, dststride, height, coeffIdx
> +%define coef2       m4
> +%define Tm0         m3
> +%define x2          m2
> +%define x1          m1
> +%define x0          m0
>
> -    mova        x3, [tab_c_512]
> +    mova        coef2, [tab_coeff + 16]
> +    mova        x2, [tab_c_512]
> +    mova        Tm0, [tab_Tm]
>
> -    mov         width, r4m
> -    mov         widthleft, width
> -    and         width, ~7
> -    and         widthleft, 7
> -    mov         mask_offset, widthleft
> -    neg         mask_offset
> +.loop
> +    FILTER_H4_w4_CALL
> +    dec         r4d
> +    jnz         .loop
> +    RET
>
> -    movq        leftmask, [tab_leftmask + (7 + mask_offset)]
> -    mova        Tm4, [tab_Tm]
> -    mova        Tm5, [tab_Tm + 16]
> -
> -    mov         src, r0m
> -    mov         dst, r2m
> -    mov         row, r5m
> -
> -_loop_row:
> -    xor         col, col
> -
> -_loop_col:
> -    FILTER_H4   x0, x1, x3
> -    movh        [dst + col], x1
> -
> -    add         col, 8
> -
> -    cmp         col, width
> -    jl          _loop_col
> -
> -_end_col:
> -    test        widthleft, widthleft
> -    jz          _next_row
> -
> -    movq        x2, [dst + col]
> -    FILTER_H4   x0, x1, x3
> -    pblendvb    x2, x2, x1, leftmask
> -    movh        [dst + col], x2
> -
> -_next_row:
> -    add         src, r1m
> -    add         dst, r3m
> -    dec         row
> -
> -    test        row, row
> -    jz          _end_row
> -
> -    jmp         _loop_row
> -
> -_end_row
Re: [x265] [PATCH REVIEW Only ] chroma 4XN block, coeffIdex insted of coeff pointer
Oh, it will be

    mova        coef2, [tab_coeff + coeffIdx * 16]

On Fri, Oct 11, 2013 at 11:21 PM, Praveen Tiwari prav...@multicorewareinc.com wrote:

> I just missed changing the line
>
>     mova        coef2, [tab_coeff + 16]
>
> (I was only testing coeffIdx 1). I will make it work for a random index,
> like mova coef2, [tab_coeff + height * 16]. Please ignore this.
>
> Regards,
> Praveen
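Two details of this patch are easy to check in scalar C++: the row offset coeffIdx * 16 into tab_coeff, and the use of pmulhrsw with the constant 512 as a rounding shift by 6. The sketch below is illustrative only; `build_tab_coeff`, `coeff_row` and `pmulhrsw_lane` are hypothetical helper names, and the lane model follows the documented per-lane behaviour of pmulhrsw for 16-bit inputs.

```cpp
#include <cassert>
#include <cstdint>

// Chroma taps, one row per coeffIdx, as in the table quoted in the patch.
static const int8_t kChromaTaps[8][4] =
{
    { 0, 64, 0, 0 },   { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

// Builds a 128-byte table laid out like tab_coeff: each 16-byte row holds the
// four taps of one coeffIdx repeated four times.
void build_tab_coeff(int8_t tab[8 * 16])
{
    for (int idx = 0; idx < 8; idx++)
        for (int i = 0; i < 16; i++)
            tab[idx * 16 + i] = kChromaTaps[idx][i & 3];
}

// Equivalent of the address tab_coeff + coeffIdx * 16.
const int8_t* coeff_row(const int8_t* tab, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    return tab + coeffIdx * 16;
}

// Scalar model of one 16-bit lane of pmulhrsw.
// Relies on arithmetic right shift for negative values.
int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t p = int32_t(a) * int32_t(b);
    return int16_t(((p >> 14) + 1) >> 1);
}
```

With b = 512 the lane model reduces to (a + 32) >> 6, which is the rounding shift the filter needs after summing taps that add up to 64.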
[x265] Fwd: [PATCH] replace pixelsub_sp vector class function with intrinsic
     for (int x = 0; x < bx; x += 16)
     {
-        Vec16uc word0, word1;
-        Vec8s word3, word4;
-        word0.load_a(src0 + x);
-        word1.load_a(src1 + x);
-        word3 = extend_low(word0) - extend_low(word1);
-        word4 = extend_high(word0) - extend_high(word1);
-        word3.store_a(dst + x);
-        word4.store_a(dst + x + 8);
+        __m128i word0, word1;
+        __m128i word3, word4;
+        __m128i mask = _mm_setzero_si128();
+
+        word0 = _mm_load_si128((__m128i const*)(src0 + x)); // load 16 bytes from src1
+        word1 = _mm_load_si128((__m128i const*)(src1 + x)); // load 16 bytes from src2

Please pay attention to the variable names when writing comments: it should be src0 and src1, not src1 and src2.

+
+        word3 = _mm_unpacklo_epi8(word0, mask); // interleave with zero extensions
+        word4 = _mm_unpacklo_epi8(word1, mask);
+        _mm_store_si128((__m128i*)&dst[x], _mm_subs_epi16(word3, word4)); // store block into dst
+
+        word3 = _mm_unpackhi_epi8(word0, mask); // interleave with zero extensions
+        word4 = _mm_unpackhi_epi8(word1, mask);
+        _mm_store_si128((__m128i*)&dst[x + 8], _mm_subs_epi16(word3, word4)); // store block into dst
     }

I think we should also try unrolling the loop for multiples of 8; that may give some more performance gain.

Regards,
Praveen
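The widening subtraction in this patch can be tried in isolation with the sketch below. It is not the x265 primitive: `pixel_sub_row` is a hypothetical helper, it uses unaligned loads and stores so it works on any buffer, and it falls back to a scalar loop when SSE2 is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical helper: dst[i] = src0[i] - src1[i] for n 8-bit pixels,
// widened to 16 bits. n must be a multiple of 16.
void pixel_sub_row(int16_t* dst, const uint8_t* src0, const uint8_t* src1, std::size_t n)
{
    assert(n % 16 == 0);
#if SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t x = 0; x < n; x += 16)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src0 + x)); // 16 bytes from src0
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src1 + x)); // 16 bytes from src1
        // Interleaving with zero widens the unsigned bytes to 16-bit lanes.
        // The saturating subtract never saturates here: results lie in [-255, 255].
        __m128i lo = _mm_subs_epi16(_mm_unpacklo_epi8(a, zero), _mm_unpacklo_epi8(b, zero));
        __m128i hi = _mm_subs_epi16(_mm_unpackhi_epi8(a, zero), _mm_unpackhi_epi8(b, zero));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x + 8), hi);
    }
#else
    for (std::size_t x = 0; x < n; x++)
        dst[x] = int16_t(src0[x] - src1[x]);   // scalar fallback, same result
#endif
}
```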
Re: [x265] [PATCH] asm code for ipfilterH_pp, 4 tap filter
Suppose that during execution the width is less than 8, say 5. Then we only want to run the code section that handles the remaining width (_end_col:), not the whole code (the multiple-of-8 part followed by the remaining-width part). Otherwise the block would be computed twice in this case, corrupting some of the (8 - widthleft) old dst[] values that are used with the 'pblendvb' instruction. This is why we added the check. If width is always >= 8, you are right, we don't need the check.

Regards,
Praveen

On Fri, Sep 27, 2013 at 9:05 PM, Jason Garrett-Glaser ja...@x264.com wrote:

> +_loop_row:
> +    xor     col, col
> +    cmp     width, 0
> +    je      _end_col
>
> I don't understand this. Why do we have to do this check?
>
> Jason
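The control flow being discussed can be modelled in scalar C++. This is a sketch under assumptions: `filter_pixel` is a stand-in for the real 4-tap filter, `filter_row` is a hypothetical name, and the bottom-tested do/while mirrors a column loop whose condition is checked only after the first group of 8 has been written.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the real filter; any per-pixel function works for this model.
static uint8_t filter_pixel(const uint8_t* src, int col)
{
    return uint8_t(src[col] + 1);
}

// Processes one row: full groups of 8 columns first, then the leftover columns.
void filter_row(uint8_t* dst, const uint8_t* src, int totalWidth)
{
    assert(totalWidth >= 0);
    int width = totalWidth & ~7;      // columns covered by full groups of 8
    int widthleft = totalWidth & 7;   // 0..7 leftover columns
    int col = 0;

    if (width != 0)                   // the guard in question: skip the main loop entirely
    {
        do                            // condition is tested at the bottom of the loop
        {
            for (int i = 0; i < 8; i++)
                dst[col + i] = filter_pixel(src, col + i);
            col += 8;
        } while (col < width);
    }

    // Tail: only widthleft pixels are written, so the old dst values behind
    // them survive (the job of the blend against the left mask).
    for (int i = 0; i < widthleft; i++)
        dst[col + i] = filter_pixel(src, col + i);
}
```

Without the guard, a width of 5 would run the loop body once, overwrite dst[5..7], and then run the tail starting at column 8.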