Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-19 Thread Alan Kelly
Thanks James for spotting this. I have sent two patches fixing the valgrind error from checkasm and the unchecked av_mallocs. I do not believe that the two remaining valgrind errors come from my patch, although I may be mistaken. Using git bisect, I have identified b94cd55155d8c061f1e1faca9076afe5

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-18 Thread James Almer
On 2/17/2021 5:24 PM, Paul B Mahol wrote: > On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <alankelly-at-google@ffmpeg.org> wrote: > > Looks like there are no comments, is this OK to be applied? Thanks > Applied, thanks for pinging. Valgrind complains about this change. The checkasm test specific

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-17 Thread Paul B Mahol
On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <alankelly-at-google@ffmpeg.org> wrote: > Looks like there are no comments, is this OK to be applied? Thanks Applied, thanks for pinging. > On Tue, Feb 9, 2021 at 6:25 PM Paul B Mahol wrote: > > Will apply if no comments.

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-16 Thread Alan Kelly
Looks like there are no comments, is this OK to be applied? Thanks On Tue, Feb 9, 2021 at 6:25 PM Paul B Mahol wrote: > Will apply if no comments.

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-09 Thread Paul B Mahol
Will apply if no comments. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-09 Thread Alan Kelly
Ping! On Thu, Jan 14, 2021 at 3:47 PM Alan Kelly wrote: > --- > Replaces cpuflag(mmx) with notcpuflag(sse3) for store macro > Tests for multiple sizes in checkasm-sw_scale > checkasm-sw_scale aligns memory on 8 bytes instead of 32 to catch aligned > loads > libswscale/x86/Makefile |

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-14 Thread Alan Kelly
--- Replaces cpuflag(mmx) with notcpuflag(sse3) for store macro Tests for multiple sizes in checkasm-sw_scale checkasm-sw_scale aligns memory on 8 bytes instead of 32 to catch aligned loads libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c | 130 ---

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-14 Thread Alan Kelly
Apologies for this: when I added mmx to the yasm file, I added a macro for the stores that selects mova for mmx and movdqu for the others. However, "if cpuflag(mmx)" evaluates to true for all architectures, so I replaced it with "if notcpuflag(sse3)". The alignment in the checkasm test has been changed to 8 from 3
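
The alignment change described above can be illustrated with a small C sketch. It assumes a C11 libc that provides aligned_alloc; the helper name alloc_misaligned8 is hypothetical and is not part of checkasm. The idea is to hand the function under test a pointer that is 8-byte aligned but deliberately not 16- or 32-byte aligned, so that an aligned access (mova/movdqa) on that pointer faults instead of passing unnoticed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from checkasm): returns a pointer that is
 * 8-byte aligned but NOT 16-byte aligned. *base receives the pointer
 * that must later be passed to free(). */
static uint8_t *alloc_misaligned8(size_t size, void **base)
{
    assert(base != NULL);
    /* aligned_alloc requires the size to be a multiple of the alignment;
     * reserve 8 extra bytes for the deliberate misalignment. */
    size_t total = (size + 8 + 31) & ~(size_t)31;
    uint8_t *p = aligned_alloc(32, total);
    if (!p)
        return NULL;
    *base = p;
    return p + 8; /* 32-byte aligned base + 8 => only 8-byte aligned */
}
```

A test buffer obtained this way still satisfies an 8-byte (mmx) requirement, which is presumably why 8 was chosen over 32 in the test.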

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-13 Thread Michael Niedermayer
On Mon, Jan 11, 2021 at 05:46:31PM +0100, Alan Kelly wrote: > --- > Fixes a bug where if there is no offset and a tail which is not processed by > the > sse3/avx2 version the dither is modified > Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it > to yuv2yuvX.asm to reduce

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-11 Thread Alan Kelly
--- Fixes a bug where if there is no offset and a tail which is not processed by the sse3/avx2 version the dither is modified Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it to yuv2yuvX.asm to reduce code duplication and so that it may be used to process the tail from th

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-11 Thread Alan Kelly
It's a bug in the patch. The tail not processed by the sse3/avx2 version is done by the mmx version. I used offset to account for the src pixels already processed, however, dither is modified if offset is not 0. In cases where there is a tail and offset is 0, this bug appears. I am working on a sol
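
The failure mode described above can be reproduced with a toy model in C. This is only a sketch under stated assumptions: toy_kernel, split_buggy and split_fixed are hypothetical names, the arithmetic is not the swscale filter, and the rotation amount is arbitrary. The only property taken from the message is that the kernel alters the dither whenever offset is non-zero. The "fixed" variant is one possible remedy, not necessarily what the revised patch does, and it assumes the wide part is a multiple of the 8-entry dither period.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the fallback kernel (NOT the FFmpeg code): it starts
 * at pixel `offset` and alters the dither whenever offset is non-zero. */
static void toy_kernel(uint8_t *dest, const uint8_t *src, int dstW,
                       const uint8_t dither[8], int offset)
{
    uint8_t d[8];
    memcpy(d, dither, 8);
    if (offset) {                       /* dither is modified if offset != 0 */
        for (int i = 0; i < 8; i++)
            d[i] = dither[(i + 3) & 7]; /* arbitrary rotation for illustration */
    }
    for (int i = offset; i < dstW; i++)
        dest[i] = (uint8_t)(src[i] + d[i & 7]);
}

/* Buggy split: reuses offset to mean "pixels already processed". */
static void split_buggy(uint8_t *dest, const uint8_t *src, int dstW,
                        const uint8_t dither[8], int body)
{
    toy_kernel(dest, src, body, dither, 0);    /* wide part */
    toy_kernel(dest, src, dstW, dither, body); /* tail: dither gets rotated */
}

/* One possible remedy: advance the pointers and keep offset at 0. */
static void split_fixed(uint8_t *dest, const uint8_t *src, int dstW,
                        const uint8_t dither[8], int body)
{
    assert(body % 8 == 0); /* keeps the dither phase of the tail unchanged */
    toy_kernel(dest, src, body, dither, 0);
    toy_kernel(dest + body, src + body, dstW - body, dither, 0);
}
```

With a 20-pixel row and a 16-pixel wide part, split_buggy matches a single full-width call for the first 16 pixels and diverges in the last 4, while split_fixed matches everywhere.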

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-10 Thread Michael Niedermayer
On Thu, Jan 07, 2021 at 10:41:19AM +0100, Alan Kelly wrote: > --- > Replaces mova with movdqu due to alignment issues > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 106 +--- > libswscale/x86/yuv2yuvX.asm | 117 ++

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-10 Thread Michael Niedermayer
On Thu, Jan 07, 2021 at 10:39:56AM +0100, Alan Kelly wrote: > Thanks for your patience with this, I have replaced mova with movdqu - movu > generated a compile error on ssse3. What system did this crash on? AMD Ryzen 9 3950X on Linux

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-07 Thread Alan Kelly
Thanks for your patience with this, I have replaced mova with movdqu - movu generated a compile error on ssse3. What system did this crash on? On Wed, Jan 6, 2021 at 9:10 PM Michael Niedermayer wrote: > On Tue, Jan 05, 2021 at 01:31:25PM +0100, Alan Kelly wrote: > > Ping! > > crashes (due to ali

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-07 Thread Alan Kelly
--- Replaces mova with movdqu due to alignment issues libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2yuvX.asm | 117 tests/checkasm/sw_scale.c | 98 ++

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-06 Thread Michael Niedermayer
On Tue, Jan 05, 2021 at 01:31:25PM +0100, Alan Kelly wrote: > Ping! crashes (due to alignment, I think) (gdb) disassemble $rip-32,$rip+32 Dump of assembler code from 0x555730a1 to 0x555730e1: 0x555730a1 : int $0x71 0x555730a3 : out %al,$0x3 0x5557

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-05 Thread Alan Kelly
Ping! On Thu, Dec 17, 2020 at 11:42 AM Alan Kelly wrote: > --- > Fixes memory alignment problem in checkasm-sw_scale > Tested on Linux 32 and 64 bit and mingw32 > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 106 +--- > libswscale/x86/yuv2yu

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-17 Thread Alan Kelly
--- Fixes memory alignment problem in checkasm-sw_scale Tested on Linux 32 and 64 bit and mingw32 libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2yuvX.asm | 117 tests/checkasm/sw_sca

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-11 Thread Michael Niedermayer
On Thu, Dec 10, 2020 at 04:46:26PM +0100, Alan Kelly wrote: > --- > Replaces ff_sws_init_swscale_x86 with ff_getSwsFunc > Load offset if not gprsize but 8 on both 32 and 64 bit > Removes sfence as NT store no longer used > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 106

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-10 Thread Alan Kelly
--- Replaces ff_sws_init_swscale_x86 with ff_getSwsFunc Load offset if not gprsize but 8 on both 32 and 64 bit Removes sfence as NT store no longer used libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2yuvX.asm | 117 +++

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-10 Thread Josh Dekker
On 2020/12/09 11:19, Alan Kelly wrote: --- Activates avx2 version of yuv2yuvX Adds checkasm for yuv2yuvX Modifies ff_yuv2yuvX_* signature to match yuv2yuvX_* Replaces non-temporal stores with temporal stores libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +++

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-09 Thread Alan Kelly
This function is tested by fate-filter-fps-r. I have also added a checkasm test and bench. I have done a lot more testing and benching of this code and I am now happy to activate the avx2 version because the performance is so good. On my machine I get the following results for filter size 4 and 0

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-09 Thread Alan Kelly
--- Activates avx2 version of yuv2yuvX Adds checkasm for yuv2yuvX Modifies ff_yuv2yuvX_* signature to match yuv2yuvX_* Replaces non-temporal stores with temporal stores libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2y

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-04 Thread Anton Khirnov
Quoting Alan Kelly (2020-11-19 09:41:56) > --- > All of Henrik's suggestions have been implemented. Additionally, > m3 and m6 are permuted in avx2 before storing to ensure bit by bit > identical results in avx2. > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 +++-

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-01 Thread Alan Kelly
Ping On Thu, Nov 19, 2020 at 9:42 AM Alan Kelly wrote: > --- > All of Henrik's suggestions have been implemented. Additionally, > m3 and m6 are permuted in avx2 before storing to ensure bit by bit > identical results in avx2. > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-19 Thread Alan Kelly
--- All of Henrik's suggestions have been implemented. Additionally, m3 and m6 are permuted in avx2 before storing to ensure bit by bit identical results in avx2. libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 +++ libswscale/x86/yuv2yuvX.asm | 118 ++
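
Why a permute can be needed for bit-identical AVX2 output can be shown with a scalar C model. This is a sketch, not the patch's code: it assumes the results are narrowed with a 256-bit in-lane pack (vpackuswb), which packs each 128-bit lane separately and therefore leaves the bytes out of linear order. The function names are hypothetical, and the patch may apply its permute at a different point (the message only says m3 and m6 are permuted before storing).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Saturate a signed 16-bit value to the unsigned 8-bit range. */
static uint8_t sat_u8(int16_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar model of a 256-bit in-lane pack (vpackuswb behaviour):
 * each 128-bit lane packs 8 words of a, then 8 words of b. */
static void pack_inlane(const int16_t a[16], const int16_t b[16],
                        uint8_t out[32])
{
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 8; i++) {
            out[lane * 16 + i]     = sat_u8(a[lane * 8 + i]);
            out[lane * 16 + 8 + i] = sat_u8(b[lane * 8 + i]);
        }
}

/* Reorder the four 64-bit quarters as 0,2,1,3 (what vpermq with
 * immediate 0xD8 does) to restore linear pixel order. */
static void permute_0213(const uint8_t in[32], uint8_t out[32])
{
    static const int order[4] = { 0, 2, 1, 3 };
    assert(in != out); /* memcpy must not be given overlapping buffers */
    for (int q = 0; q < 4; q++)
        memcpy(out + q * 8, in + order[q] * 8, 8);
}
```

After pack_inlane the byte layout is a0..a7, b0..b7, a8..a15, b8..b15; permute_0213 turns it into a0..a15, b0..b15, which is the order a 128-bit (sse3) path would store.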

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-17 Thread Henrik Gramner
On Mon, Nov 16, 2020 at 11:03 AM Alan Kelly wrote: > +cglobal yuv2yuvX, 6, 7, 16, filter, filterSize, dest, dstW, dither, offset, > src Only 8 xmm registers are used, so 8 should be used instead of 16 here. Otherwise it causes unnecessary spilling of registers on 64-bit Windows. > +%if ARCH_X86_

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-16 Thread Alan Kelly
--- Fixes bug in sse3 path where m1 is not set correctly resulting in off by one errors. The results are now bit by bit identical. libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 libswscale/x86/yuv2yuvX.asm | 114 ++

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-13 Thread Michael Niedermayer
On Thu, Nov 12, 2020 at 09:33:18AM +0100, Alan Kelly wrote: > --- > It now works on x86-32 > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 > libswscale/x86/yuv2yuvX.asm | 110 > 3 files changed, 121 insertio

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-12 Thread Alan Kelly
--- It now works on x86-32 libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 libswscale/x86/yuv2yuvX.asm | 110 3 files changed, 121 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-10 Thread Carl Eugen Hoyos
On Fri, Nov 6, 2020 at 09:04, Alan Kelly wrote: > > The function was re-written in asm, this code is heavily derived from the > original code, the algorithm remains unchanged, the implementation is > optimized. Would you agree to adding the copyright from swscale.c: > * Copyright (C) 2001-20

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-10 Thread Michael Niedermayer
On Tue, Nov 10, 2020 at 09:43:47AM +0100, Alan Kelly wrote: > --- > yuv2yuvX.asm: Ports yuv2yuvX to asm, unrolls main loop and adds > other small optimizations for ~20% speed-up. Copyright updated to > include the original from swscale.c > swscale.c: Removes yuv2yuvX_sse3 and calls new function

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-10 Thread Alan Kelly
--- yuv2yuvX.asm: Ports yuv2yuvX to asm, unrolls main loop and adds other small optimizations for ~20% speed-up. Copyright updated to include the original from swscale.c swscale.c: Removes yuv2yuvX_sse3 and calls new function ff_yuv2yuvX_sse3. Calls yuv2yuvX_mmxext on remaining elements if r

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-06 Thread Alan Kelly
The function was rewritten in asm; this code is heavily derived from the original code. The algorithm remains unchanged, but the implementation is optimized. Would you agree to adding the copyright from swscale.c: * Copyright (C) 2001-2011 Michael Niedermayer to this file, having both copyrights? Th

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-31 Thread Carl Eugen Hoyos
On Tue, Oct 27, 2020 at 09:56, Alan Kelly wrote: > --- /dev/null > +++ b/libswscale/x86/yuv2yuvX.asm > @@ -0,0 +1,105 @@ > +;** > +;* x86-optimized yuv2yuvX > +;* Copyright 2020 Google LLC Either the commit message

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
Thanks for the feedback, Anton. The second patch incorporates changes suggested by James Almer: avx2 instructions are wrapped in if cpuflag(avx2) and movddup is restored; mm1 is replaced by m1 on x86_32. On Tue, Oct 27, 2020 at 10:40 AM Anton Khirnov wrote: > Hi, > Quoting Alan Kelly (2020-10-27 10

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Anton Khirnov
Hi, Quoting Alan Kelly (2020-10-27 10:10:14) > --- > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 - > libswscale/x86/yuv2yuvX.asm | 109 > 3 files changed, 120 insertions(+), 65 deletions(-) > create mode 10

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
--- libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 - libswscale/x86/yuv2yuvX.asm | 109 3 files changed, 120 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
Apologies for the multiple threads, my git send-email was wrongly configured. This has been fixed. This code has been tested on AVX2 giving a significant speedup, however, until the ff_hscale* functions are ported to avx2, this should not be enabled as it results in an overall slowdown of swscale

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a signifi

2020-10-27 Thread Alan Kelly
Thanks for the review, I have made the required changes. As I have changed the subject the patch is in a new thread. On Fri, Oct 23, 2020 at 4:10 PM James Almer wrote: > On 10/23/2020 10:17 AM, Alan Kelly wrote: > > Fixed. The wrong step size was used causing a write past the end of > > the

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
--- libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 -- libswscale/x86/yuv2yuvX.asm | 105 3 files changed, 116 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscal

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant spee

2020-10-24 Thread Michael Niedermayer
On Fri, Oct 23, 2020 at 03:34:18PM +0200, Alan Kelly wrote: > Fixed. The wrong step size was used causing a write past the end of > the buffer. yuv2yuvX_mmxext is now called if there are any remaining > pixels. > > There is currently no checkasm for these functions. Is this required for > sub

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a signifi

2020-10-23 Thread James Almer
On 10/23/2020 10:17 AM, Alan Kelly wrote: > Fixed. The wrong step size was used causing a write past the end of > the buffer. yuv2yuvX_mmxext is now called if there are any remaining pixels. Please fix the commit subject (It's too long and contains commentary), and keep comments about fixes be

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant spee

2020-10-23 Thread Alan Kelly
Fixed. The wrong step size was used causing a write past the end of the buffer. yuv2yuvX_mmxext is now called if there are any remaining pixels. There is currently no checkasm for these functions. Is this required for submission? (Apologies for the double mail, I used git send-email but it
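
The body/tail split described in this message can be sketched in C as follows. Everything here is an assumption for illustration: STEP (16 pixels per wide iteration), wide_kernel, fallback_kernel and scale_row are hypothetical names, and both kernels merely copy bytes in place of the real filter. The point is only that the wide loop must advance by exactly the number of pixels it writes per iteration, and that whatever is left over goes to a narrower fallback, so nothing is written past dstW.

```c
#include <assert.h>
#include <stdint.h>

#define STEP 16 /* assumed pixels per wide-kernel iteration; illustrative only */

/* Stand-in for the SIMD kernel: handles whole STEP-sized chunks only. */
static void wide_kernel(uint8_t *dest, const uint8_t *src, int n)
{
    assert(n % STEP == 0);
    for (int i = 0; i < n; i += STEP)   /* step matches the pixels written */
        for (int k = 0; k < STEP; k++)
            dest[i + k] = src[i + k];
}

/* Stand-in for the narrower fallback that handles any remaining pixels. */
static void fallback_kernel(uint8_t *dest, const uint8_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = src[i];
}

/* Process the largest STEP multiple with the wide kernel, then the tail. */
static void scale_row(uint8_t *dest, const uint8_t *src, int dstW)
{
    int body = dstW - dstW % STEP;
    if (body)
        wide_kernel(dest, src, body);
    if (dstW > body)
        fallback_kernel(dest + body, src + body, dstW - body);
}
```

A guard byte placed directly after the destination row is a cheap way to detect the kind of overrun mentioned above.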

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a significant

2020-10-23 Thread Alan Kelly
Fixed. The wrong step size was used causing a write past the end of the buffer. yuv2yuvX_mmxext is now called if there are any remaining pixels. --- libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 -- libswscale/x86/yuv2yuvX.asm | 105

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant spee

2020-10-22 Thread Michael Niedermayer
On Thu, Oct 22, 2020 at 09:43:53AM +0200, Alan Kelly wrote: > Other functions to be ported to avx2 have been identified and are on > the todo list. > --- > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 72 +++-- > libswscale/x86/yuv2yuvX.asm | 105 ++

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant spee

2020-10-22 Thread Jean-Baptiste Kempf
Do we have checkasm for those functions? On Thu, 22 Oct 2020, at 09:43, Alan Kelly wrote: > Other functions to be ported to avx2 have been identified and are on > the todo list. > --- > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 72 +++-- > libswscal

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speed-up

2020-10-22 Thread Alan Kelly
Other functions to be ported to avx2 have been identified and are on the todo list. --- libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 72 +++-- libswscale/x86/yuv2yuvX.asm | 105 3 files changed, 112 insertions(+), 66 d