[FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

2019-11-25 Thread Sebastian Pop
Hi,
This patch implements ff_hscale_8_to_15_neon with NEON fused multiply-accumulate instructions and raises the vectorization factor from 2 to 4. I have seen speedups of up to 15% on Graviton A1 instances, which are based on Cortex-A72 CPUs.
$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop
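As a rough illustration of what a vectorization factor of 4 means for this kernel, here is a portable scalar sketch. It is not the NEON code from the patch: the function names (hscale_ref, hscale_x4, clip15) are hypothetical, and the arithmetic (accumulate src * filter, shift right by 7, saturate to 15 bits) is an assumption modeled on the scalar 8-to-15-bit horizontal scaler in libswscale. The second function keeps four independent accumulators so that four output pixels progress through the filter taps together, which is the shape a 4-wide multiply-accumulate loop would take.

```c
#include <stdint.h>

/* Shift and saturate to the 15-bit output range (assumed semantics). */
static int16_t clip15(int val)
{
    val >>= 7;
    return val > 32767 ? 32767 : (int16_t)val;
}

/* Reference: one output pixel per outer iteration. */
static void hscale_ref(int16_t *dst, int dst_w, const uint8_t *src,
                       const int16_t *filter, const int32_t *filter_pos,
                       int filter_size)
{
    for (int i = 0; i < dst_w; i++) {
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[filter_pos[i] + j] * filter[filter_size * i + j];
        dst[i] = clip15(val);
    }
}

/* Four output pixels per outer iteration, one accumulator each. */
static void hscale_x4(int16_t *dst, int dst_w, const uint8_t *src,
                      const int16_t *filter, const int32_t *filter_pos,
                      int filter_size)
{
    int i = 0;
    for (; i + 4 <= dst_w; i += 4) {
        int acc[4] = { 0, 0, 0, 0 };
        for (int j = 0; j < filter_size; j++)
            for (int k = 0; k < 4; k++)   /* four independent multiply-adds */
                acc[k] += src[filter_pos[i + k] + j] *
                          filter[filter_size * (i + k) + j];
        for (int k = 0; k < 4; k++)
            dst[i + k] = clip15(acc[k]);
    }
    for (; i < dst_w; i++) {              /* leftover pixels */
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[filter_pos[i] + j] * filter[filter_size * i + j];
        dst[i] = clip15(val);
    }
}
```

Both functions should produce identical output for any input; only the order in which the multiply-adds are issued differs.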

Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

2019-11-25 Thread Sebastian Pop
On Mon, Nov 25, 2019 at 4:18 PM Jean-Baptiste Kempf wrote:
> Why adding a new version, in intrinsics, instead of changing the existing
> implementation?
Personal preference: I like to read C code instead of asm. Also I find it much easier to experiment by changing C code rather than asm. Is t

Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

2019-11-27 Thread Sebastian Pop
:0.030462 min:0.030051
Tested with `make check` on aarch64-linux. Please let me know if I can make the patch better.
Thank you,
Sebastian
From e04f9606f7ea581d8398eb2f37df2f59add8b374 Mon Sep 17 00:00:00 2001
From: Sebastian Pop
Date: Sun, 17 Nov 2019 14:13:13 -0600
Subject: [PATCH] [aarch64] use

Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

2019-11-27 Thread Sebastian Pop
On Wed, Nov 27, 2019 at 12:37 PM Jean-Baptiste Kempf wrote:
> > Please let me know if I can make the patch better.
>
> Remove the commented lines.
Attached is the updated patch.
Thank you,
Sebastian
0001-aarch64-use-FMA-and-increase-vector-factor-to-4.patch

Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

2019-11-27 Thread Sebastian Pop
On Wed, Nov 27, 2019 at 2:13 PM Clément Bœsch wrote:
> Yeah I will by the end of the week. I wrote that a few years ago so I need
> to take some time to get back in the context.
Thanks Clément for your help.
> BTW, that's quite a huge speed improvement you're bringing in, are you
> sure you ar

Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

2019-12-04 Thread Sebastian Pop
> On Wed, Nov 27, 2019 at 12:30:35PM -0600, Sebastian Pop wrote:
> > [...]
> >> From 9ecaa99fab4b8bedf3884344774162636eaa5389 Mon Sep 17 00:00:00 2001
> >> From: Sebastian Pop
> >> Date: Sun, 17 Nov 2019 14:13:13 -0600
> >> Subject: [PATCH] [aarch64] use

Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

2019-12-09 Thread Sebastian Pop
On Mon, Dec 9, 2019 at 5:01 AM Clément Bœsch wrote: > > On Sun, Dec 08, 2019 at 11:08:31PM +0200, Martin Storsjö wrote: > > On Sun, 8 Dec 2019, Clément Bœsch wrote: > > > > > On Wed, Dec 04, 2019 at 05:24:46PM -0600, Sebastian Pop wrote: > > > > Hi Clémen

[FFmpeg-devel] [aarch64] improve performance of ff_yuv2planeX_8_neon

2019-12-10 Thread Sebastian Pop
Hi, This patch rewrites the innermost loop of ff_yuv2planeX_8_neon to avoid zips and horizontal adds by using fused multiply adds. The patch also uses ld1r to load one element and replicate it across all lanes of the vector. The patch also improves the clipping code by removing the shift right ins
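The idea of replicating one filter coefficient across all lanes and accumulating lane-wise can be sketched in portable C. This is a scalar model, not the assembly from the patch: the names (vscale_ref, vscale_lanes, clip_u8, LANES) are hypothetical, the 8-pixel block width is an assumption, and the dither term of the real function is left out to keep the sketch short. The shift by 19 assumes 15-bit intermediates and 12-bit coefficients. In the lane-wise form each accumulator belongs to exactly one output pixel, so no cross-lane (horizontal) add is needed at the end.

```c
#include <stdint.h>

#define LANES 8 /* assumed vector width in pixels, for illustration only */

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Reference: for each pixel, a dot product over the filter taps. */
static void vscale_ref(const int16_t *filter, int filter_size,
                       const int16_t **src, uint8_t *dst, int dst_w)
{
    for (int i = 0; i < dst_w; i++) {
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];
        dst[i] = clip_u8(val >> 19);
    }
}

/* Lane-wise form: for each tap, broadcast the coefficient and
 * multiply-accumulate into one accumulator per pixel of the block. */
static void vscale_lanes(const int16_t *filter, int filter_size,
                         const int16_t **src, uint8_t *dst, int dst_w)
{
    int i = 0;
    for (; i + LANES <= dst_w; i += LANES) {
        int acc[LANES] = { 0 };
        for (int j = 0; j < filter_size; j++) {
            const int coeff = filter[j];          /* one load, used by all lanes */
            for (int k = 0; k < LANES; k++)
                acc[k] += src[j][i + k] * coeff;  /* lane-wise multiply-add */
        }
        for (int k = 0; k < LANES; k++)
            dst[i + k] = clip_u8(acc[k] >> 19);
    }
    for (; i < dst_w; i++) {                      /* leftover pixels */
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];
        dst[i] = clip_u8(val >> 19);
    }
}
```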

Re: [FFmpeg-devel] [aarch64] improve performance of ff_yuv2planeX_8_neon

2019-12-25 Thread Sebastian Pop
On Mon, Dec 16, 2019 at 3:56 PM Jean-Baptiste Kempf wrote: > > On Tue, Dec 10, 2019, at 23:38, Sebastian Pop wrote: >> Please let me know how I can improve the patch. > > No remarks from me. > Clément, any further feedback to improve the patch? Ok to commi

[FFmpeg-devel] [aarch64] improve hscale by 50% with multi-threading

2020-07-17 Thread Sebastian Pop
hscale is bound by the number of multiply-adds available on a given core. The attached patch doubles the number of available multiply-adds by distributing half the load to a helper thread. Performance improves by up to 50% on Graviton2 Arm Neoverse-N1 processors.
$ ./ffmpeg_g -nostats -f lavfi -i testsrc2
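A minimal sketch of giving half of the output range to a helper thread, assuming POSIX threads. The names (HScaleJob, hscale_range, hscale_worker, hscale_two_threads) are hypothetical and are not taken from the patch. The sketch creates and joins a thread on every call, which is only meant to show the split; a production version would more likely keep a helper thread alive across calls, and the snippet does not show which approach the patch takes.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* One unit of work: output pixels [begin, end) of a single row. */
typedef struct HScaleJob {
    int16_t       *dst;
    int            begin, end;
    const uint8_t *src;
    const int16_t *filter;
    const int32_t *filter_pos;
    int            filter_size;
} HScaleJob;

/* Scalar horizontal scaling of the pixel range described by the job. */
static void hscale_range(const HScaleJob *job)
{
    for (int i = job->begin; i < job->end; i++) {
        int val = 0;
        for (int j = 0; j < job->filter_size; j++)
            val += job->src[job->filter_pos[i] + j] *
                   job->filter[job->filter_size * i + j];
        val >>= 7;
        job->dst[i] = val > 32767 ? 32767 : (int16_t)val;
    }
}

static void *hscale_worker(void *arg)
{
    hscale_range((const HScaleJob *)arg);
    return NULL;
}

/* Split the range in two: the helper takes the upper half while the
 * caller computes the lower half. Returns 0 on success. */
static int hscale_two_threads(HScaleJob whole)
{
    HScaleJob lo = whole, hi = whole;
    int mid = whole.begin + (whole.end - whole.begin) / 2;
    pthread_t helper;

    lo.end   = mid;
    hi.begin = mid;
    if (pthread_create(&helper, NULL, hscale_worker, &hi) != 0) {
        hscale_range(&whole);   /* no helper available: do it all here */
        return 0;
    }
    hscale_range(&lo);
    return pthread_join(helper, NULL);
}
```

The two halves write disjoint parts of dst and only read the shared inputs, so no locking is needed beyond the join.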

Re: [FFmpeg-devel] [aarch64] improve hscale by 50% with multi-threading

2020-07-29 Thread Sebastian Pop
On Sat, Jul 18, 2020 at 1:35 AM Michael Niedermayer wrote:
> Multithreading support should be added in an architecture independent way
The attached patch moves the helper threads up from hscale to chr_h_scale and lum_h_scale in an architecture-independent way. This new version of the patch improves pe

[FFmpeg-devel] [aarch64] yuv2planeX - unroll outer loop by 4 to increase performance by 6.3%

2020-08-18 Thread Sebastian Pop
Hi,
Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache accesses by 7.5%. The values loaded for the filter are reused across the 4 unrolled iterations, which avoids reloading the same values 3 times. The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal instance with
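To make the reuse of filter values concrete, here is a small instrumented sketch in portable C. It is a scalar stand-in, not the assembly from the patch: each "iteration" handles a single pixel rather than a vector of pixels, the dither term is omitted, and all names (vscale_x1, vscale_x4, sat_u8, filter_loads) are hypothetical. The counter exists only to show how many times a filter coefficient is fetched.

```c
#include <stdint.h>

static long filter_loads; /* instrumentation only: counts coefficient fetches */

static uint8_t sat_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Outer loop not unrolled: every pixel fetches every coefficient. */
static void vscale_x1(const int16_t *filter, int filter_size,
                      const int16_t **src, uint8_t *dst, int dst_w)
{
    for (int i = 0; i < dst_w; i++) {
        int acc = 0;
        for (int j = 0; j < filter_size; j++) {
            int c = filter[j];
            filter_loads++;
            acc += src[j][i] * c;
        }
        dst[i] = sat_u8(acc >> 19);
    }
}

/* Outer loop unrolled by 4: each fetched coefficient feeds 4 accumulators. */
static void vscale_x4(const int16_t *filter, int filter_size,
                      const int16_t **src, uint8_t *dst, int dst_w)
{
    int i = 0;
    for (; i + 4 <= dst_w; i += 4) {
        int a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        for (int j = 0; j < filter_size; j++) {
            const int16_t *s = src[j] + i;
            int c = filter[j];
            filter_loads++;
            a0 += s[0] * c;
            a1 += s[1] * c;
            a2 += s[2] * c;
            a3 += s[3] * c;
        }
        dst[i]     = sat_u8(a0 >> 19);
        dst[i + 1] = sat_u8(a1 >> 19);
        dst[i + 2] = sat_u8(a2 >> 19);
        dst[i + 3] = sat_u8(a3 >> 19);
    }
    for (; i < dst_w; i++) {              /* leftover pixels */
        int acc = 0;
        for (int j = 0; j < filter_size; j++) {
            filter_loads++;
            acc += src[j][i] * filter[j];
        }
        dst[i] = sat_u8(acc >> 19);
    }
}
```

With dst_w = 8 and filter_size = 3, the counter reaches 24 after vscale_x1 and 6 after vscale_x4 (when reset in between), while both produce the same output. The 7.5% figure in the message is a measurement of total cache accesses on real hardware and is not something this toy counter reproduces.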

Re: [FFmpeg-devel] [aarch64] yuv2planeX - unroll outer loop by 4 to increase performance by 6.3%

2020-08-19 Thread Sebastian Pop
Thanks Michael for your feedback.
On Wed, Aug 19, 2020 at 6:55 AM Michael Niedermayer wrote:
> faster is better obviously, so if it's tested with odd sizes and arm
> developers had a chance to comment, it should be ok
The current patch was tested with `make check` on Arm64 Graviton2. I also h

Re: [FFmpeg-devel] [aarch64] yuv2planeX - unroll outer loop by 4 to increase performance by 6.3%

2020-09-03 Thread Sebastian Pop
ch?
Thanks,
Sebastian
On Wed, Aug 19, 2020 at 1:37 PM Sebastian Pop wrote:
> Thanks Michael for your feedback.
>
> On Wed, Aug 19, 2020 at 6:55 AM Michael Niedermayer wrote:
>> faster is better obviously, so if it's tested with odd sizes and arm
>> developers ha

Re: [FFmpeg-devel] [PATCH] swscale/aarch64: add hscale specializations

2022-03-01 Thread Sebastian Pop
44 > --- a/libswscale/aarch64/hscale.S > +++ b/libswscale/aarch64/hscale.S > @@ -1,5 +1,7 @@ > /* > * Copyright (c) 2016 Clément Bœsch > + * Copyright (c) 2019-2021 Sebastian Pop > + * Co