(from mf_close) is a
no-op in that case.
Signed-off-by: Martin Storsjö
---
libavcodec/mf_utils.c | 6 --
libavcodec/mfenc.c    | 2 ++
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/libavcodec/mf_utils.c b/libavcodec/mf_utils.c
index 48e3a63efc..50b9fdb2c4 100644
On Mon, 23 Jan 2023, Derek Buitenhuis wrote:
On 1/17/2023 9:31 AM, Martin Storsjö wrote:
Only warn if the advanced_editlist option is enabled (it is enabled
by default though) so we don't print one warning for each track, and
demote the warning to AV_LOG_LEVEL_VERBOSE; this message does get
On Wed, 18 Jan 2023, Anton Khirnov wrote:
Quoting Martin Storsjö (2023-01-15 23:47:41)
The construct of using offsetof on a (potentially anonymous) struct
defined within the offsetof expression, while supported by all
current compilers, has been declared explicitly undefined by the
C standards
On Sat, 28 Jan 2023, Martin Storsjö wrote:
Don't use "static const" for compile time float constants, but use
defines. This fixes the following error:
src/libavfilter/vf_ssim360.c(549): error C2099: initializer is not a constant
Signed-off-by: Martin Storsjö
---
libavfilter/vf_ssi
Don't use "static const" for compile time float constants, but use
defines. This fixes the following error:
src/libavfilter/vf_ssim360.c(549): error C2099: initializer is not a constant
Signed-off-by: Martin Storsjö
---
libavfilter/vf_ssim360.c | 6 +++---
1 file changed, 3 insert
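A minimal sketch of the issue this message describes, assuming the failing line initialized one static object from another "static const" float. The names and values below are hypothetical and are not taken from vf_ssim360.c. In C, a const-qualified object is not a constant expression, so a strict compiler may reject such an initializer; a macro expands to a literal and avoids the problem.

```c
/* Hypothetical constants, for illustration only. */
#define SCALE_A 0.01
#define SCALE_B 0.03

/*
 * Rejected by strict C compilers (MSVC reports C2099):
 *
 *     static const double scale_a = 0.01;
 *     static const double combined = scale_a * scale_a;
 *
 * With macros, the initializer is an arithmetic constant expression,
 * which is valid for an object with static storage duration.
 */
static const double combined = SCALE_A * SCALE_A + SCALE_B;

double get_combined(void)
{
    return combined;
}
```

Some compilers accept the "static const" form as an extension, which would explain why the error only shows up on certain toolchains.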
On Fri, 20 Jan 2023, Cameron Gutman wrote:
mfenc sets FF_CODEC_CAP_INIT_CLEANUP, so calling mf_close() on
failure inside mf_init() results in a double-free.
Signed-off-by: Cameron Gutman
---
libavcodec/mfenc.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/libavcodec/mfenc.c
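The ownership rule behind this fix can be sketched with a miniature model. Everything below (Ctx, ctx_init, ctx_close, open_codec) is hypothetical and is not FFmpeg API; it only assumes that a framework with an "init cleanup" capability calls the close callback itself when init fails, so init must not also call it.

```c
#include <stdlib.h>

/* Hypothetical codec context; not an FFmpeg type. */
typedef struct Ctx {
    void *buf;
    int close_calls;
} Ctx;

/* Close callback: releases everything and leaves the context reusable. */
static void ctx_close(Ctx *c)
{
    free(c->buf);
    c->buf = NULL;
    c->close_calls++;
}

/* Init callback: on failure it returns without cleaning up,
 * because the caller is responsible for calling ctx_close(). */
static int ctx_init(Ctx *c, int fail)
{
    c->buf = malloc(16);
    if (!c->buf || fail)
        return -1;
    return 0;
}

/* Caller mimicking the "init cleanup" behaviour: close runs exactly once. */
int open_codec(Ctx *c, int fail)
{
    int ret = ctx_init(c, fail);
    if (ret < 0)
        ctx_close(c);
    return ret;
}
```

If ctx_init() also called ctx_close() on its error path, the buffer would be released twice unless close resets its pointers, which is the double-free scenario described in the message.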
Only warn if the advanced_editlist option is enabled (it is enabled
by default though) so we don't print one warning for each track, and
demote the warning to AV_LOG_LEVEL_VERBOSE; this message does get
generated whenever parsing a fragmented MP4 file, regardless of
whether the file actually uses
] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2350.htm
[2]
https://github.com/llvm/llvm-project/commit/e327b52766ed497e4779f4e652b9ad237dfda8e6
[3] https://reviews.llvm.org/D133574#4053647
Signed-off-by: Martin Storsjö
---
libavutil/video_enc_params.c | 10 +-
1 file changed, 5
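The offsetof construct discussed in this thread (and in N2350, linked above) can be illustrated as follows. This is a sketch under assumptions: Payload and PayloadProbe are hypothetical types, and the code is not claimed to match the actual change in libavutil/video_enc_params.c. The idea is to declare the helper struct as a named type first, then pass that name to offsetof.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload type, for illustration only. */
typedef struct Payload {
    int64_t a;
    double b;
} Payload;

/*
 * Discouraged form, which defines a new type inside offsetof:
 *
 *     offsetof(struct { char c; Payload p; }, p)
 *
 * Preferred form: declare the helper struct up front.
 */
typedef struct PayloadProbe {
    char c;
    Payload p;
} PayloadProbe;

/* Offset of the payload after a single char, i.e. its required padding. */
size_t payload_offset(void)
{
    return offsetof(PayloadProbe, p);
}
```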
Hi Rui,
On Sat, 14 Jan 2023, Rui Ueyama wrote:
On Sat, 7 Jan 2023, Rui Ueyama wrote:
It looks like compiler-generated code always uses `b`, `bl` or `blx`
instructions for function calls. These instructions have a 24-bit
immediate and therefore can jump anywhere between PC +- 16 MiB.
This
On Mon, 9 Jan 2023, Martin Storsjö wrote:
Hi Rui,
Long time no see!
On Sat, 7 Jan 2023, Rui Ueyama wrote:
It looks like compiler-generated code always uses `b`, `bl` or `blx`
instructions for function calls. These instructions have a 24-bit
immediate and therefore can jump anywhere between
Hi Rui,
Long time no see!
On Sat, 7 Jan 2023, Rui Ueyama wrote:
It looks like compiler-generated code always uses `b`, `bl` or `blx`
instructions for function calls. These instructions have a 24-bit
immediate and therefore can jump anywhere between PC +- 16 MiB.
This hand-written assembly
On Thu, 17 Nov 2022, Martin Storsjö wrote:
Signed-off-by: Martin Storsjö
---
tests/fate/image.mak | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tests/fate/image.mak b/tests/fate/image.mak
index 167c8ccf2c..42dd90feaa 100644
--- a/tests/fate/image.mak
+++ b/tests/fate
Signed-off-by: Martin Storsjö
---
tests/fate/image.mak | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tests/fate/image.mak b/tests/fate/image.mak
index 167c8ccf2c..42dd90feaa 100644
--- a/tests/fate/image.mak
+++ b/tests/fate/image.mak
@@ -513,12 +513,13 @@ fate-tiff
On Tue, 8 Nov 2022, James Almer wrote:
Should fix fate failures on Windows x86 targets, where long is 32 bits.
Signed-off-by: James Almer
---
libavutil/tx_priv.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavutil/tx_priv.h b/libavutil/tx_priv.h
index
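The patch body is not shown in this snippet, so the following is only a sketch of the general pitfall, with a hypothetical helper name. On Windows, long is 32 bits wide even on 64-bit targets, so arithmetic that needs more than 32 bits should use a fixed-width type such as int64_t instead of long.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two 32-bit values without overflow.
 * Casting one operand to int64_t makes the product 64 bits wide on
 * every platform; casting to long would not on Windows. */
int64_t mul_wide(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}
```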
On Wed, 2 Nov 2022, Michael Niedermayer wrote:
On Wed, Nov 02, 2022 at 10:16:57PM +0100, Andreas Rheinhardt wrote:
Michael Niedermayer:
On Wed, Nov 02, 2022 at 10:02:39PM +0100, Michael Niedermayer wrote:
Fixes: integer overflow
Signed-off-by: Michael Niedermayer
---
libswscale/output.c
On Fri, 28 Oct 2022, Hubert Mazur wrote:
This patchset contains arm64 neon implementation of hscale functions.
Fixed minor style issues and declared C function wrappers as static.
This patchset does not contain the patch for checkasm tool, as the
previous one did. The reason behind it was failing
On Tue, 25 Oct 2022, Martin Storsjö wrote:
Treat the 32 bit stride registers as signed.
Alternatively, we could make the stride arguments ptrdiff_t instead
of int, and changing all of the assembly to operate on these
registers with their full 64 bit width, but that's a much larger
and more
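Why the signedness of a 32-bit stride matters on a 64-bit target can be shown in plain C. This is an illustrative sketch with hypothetical function names, not code from the patch: a negative int stride must be sign-extended before it is added to a 64-bit pointer, otherwise it turns into a large positive offset.

```c
#include <stdint.h>

/* What a 64-bit register holds if the 32-bit stride is zero-extended. */
uint64_t stride_zero_extended(int stride)
{
    return (uint64_t)(uint32_t)stride;
}

/* What it holds if the 32-bit stride is treated as signed. */
int64_t stride_sign_extended(int stride)
{
    return (int64_t)stride;
}
```

For a stride of -16, the sign-extended value stays -16, while the zero-extended value becomes 4294967280, which would move the pointer almost 4 GiB forward instead of 16 bytes back.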
On Wed, 19 Oct 2022, Martin Storsjö wrote:
Support for building with older versions of MSVC (with the
c99wrap/c99conv frontend) was removed in
ce943dd6acbfdfc40223c0fb24d4cad438e6499c.
Signed-off-by: Martin Storsjö
---
configure | 6 --
1 file changed, 6 deletions(-)
diff --git
operation, which
would clamp the intermediates to 32 bit still).
Fixes: https://trac.ffmpeg.org/ticket/9985
Signed-off-by: Martin Storsjö
---
libswscale/aarch64/yuv2rgb_neon.S | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/libswscale/aarch64/yuv2rgb_neon.S
b/libswscale
On Mon, 17 Oct 2022, Hubert Mazur wrote:
Provide arm64 neon optimized implementations for hscale16To19 with
filter sizes 4, 8 and X4.
The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.
hscale_16_to_19__fs_4_dstW_512_c: 6216.0
On Mon, 17 Oct 2022, Hubert Mazur wrote:
Add arm64 neon implementations for hscale 8 to 19 with filter
sizes 4, 4X and 8. Both implementations are based on very similar ones
dedicated to hscale 8 to 15. The major changes refer to saving
the data - instead of writing the result as int16_t it is
On Tue, 11 Oct 2022, J. Dekker wrote:
checkasm benchmark on Ampere Altra (Neoverse N1):
put_hevc_qpel_bi_h4_8_c: 170.7
put_hevc_qpel_bi_h4_8_neon: 64.5
put_hevc_qpel_bi_h6_8_c: 373.7
put_hevc_qpel_bi_h6_8_neon: 130.2
put_hevc_qpel_bi_h8_8_c: 662.0
put_hevc_qpel_bi_h8_8_neon: 138.5
Support for building with older versions of MSVC (with the
c99wrap/c99conv frontend) was removed in
ce943dd6acbfdfc40223c0fb24d4cad438e6499c.
Signed-off-by: Martin Storsjö
---
configure | 6 --
1 file changed, 6 deletions(-)
diff --git a/configure b/configure
index 6712d045d9..ed52212f93
On Sun, 9 Oct 2022, reimar.doeffin...@gmx.de wrote:
From: Reimar Döffinger
Currently it is done in several different ways, which
might cause needless dependencies or, in the case of
tx_float_neon.S, is incorrect.
Signed-off-by: Reimar Döffinger
---
libavcodec/aarch64/fft_neon.S | 3 +-
: Provide neon implementation of nsse8
lavc/aarch64: Provide optimized implementation of vsse8 for arm64.
lavc/aarch64: Add neon implementation for vsse_intra8
Martin Storsjö (3):
aarch64: me_cmp: Improve scheduling in ff_pix_abs8_y2_neon
aarch64: me_cmp: Fix up the prologue
Signed-off-by: Martin Storsjö
---
libavcodec/packet.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavcodec/packet.h b/libavcodec/packet.h
index 404d520071..f28e7e7011 100644
--- a/libavcodec/packet.h
+++ b/libavcodec/packet.h
@@ -161,7 +161,7 @@ enum
On Wed, 28 Sep 2022, Martin Storsjö wrote:
This should hopefully fix building with older toolchains, fixing
the fate failures on
http://fate.ffmpeg.org/history.cgi?slot=armel5tej-qemu-debian-gcc4.4.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/vc1dsp_neon.S | 40
---
This should hopefully fix the current build failures at
http://fate.ffmpeg.org/history.cgi?slot=riscv64-linux-gnu-clang-14.
---
libavcodec/riscv/fmtconvert_init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavcodec/riscv/fmtconvert_init.c
Signed-off-by: Martin Storsjö
---
libavcodec/aarch64/me_cmp_neon.S | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 832a7cb22d..c710358ab7 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b
This avoids one redundant load per row; pix3 from the previous
iteration can be used as pix2 in the next one.
Before:           Cortex A53    A72    A73
pix_abs_0_2_neon:      138.0   59.7   48.0
After:
pix_abs_0_2_neon:      109.7   50.2   39.5
Signed-off-by: Martin Storsjö
---
libavcodec/aarch64
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Provide optimized implementation for vsse_intra8 for arm64.
Performance tests are shown below.
- vsse_5_c: 87.7
- vsse_5_neon: 26.2
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
---
libavcodec/aarch64/me_cmp_init_aarch64.c |
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Provide optimized implementation of vsse8 for arm64.
Performance comparison tests are shown below.
- vsse_1_c: 141.5
- vsse_1_neon: 32.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Grzegorz Bernacki
---
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Add vectorized implementation of nsse8 function.
Performance comparison tests are shown below.
- nsse_1_c: 256.0
- nsse_1_neon: 82.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Grzegorz Bernacki
---
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below:
pix_abs_1_1_c: 162.5
pix_abs_1_1_neon: 27.0
pix_abs_1_2_c: 174.0
pix_abs_1_2_neon: 23.5
pix_abs_1_3_c: 203.2
pix_abs_1_3_neon: 34.7
On Wed, 28 Sep 2022, Rémi Denis-Courmont wrote:
On 28 September 2022 10:13:57 GMT+03:00, "Martin Storsjö" wrote:
Signed-off-by: Martin Storsjö
---
This should hopefully fix the compile failures on fate,
http://fate.ffmpeg.org/report.cgi?time=20220927222508=riscv64-linux-
This should hopefully fix building with older toolchains, fixing
the fate failures on
http://fate.ffmpeg.org/history.cgi?slot=armel5tej-qemu-debian-gcc4.4.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/vc1dsp_neon.S | 40 ++--
1 file changed, 20
Signed-off-by: Martin Storsjö
---
This should hopefully fix the compile failures on fate,
http://fate.ffmpeg.org/report.cgi?time=20220927222508=riscv64-linux-gnu-gcc-12
and
http://fate.ffmpeg.org/report.cgi?time=20220927225014=riscv64-linux-gnu-clang-14.
---
libavcodec/riscv/fmtconvert_rvv.S
On Mon, 26 Sep 2022, Marvin Scholz wrote:
As I am not sure who else to email about this, I'll just post it here.
I tried to register for Patchwork, however I got an error when registering.
I tried again and was told the account already exists, I tried to reset the
password for the account but
On Sat, 24 Sep 2022, Lynne wrote:
What about ac3dsp then - that one seems like it's fairly optimized for arm?
Haven't touched them, they're still being used. Unfortunately, for AC3,
the full MDCT optimizations in lavc do make a difference and the overall
decoder becomes 15% slower with this
On Sat, 24 Sep 2022, Hendrik Leppkes wrote:
On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes wrote:
On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö wrote:
>
> On Sat, 24 Sep 2022, Lynne wrote:
>
> > > This commit changes both the encoder and decoder to use the new lavu/tx code
On Sat, 24 Sep 2022, Lynne wrote:
This commit changes both the encoder and decoder to use the new lavu/tx code,
which has faster C transforms and more assembly optimizations.
What's the case of e.g. 32 bit arm - that does have a bunch of fft and
mdct assembly, but is that something that ends
On Tue, 20 Sep 2022, Hubert Mazur wrote:
This fixes issues addressed in previous patchset:
- move sub instruction in vsad8_intra,
- remove unnecessary mov instructions,
- remove single lane extraction in loop and place it at the end.
Removing mov instructions from pix_median_abs functions
On Tue, 13 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation for pix_median_abs16 function.
Forgot to update this part of the commit message here too.
Performance comparison tests are shown below.
- median_sad_1_c: 273.7
- median_sad_1_neon: 98.2
Benchmarks and tests run with
On Tue, 13 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation for pix_median_abs16 function.
You forgot to update this part of the commit message.
Performance comparison tests are shown below.
- vsad_5_c: 94.7
- vsad_5_neon: 20.7
Benchmarks and tests run with checkasm tool
On Tue, 13 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation for pix_median_abs16 function.
Performance comparison tests are shown below.
- median_sad_0_c: 722.0
- median_sad_0_neon: 144.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
On Sun, 25 Aug 2019, James Cowgill wrote:
When compiling FFmpeg with GCC-9, some very random segfaults were
observed in code which had previously called down into the SBC encoder
NEON assembly routines. This was caused by these functions clobbering
some of the vfp callee saved registers (d8 -
On Thu, 8 Sep 2022, Hubert Mazur wrote:
Fix minor issues in the patches.
Regarding vsse16 I didn't change saba & umlal to sub & smlal.
It doesn't affect the performance, so left it as it was.
The majority of changes refer to nsse16:
- fixed indentation (thanks for pointing out),
- applied the
On Tue, 6 Sep 2022, Hubert Mazur wrote:
Add vectorized implementation of nsse16 function.
Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 6 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation of vsse16 for arm64.
Performance comparison tests are shown below.
- vsse_0_c: 254.4
- vsse_0_neon: 64.7
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 6 Sep 2022, Hubert Mazur wrote:
Provide optimized implementations for me_cmp functions.
This set of patches fixes all issues addressed in previous review.
Major changes:
- Remove redundant loads since the data can be reused.
- Improve style.
- Fix issues with unrecognized symbols.
On Tue, 6 Sep 2022, Lukas Fellechner wrote:
There are really two separate issues here:
1. Running out of address space in 32-bit processes
It probably makes sense to limit auto threads to 16, but it should only
be done in 32-bit processes.
FWIW, this was my first approach, until Andreas
On Tue, 6 Sep 2022, Mattias Wadman wrote:
On Sat, Sep 3, 2022 at 3:41 AM Lynne wrote:
Needed for the next patch.
We get this for the extremely small cost of a branch on _ns functions,
which wouldn't be used anyway with assembly.
Patch attached.
Hi, I have issues building on macOS
This fixes building for x86 macOS (both i386 and x86_64) and
i386 windows.
---
v2: Add mangle() in a couple more places, that weren't noticed
on i386 windows.
---
libavutil/x86/tx_float.asm | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/libavutil/x86/tx_float.asm
This fixes building for e.g. i386 windows.
---
libavutil/x86/tx_float.asm | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/libavutil/x86/tx_float.asm b/libavutil/x86/tx_float.asm
index 1b9131e7fa..ace19788a6 100644
--- a/libavutil/x86/tx_float.asm
+++
On Mon, 5 Sep 2022, Martin Storsjö wrote:
This matches a similar cap on the number of automatic threads
in libavcodec/pthread_slice.c.
On systems with lots of cores, this does speed things up in
general (measurable on the level of the runtime of running
"make fate"), and fixes a c
This matches a similar cap on the number of automatic threads
in libavcodec/pthread_slice.c.
On systems with lots of cores, this does speed things up in
general (measurable on the level of the runtime of running
"make fate"), and fixes a couple fate failures in 32 bit mode on
such machines (where
On Mon, 5 Sep 2022, Andreas Rheinhardt wrote:
Martin Storsjö:
Limit the returned value from av_cpu_count to sensible amounts
in 32 bit builds.
This chosen limit, 64, is somewhat arbitrary - a 32 bit process
is capable of creating much more than 64 threads. But in many
cases, multiple parts
Limit the returned value from av_cpu_count to sensible amounts
in 32 bit builds.
This chosen limit, 64, is somewhat arbitrary - a 32 bit process
is capable of creating much more than 64 threads. But in many
cases, multiple parts of the encoding pipeline (decoder, filters,
encoders) all create a
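The cap described in this message can be sketched as a small helper. The limit of 64 comes from the message itself; the function name and the way the pointer width is passed in are assumptions made for illustration, and this is not the actual av_cpu_count() code.

```c
/* Limit taken from the message; described there as somewhat arbitrary. */
#define MAX_CPUS_32BIT 64

/* Hypothetical helper: clamp a detected CPU count in 32-bit builds only. */
int cap_cpu_count(int detected, int pointer_bits)
{
    if (pointer_bits == 32 && detected > MAX_CPUS_32BIT)
        return MAX_CPUS_32BIT;
    return detected;
}
```

A caller could pass (int)(sizeof(void *) * 8) as pointer_bits, so that 64-bit builds keep reporting the full count.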
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Add vectorized implementation of nsse16 function.
Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation for vsse_intra16 for arm64.
Performance tests are shown below.
- vsse_4_c: 153.7
- vsse_4_neon: 34.2
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation for vsad_intra16 function for arm64.
Performance comparison tests are shown below.
- vsad_4_c: 177.2
- vsad_4_neon: 24.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of vsse16 for arm64.
Performance comparison tests are shown below.
- vsse_0_c: 254.4
- vsse_0_neon: 64.7
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Sat, 3 Sep 2022, r...@remlab.net wrote:
From: Rémi Denis-Courmont
There are no particular reasons to force the compiler to use the same
register as output and input operand. This forces an extra MOV
instruction if the input value needs to be reused after the swap.
In most cases, this
On Sat, 3 Sep 2022, Andreas Rheinhardt wrote:
It is advantageous for ff_crop_tab, as the base pointer used to
access this table is not the first element of it. But the real
base pointer is still at a constant offset from the code/the GOT
and can therefore be accessed relative to the instruction
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of vsad16 function for arm64.
Performance comparison tests are shown below.
- vsad_0_c: 285.0
- vsad_0_neon: 42.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Add vectorized implementation of nsse16 function.
Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 9 Aug 2022, Martin Storsjö wrote:
These were missed when h264_chroma_mc_func was changed in
e4a94d8b36c48d95a7d412c40d7b558422ff659c.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/rv40dsp_init_arm.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
OK'd by Andreas
On Fri, 26 Aug 2022, Martin Storsjö wrote:
This fixes building for arm targets with optimizations disabled.
---
libavutil/arm/intmath.h | 24 ++--
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/libavutil/arm/intmath.h b/libavutil/arm/intmath.h
index 5311a7d52b
On Sat, 27 Aug 2022, Martin Storsjö wrote:
The AArch64 assembly accesses those symbols directly, without
indirection via e.g. the GOT on ELF. In order for this not to
require text relocations, those symbols need to be resolved fully
at link time, i.e. those symbols can't be interposable
that are accessed from AArch64 assembly
as hidden, so that they are resolved fully at link time even without
the version script and -Wl,-Bsymbolic.
Signed-off-by: Martin Storsjö
---
v4: Moved the attribute definition to a new, standalone header (which
only depends on libavutil/attributes.h
This fixes building for arm targets with optimizations disabled.
---
libavutil/arm/intmath.h | 24 ++--
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/libavutil/arm/intmath.h b/libavutil/arm/intmath.h
index 5311a7d52b..f19b21e98d 100644
---
These inline assembly functions rely on being inlined into the
caller, so that the parameter "int p" can be a known assembly time
constant, instead of a variable parameter.
__OPTIMIZE__ is a built-in define which is set by both GCC and Clang
(the two main compilers supporting our inline assembly)
On Sun, 14 Aug 2022, Lynne wrote:
The fastest fast Fourier transform in not just the west, but the world,
now for the most popular toy ISA.
On a high level, it follows the design of the AVX2 version closely,
with the exception that the input is slightly less permuted as we don't have
to do
On Thu, 18 Aug 2022, Alan Kelly wrote:
Thanks Martin for doing this.
On Thu, Aug 18, 2022 at 10:16 AM Martin Storsjö wrote:
This avoids triggering overflows in the filters, and avoids stray
test failures in the approximate functions on x86; due to rounding
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5
Benchmarks and tests are run with checkasm tool on AWS
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of sse8 function for arm64.
Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of pix_abs16_y2 function for arm64.
Performance comparison tests are shown below.
pix_abs_0_2_c: 317.2
pix_abs_0_2_neon: 37.5
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide neon implementation for sse4 function.
Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide neon implementation for sse16 function.
Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Add arm64 neon implementation for functions from motion estimation
family. All of them were tested and benchmarked using checkasm tool.
The rare code paths, e.g. when filter_size % 4 != 0 were also tested.
Instructions were manually deinterleaved to
This avoids triggering overflows in the filters, and avoids stray
test failures in the approximate functions on x86; due to rounding
differences, one implementation might overflow while another one
doesn't.
Signed-off-by: Martin Storsjö
---
FWIW, this modification runs successfully with over
On Wed, 17 Aug 2022, Ronald S. Bultje wrote:
On Wed, Aug 17, 2022 at 4:32 PM Martin Storsjö wrote:
This avoids overflows on some inputs in the x86 case, where the
assembly version would clip/overflow differently from the
C reference function.
This doesn't seem
more realistic output pixel
values, instead of having essentially all pixels clipped to either
0 or 255.
Signed-off-by: Martin Storsjö
---
tests/checkasm/sw_scale.c | 8
1 file changed, 8 insertions(+)
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index d72506ed86
Don't stop directly at the first differing pixel, but find the
one that differs by more than the expected accuracy.
Also print the failing value in check_yuv2yuvX.
Signed-off-by: Martin Storsjö
---
tests/checkasm/sw_scale.c | 14 ++
1 file changed, 10 insertions(+), 4 deletions
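The reporting change described in this message (skip differences within the expected accuracy and report the first pixel that exceeds it) can be sketched like this. The function name and signature are hypothetical and are not taken from tests/checkasm/sw_scale.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns the index of the first element whose absolute difference
 * exceeds max_diff, or -1 if all elements are within tolerance. */
int find_mismatch(const uint8_t *ref, const uint8_t *test,
                  size_t n, int max_diff)
{
    for (size_t i = 0; i < n; i++) {
        if (abs((int)ref[i] - (int)test[i]) > max_diff)
            return (int)i;
    }
    return -1;
}
```

With a tolerance of 1, a buffer that differs by 1 at index 1 and by 3 at index 2 is reported at index 2, which is the pixel that actually explains the failure.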
On Wed, 17 Aug 2022, Andreas Rheinhardt wrote:
Since d69d12a5b9236b9d2f1fd247ea452f84cdd1aaf9 these av_assert2()
(or more exactly, the ones in hadamard8_diff8x8_c() and
hadamard8_intra8x8_c()) are hit. So just remove all of these asserts.
(If the test were improved to know which functions
On Tue, 16 Aug 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0
On Tue, 16 Aug 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0
On Thu, 4 Aug 2022, Martin Storsjö wrote:
On Wed, 13 Jul 2022, Martin Storsjö wrote:
Previously, the checkasm test always passed h=8, so no other cases
were tested.
Out of the me_cmp functions, in practice, some functions are hardcoded
to always assume a 8x8 block (ignoring the h parameter
On Sat, 13 Aug 2022, Swinney, Jonathan wrote:
We don't generally use stdbool in ffmpeg, even if it's C99 - just use a
plain int and 0/1.
Updated this.
Other than that, the checkasm changes look fine (I coauthored part of
them - and your cleanup of my WIP patch looks good!).
Yes, thank you
On Sat, 13 Aug 2022, Swinney, Jonathan wrote:
This specialization handles the case where filtersize is 4 mod 8, e.g.
12, 20, etc. Aarch64 was previously using the c function for this case.
This implementation speeds up that case significantly.
hscale_8_to_15__fs_12_dstW_512_c: 6234.1
This was missed in db54426975e124e98e5130ad01316cb7afd60630.
Signed-off-by: Martin Storsjö
---
In practice, ptrdiff_t and int are the same type on arm, so these
didn't cause any warnings and haven't been caught due to that.
---
libavcodec/arm/vc1dsp_init_neon.c | 12 ++--
1 file changed
These were missed when h264_chroma_mc_func was changed in
e4a94d8b36c48d95a7d412c40d7b558422ff659c.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/rv40dsp_init_arm.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/libavcodec/arm/rv40dsp_init_arm.c
b/libavcodec/arm
On Thu, 23 Jun 2022, J. Dekker wrote:
old:
hevc_idct_16x16_8_c: 5366.2
hevc_idct_16x16_8_neon: 1493.2
new:
hevc_idct_16x16_8_c: 5363.2
hevc_idct_16x16_8_neon: 943.5
Co-developed-by: Rafal Dabrowa
Signed-off-by: J. Dekker
---
libavcodec/aarch64/hevcdsp_idct_neon.S| 666
On Thu, 23 Jun 2022, J. Dekker wrote:
Signed-off-by: J. Dekker
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/hevcdsp_deblock_neon.S | 168 ++
libavcodec/aarch64/hevcdsp_init_aarch64.c | 14 ++
3 files changed, 184 insertions(+), 1 deletion(-)
On Tue, 9 Aug 2022, Martin Storsjö wrote:
On Thu, 23 Jun 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c
On Thu, 23 Jun 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0
On Thu, 23 Jun 2022, J. Dekker wrote:
Signed-off-by: J. Dekker
---
libavcodec/aarch64/hevcdsp_idct_neon.S | 216 -
1 file changed, 108 insertions(+), 108 deletions(-)
LGTM, thanks!
// Martin
On Thu, 23 Jun 2022, J. Dekker wrote:
Signed-off-by: J. Dekker
---
tests/checkasm/hevc_add_res.c | 15 ---
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/tests/checkasm/hevc_add_res.c b/tests/checkasm/hevc_add_res.c
index 0c896adaca..f17d121939 100644
---
On Fri, 5 Aug 2022, Martin Storsjö wrote:
On Wed, 27 Jul 2022, Andreas Rheinhardt wrote:
Swinney, Jonathan:
This patch looks good to me. I would appreciate its merging.
} while (0)
#define PERF_STOP(t) do { \
+int ret
On Mon, 8 Aug 2022, James Almer wrote:
Signed-off-by: James Almer
---
libswscale/output.c | 4 ++--
tests/ref/fate/filter-pixdesc-vuya | 2 +-
tests/ref/fate/filter-pixfmts-copy | 2 +-
tests/ref/fate/filter-pixfmts-crop | 2 +-