Re: [FFmpeg-devel] [PATCH 1/3] riscv: add Zvbb vector bit manipulation extension

2024-05-07 Thread Martin Storsjö

On Tue, 7 May 2024, Rémi Denis-Courmont wrote:


---
Makefile  | 2 +-
configure | 3 +++
doc/APIchanges| 3 +++
ffbuild/arch.mak  | 1 +
libavutil/cpu.h   | 1 +
libavutil/tests/cpu.c | 1 +
tests/checkasm/checkasm.c | 1 +
7 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/libavutil/tests/cpu.c b/libavutil/tests/cpu.c
index d91bfeab5c..10e620963b 100644
--- a/libavutil/tests/cpu.c
+++ b/libavutil/tests/cpu.c
@@ -94,6 +94,7 @@ static const struct {
{ AV_CPU_FLAG_RVV_F32,   "zve32f" },
{ AV_CPU_FLAG_RVV_I64,   "zve64x" },
{ AV_CPU_FLAG_RVV_F64,   "zve64d" },
+{ AV_CPU_FLAG_RV_ZVBB,   "zvbb"  },
#endif
{ 0 }
};


Doesn't this test require you to add this extension to the list in 
libavutil/cpu.c as well?


// Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH] lavu/riscv: fix build without

2024-05-07 Thread Martin Storsjö

On Tue, 7 May 2024, Rémi Denis-Courmont wrote:


---
libavutil/riscv/cpu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavutil/riscv/cpu.c b/libavutil/riscv/cpu.c
index c3683b06d0..69d1afe853 100644
--- a/libavutil/riscv/cpu.c
+++ b/libavutil/riscv/cpu.c
@@ -29,14 +29,14 @@
#include 
#define HWCAP_RV(letter) (1ul << ((letter) - 'A'))
#endif
-#ifdef HAVE_SYS_HWPROBE_H
+#if HAVE_SYS_HWPROBE_H
#include <sys/hwprobe.h>
#endif

int ff_get_cpu_flags_riscv(void)
{
int ret = 0;
-#ifdef HAVE_SYS_HWPROBE_H
+#if HAVE_SYS_HWPROBE_H
struct riscv_hwprobe pairs[] = {
{ RISCV_HWPROBE_KEY_BASE_BEHAVIOR, 0 },
{ RISCV_HWPROBE_KEY_IMA_EXT_0, 0 },
--
2.43.0


LGTM

// Martin


Re: [FFmpeg-devel] [PATCH] checkasm/blockdsp: don't randomize the buffers for fill_block_tab

2024-05-07 Thread Martin Storsjö

On Tue, 7 May 2024, Andreas Rheinhardt wrote:


Martin Storsjö:

On Mon, 6 May 2024, James Almer wrote:


It ignores and overwrites the previous values.
Fixes running the test under ubsan.

Signed-off-by: James Almer 
---
tests/checkasm/blockdsp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)


The change is probably correct, but what issue is ubsan complaining
about? If this were just a dead store of unused random values, that
shouldn't be a ubsan issue in general, right?



UBSan complains about unaligned stores in randomize_buffers, which is
obvious given that i is incremented by 1, not by 2. I sent a patch that
fixes this without removing randomization:
https://ffmpeg.org/pipermail/ffmpeg-devel/2024-May/326945.html


Thanks, that explains it. Those two patches LGTM.
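
For reference, a minimal sketch of the kind of fix involved. This is not the actual randomize_buffers code from checkasm; fill_u16 and its parameters are made up for illustration. The loop advances by the element size, so each 16-bit store starts on an even offset, and memcpy is used so that the store is valid whatever the alignment of buf:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from checkasm): fill `size` bytes of `buf`
 * with consecutive 16-bit values starting at `first`.
 * Returns the number of 16-bit stores performed. */
size_t fill_u16(uint8_t *buf, size_t size, uint16_t first)
{
    size_t stores = 0;

    /* Step by sizeof(uint16_t). Stepping by 1 would make every other
     * 16-bit store start on an odd offset. */
    for (size_t i = 0; i + sizeof(uint16_t) <= size; i += sizeof(uint16_t)) {
        uint16_t v = (uint16_t)(first + stores);
        memcpy(buf + i, &v, sizeof(v)); /* valid at any alignment */
        stores++;
    }
    return stores;
}
```

A trailing odd byte, if any, is left untouched by this sketch.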

// Martin


Re: [FFmpeg-devel] [PATCH] checkasm/blockdsp: don't randomize the buffers for fill_block_tab

2024-05-06 Thread Martin Storsjö

On Mon, 6 May 2024, James Almer wrote:


It ignores and overwrites the previous values.
Fixes running the test under ubsan.

Signed-off-by: James Almer 
---
tests/checkasm/blockdsp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)


The change is probably correct, but what issue is ubsan complaining about? 
If this were just a dead store of unused random values, that shouldn't 
be a ubsan issue in general, right?


// Martin



Re: [FFmpeg-devel] [PATCH 2/2] lavu/riscv: add hwprobe() for CPU detection

2024-05-06 Thread Martin Storsjö

On Fri, 3 May 2024, Rémi Denis-Courmont wrote:


This adds the Linux-specific function call to detect CPU features. Unlike
the more portable auxiliary vector, this supports extensions other than
single-letter ones. At this point, FFmpeg already needs this to detect
Zba and Zbb at run-time, and probably will need it for Zvbb in the near
future.

Support will be available in glibc 2.40 onward.
---
configure |  3 +++
libavutil/riscv/cpu.c | 25 +
2 files changed, 28 insertions(+)



@@ -27,10 +29,33 @@
#include 
#define HWCAP_RV(letter) (1ul << ((letter) - 'A'))
#endif
+#ifdef HAVE_SYS_HWPROBE_H


Aren't these kinds of config.h macros always defined, but with the values 
0/1? I.e., shouldn't this use #if instead of #ifdef?
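
A minimal sketch of the difference, assuming a macro that is always defined and set to 0 when the feature is missing. HAVE_DEMO_FEATURE is a made-up stand-in for a real config.h entry:

```c
/* Hypothetical stand-in for a config.h entry: always defined, here
 * with the value 0 (feature not available). */
#define HAVE_DEMO_FEATURE 0

/* Returns 1 if the #ifdef-guarded branch was compiled in. */
int guarded_by_ifdef(void)
{
#ifdef HAVE_DEMO_FEATURE
    return 1; /* taken: the macro is defined, even though its value is 0 */
#else
    return 0;
#endif
}

/* Returns 1 if the #if-guarded branch was compiled in. */
int guarded_by_if(void)
{
#if HAVE_DEMO_FEATURE
    return 1;
#else
    return 0; /* taken: the macro evaluates to 0 */
#endif
}
```

With #ifdef, the guarded code is built even when the header is unavailable, which is the build failure one would expect in that configuration.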


// Martin


Re: [FFmpeg-devel] [PATCH] avcodec/x86/vp3dsp_init: Set correct function pointer, fix crash

2024-04-30 Thread Martin Storsjö

On Tue, 30 Apr 2024, Andreas Rheinhardt wrote:


Regression since fd172185580c1ccdcfb90bbfdb59fa806fad3117;
triggered by vp4/KTkvw8dg1J8.avi in the FATE suite, but not
when running fate as this code is not used when the bitexact
flag is set.

Bisecting done by ami_stuff, patch from user Mika Fischer
in ticket #10027 (which this commit fixes).

Signed-off-by: Andreas Rheinhardt 
---
libavcodec/x86/vp3dsp_init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/x86/vp3dsp_init.c b/libavcodec/x86/vp3dsp_init.c
index f54fa57b3e..edac1764cb 100644
--- a/libavcodec/x86/vp3dsp_init.c
+++ b/libavcodec/x86/vp3dsp_init.c
@@ -53,7 +53,7 @@ av_cold void ff_vp3dsp_init_x86(VP3DSPContext *c, int flags)

if (!(flags & AV_CODEC_FLAG_BITEXACT)) {
c->v_loop_filter = c->v_loop_filter_unaligned = ff_vp3_v_loop_filter_mmxext;
-c->h_loop_filter = c->v_loop_filter_unaligned = ff_vp3_h_loop_filter_mmxext;
+c->h_loop_filter = c->h_loop_filter_unaligned = ff_vp3_h_loop_filter_mmxext;
}
}

--
2.40.1


LGTM
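
For anyone skimming the thread, the effect of the one-character typo can be reproduced in isolation. The sketch below uses made-up names (DemoDSP, demo_*) and a simplified function signature, not the real VP3DSPContext:

```c
#include <stddef.h>

typedef int (*loop_filter_fn)(int); /* simplified, hypothetical signature */

typedef struct DemoDSP {
    loop_filter_fn v_loop_filter, v_loop_filter_unaligned;
    loop_filter_fn h_loop_filter, h_loop_filter_unaligned;
} DemoDSP;

int demo_v_filter(int x) { return x + 1; }
int demo_h_filter(int x) { return x + 2; }

/* Mirrors the pre-patch code: the second chained assignment names the
 * vertical "unaligned" pointer again. */
void demo_init_buggy(DemoDSP *c)
{
    c->v_loop_filter = c->v_loop_filter_unaligned = demo_v_filter;
    c->h_loop_filter = c->v_loop_filter_unaligned = demo_h_filter;
}

/* Mirrors the patched code. */
void demo_init_fixed(DemoDSP *c)
{
    c->v_loop_filter = c->v_loop_filter_unaligned = demo_v_filter;
    c->h_loop_filter = c->h_loop_filter_unaligned = demo_h_filter;
}
```

In the buggy variant, h_loop_filter_unaligned keeps whatever value it had before, and v_loop_filter_unaligned ends up pointing at the horizontal filter.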

// Martin



[FFmpeg-devel] [PATCH] checkasm: vc1dsp: Align buffers sufficiently for the mspel tests

2024-04-30 Thread Martin Storsjö
This fixes crashes in the mspel tests on x86.
---
 tests/checkasm/vc1dsp.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/checkasm/vc1dsp.c b/tests/checkasm/vc1dsp.c
index 407d9e5fe8..f18f0f8251 100644
--- a/tests/checkasm/vc1dsp.c
+++ b/tests/checkasm/vc1dsp.c
@@ -441,10 +441,10 @@ static void check_unescape(void)
 
 static void check_mspel_pixels(void)
 {
-LOCAL_ALIGNED_8(uint8_t, src0, [32 * 32]);
-LOCAL_ALIGNED_8(uint8_t, src1, [32 * 32]);
-LOCAL_ALIGNED_8(uint8_t, dst0, [32 * 32]);
-LOCAL_ALIGNED_8(uint8_t, dst1, [32 * 32]);
+LOCAL_ALIGNED_16(uint8_t, src0, [32 * 32]);
+LOCAL_ALIGNED_16(uint8_t, src1, [32 * 32]);
+LOCAL_ALIGNED_16(uint8_t, dst0, [32 * 32]);
+LOCAL_ALIGNED_16(uint8_t, dst1, [32 * 32]);
 
 VC1DSPContext h;
 
-- 
2.34.1
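
A minimal sketch of what the alignment requirement means, assuming (the patch does not say so explicitly) that the x86 functions under test use 16-byte aligned SIMD accesses. is_aligned and demo_buf are made-up names, and C11 alignas stands in for LOCAL_ALIGNED_16:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of `alignment` (which must be a power
 * of two), 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* A buffer that is guaranteed to start on a 16-byte boundary. */
alignas(16) uint8_t demo_buf[32 * 32];
```

A buffer that is only guaranteed 8-byte alignment may or may not happen to land on a 16-byte boundary, so a crash of this kind can depend on the stack layout of a particular build.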



Re: [FFmpeg-devel] [PATCH v3 0/2] HTTP Retry-After Support

2024-04-25 Thread Martin Storsjö

On Thu, 25 Apr 2024, Derek Buitenhuis wrote:


Changes since last set:
 * Updated commit message with RFC references.
 * Properly support Retry-After as both a date and integer number of seconds.

I have tested this against both an HTTP-Date and seconds, and confirmed
it to work.

Derek Buitenhuis (2):
 avformat/http: Rename parse_set_cookie_expiry_time to parse_http_date
 avformat/http: Add support for Retry-After header

doc/protocols.texi|  5 
libavformat/http.c| 62 ++-
libavformat/version.h |  2 +-
3 files changed, 49 insertions(+), 20 deletions(-)


Thanks, these patches LGTM.

// Martin



Re: [FFmpeg-devel] [PATCH v2 0/9] HTTP rate limiting and retry improvements

2024-04-24 Thread Martin Storsjö

On Mon, 22 Apr 2024, Derek Buitenhuis wrote:


This patch set adds support for properly handling HTTP 429 codes,
and their rate limiting, which is widely used and is standardized.

Changes since first set:
 * Added AVERROR_HTTP_TOO_MANY_REQUESTS to error_entries in error.c, per 
Andreas' review.
 * Made respect_retry_after unsigned and use strtoull, per James' review.
 * Added docs, as per Stefano's reviews.
 * Added a new option to limit the total reconnect delay.
* Unfortunate, but HTTP connection management is messy business.


I had a look over this patchset, and I had a handful of minor comments, 
but overall, the patchset seems fine to me. Thanks!


// Martin



Re: [FFmpeg-devel] [PATCH v2 6/9] avformat/http: Add options to set the max number of connection retries

2024-04-24 Thread Martin Storsjö

On Mon, 22 Apr 2024, Derek Buitenhuis wrote:


Not every use case benefits from setting retries in terms of the backoff.

Signed-off-by: Derek Buitenhuis 
---
libavformat/http.c| 12 +---
libavformat/version.h |  2 +-
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/libavformat/http.c b/libavformat/http.c
index 6927fea2fb..06bd3e340e 100644
--- a/libavformat/http.c
+++ b/libavformat/http.c
@@ -140,6 +140,7 @@ typedef struct HTTPContext {
uint64_t filesize_from_content_range;
int respect_retry_after;
unsigned int retry_after;
+int reconnect_max_retries;
} HTTPContext;

#define OFFSET(x) offsetof(HTTPContext, x)
@@ -178,6 +179,7 @@ static const AVOption options[] = {
{ "reconnect_on_http_error", "list of http status codes to reconnect on", OFFSET(reconnect_on_http_error), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, D },
{ "reconnect_streamed", "auto reconnect streamed / non seekable streams", OFFSET(reconnect_streamed), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, D },
{ "reconnect_delay_max", "max reconnect delay in seconds after which to give up", OFFSET(reconnect_delay_max), AV_OPT_TYPE_INT, { .i64 = 120 }, 0, UINT_MAX/1000/1000, D },
+{ "reconnect_max_retries", "the max number of times to retry a connection", OFFSET(reconnect_max_retries), AV_OPT_TYPE_INT, { .i64 = -1 }, -1, INT_MAX, D },
{ "respect_retry_after", "respect the Retry-After header when retrying connections", OFFSET(respect_retry_after), AV_OPT_TYPE_BOOL, { .i64 = 1 }, 0, 1, D },
{ "listen", "listen on HTTP", OFFSET(listen), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, 2, D | E },
{ "resource", "The resource requested by a client", OFFSET(resource), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, E },
@@ -359,7 +361,7 @@ static int http_open_cnx(URLContext *h, AVDictionary **options)
{
HTTPAuthType cur_auth_type, cur_proxy_auth_type;
HTTPContext *s = h->priv_data;
-int ret, auth_attempts = 0, redirects = 0;
+int ret, conn_attempts = 1, auth_attempts = 0, redirects = 0;
int reconnect_delay = 0;
uint64_t off;
char *cached;
@@ -386,7 +388,8 @@ redo:
ret = http_open_cnx_internal(h, options);
if (ret < 0) {
if (!http_should_reconnect(s, ret) ||
-reconnect_delay > s->reconnect_delay_max)
+reconnect_delay > s->reconnect_delay_max ||
+(s->reconnect_max_retries >= 0 && conn_attempts > s->reconnect_max_retries))
goto fail;

if (s->respect_retry_after && s->retry_after > 0) {
@@ -401,6 +404,7 @@ redo:
if (ret != AVERROR(ETIMEDOUT))
goto fail;
reconnect_delay = 1 + 2 * reconnect_delay;
+conn_attempts++;

/* restore the offset (http_connect resets it) */
s->off = off;
@@ -1706,6 +1710,7 @@ static int http_read_stream(URLContext *h, uint8_t *buf, int size)
int err, read_ret;
int64_t seek_ret;
int reconnect_delay = 0;
+int conn_attempt = 1;


Minor inconsistency; the corresponding variable in the other function was 
called conn_attempts, as a plural.
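
To make the interaction between the two limits concrete, here is a small simulation of the retry loop in http_open_cnx as changed by this patch, under the assumption that every connection attempt fails and that no Retry-After header is involved. count_retries is a made-up helper; only the two give-up conditions and the delay update are taken from the diff:

```c
/* Hypothetical simulation: returns how many reconnects are attempted
 * after the initial failed connection, before the loop gives up.
 * reconnect_delay_max is in seconds; reconnect_max_retries < 0 means
 * "no limit on the number of retries". */
int count_retries(int reconnect_delay_max, int reconnect_max_retries)
{
    int reconnect_delay = 0; /* seconds to wait before the next retry */
    int conn_attempts   = 1; /* the initial attempt counts as the first */
    int retries         = 0;

    for (;;) {
        if (reconnect_delay > reconnect_delay_max ||
            (reconnect_max_retries >= 0 &&
             conn_attempts > reconnect_max_retries))
            return retries;

        /* The real code sleeps for reconnect_delay seconds here and
         * then tries to connect again. */
        retries++;
        reconnect_delay = 1 + 2 * reconnect_delay; /* 0, 1, 3, 7, 15, ... */
        conn_attempts++;
    }
}
```

With the default reconnect_delay_max of 120 and no retry limit, this gives up once the delay reaches 127 seconds, i.e. after 7 retries; with reconnect_max_retries set to 3 it stops after 3.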


// Martin



Re: [FFmpeg-devel] [PATCH v2 4/9] avformat/http: Add support for Retry-After header

2024-04-24 Thread Martin Storsjö

On Mon, 22 Apr 2024, Derek Buitenhuis wrote:


429 and 503 codes can, and often do (e.g. all Google Cloud
Storage URLs can), return a Retry-After header with the error,
indicating how long to wait, in seconds, before retrying again.
If it is not respected by, for example, using our default backoff
strategy instead, success is very unlikely.

This adds an AVOption to respect that header.

Signed-off-by: Derek Buitenhuis 
---
libavformat/http.c| 12 
libavformat/version.h |  2 +-
2 files changed, 13 insertions(+), 1 deletion(-)


Is this feature standardized in an RFC, or is it some other spec somewhere? 
I think it would be nice to have a link to a spec in the commit message here.




diff --git a/libavformat/http.c b/libavformat/http.c
index e7603037f4..5ed481b63a 100644
--- a/libavformat/http.c
+++ b/libavformat/http.c
@@ -138,6 +138,8 @@ typedef struct HTTPContext {
char *new_location;
AVDictionary *redirect_cache;
uint64_t filesize_from_content_range;
+int respect_retry_after;
+unsigned int retry_after;
} HTTPContext;

#define OFFSET(x) offsetof(HTTPContext, x)
@@ -176,6 +178,7 @@ static const AVOption options[] = {
{ "reconnect_on_http_error", "list of http status codes to reconnect on", OFFSET(reconnect_on_http_error), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, D },
{ "reconnect_streamed", "auto reconnect streamed / non seekable streams", OFFSET(reconnect_streamed), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, D },
{ "reconnect_delay_max", "max reconnect delay in seconds after which to give up", OFFSET(reconnect_delay_max), AV_OPT_TYPE_INT, { .i64 = 120 }, 0, UINT_MAX/1000/1000, D },
+{ "respect_retry_after", "respect the Retry-After header when retrying connections", OFFSET(respect_retry_after), AV_OPT_TYPE_BOOL, { .i64 = 1 }, 0, 1, D },
{ "listen", "listen on HTTP", OFFSET(listen), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, 2, D | E },
{ "resource", "The resource requested by a client", OFFSET(resource), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, E },
{ "reply_code", "The http status code to return to a client", OFFSET(reply_code), AV_OPT_TYPE_INT, { .i64 = 200}, INT_MIN, 599, E},
@@ -386,6 +389,13 @@ redo:
reconnect_delay > s->reconnect_delay_max)
goto fail;

+if (s->respect_retry_after && s->retry_after > 0) {
+reconnect_delay = s->retry_after;


It'd be nice to have a comment to clarify the units of both values here, 
which apparently both happen to be integer seconds?



+if (reconnect_delay > s->reconnect_delay_max)
+goto fail;
+s->retry_after = 0;
+}
+
av_log(h, AV_LOG_WARNING, "Will reconnect at %"PRIu64" in %d second(s).\n", off, reconnect_delay);
ret = ff_network_sleep_interruptible(1000U * 1000 * reconnect_delay, &h->interrupt_callback);
if (ret != AVERROR(ETIMEDOUT))
@@ -1231,6 +1241,8 @@ static int process_line(URLContext *h, char *line, int line_count, int *parsed_h
parse_expires(s, p);
} else if (!av_strcasecmp(tag, "Cache-Control")) {
parse_cache_control(s, p);
+} else if (!av_strcasecmp(tag, "Retry-After")) {
+s->retry_after = strtoul(p, NULL, 10);


Can you add a comment here, to clarify what unit the value is expressed 
in?
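
As an illustration of what such a comment could pin down, here is a sketch of a parser for the delay-in-seconds form only. parse_retry_after_seconds is a made-up helper, not the code from the patch; it treats anything that is not a plain non-negative integer (including an HTTP-date) as "no delay given":

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a Retry-After value given as a whole
 * number of seconds. `value` must be a non-NULL, NUL-terminated string.
 * Returns the delay in seconds, or 0 if the value is not a plain
 * non-negative integer that fits in an unsigned int. */
unsigned int parse_retry_after_seconds(const char *value)
{
    char *end;
    unsigned long secs;

    /* strtoul would accept leading whitespace and a sign; require a
     * digit up front so that "-5" or a date string is rejected. */
    if (*value < '0' || *value > '9')
        return 0;

    errno = 0;
    secs = strtoul(value, &end, 10);
    if (errno == ERANGE || secs > UINT_MAX)
        return 0;

    /* Allow trailing whitespace, but nothing else. */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return 0;

    return (unsigned int)secs; /* unit: seconds */
}
```

Returning 0 for unparseable input matches the "s->retry_after > 0" check in the hunk above, in the sense that such a value would simply leave the default backoff in place.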


// Martin



Re: [FFmpeg-devel] [PATCH v2 2/9] avformat/http: Use AVERROR_HTTP_TOO_MANY_REQUESTS

2024-04-24 Thread Martin Storsjö

On Mon, 22 Apr 2024, Derek Buitenhuis wrote:


Added in thep previous commit.

Signed-off-by: Derek Buitenhuis 
---
libavformat/http.c | 6 ++
1 file changed, 6 insertions(+)

diff --git a/libavformat/http.c b/libavformat/http.c
index ed20359552..bbace2694f 100644
--- a/libavformat/http.c
+++ b/libavformat/http.c
@@ -286,6 +286,7 @@ static int http_should_reconnect(HTTPContext *s, int err)
case AVERROR_HTTP_UNAUTHORIZED:
case AVERROR_HTTP_FORBIDDEN:
case AVERROR_HTTP_NOT_FOUND:
+case AVERROR_HTTP_TOO_MANY_REQUESTS:
case AVERROR_HTTP_OTHER_4XX:
status_group = "4xx";
break;
@@ -522,6 +523,7 @@ int ff_http_averror(int status_code, int default_averror)
case 401: return AVERROR_HTTP_UNAUTHORIZED;
case 403: return AVERROR_HTTP_FORBIDDEN;
case 404: return AVERROR_HTTP_NOT_FOUND;
+case 429: return AVERROR_HTTP_TOO_MANY_REQUESTS;
default: break;
}
if (status_code >= 400 && status_code <= 499)
@@ -558,6 +560,10 @@ static int http_write_reply(URLContext* h, int status_code)
reply_code = 404;
reply_text = "Not Found";
break;
+case 429:
+reply_code = 429;
+reply_text = "Too Many Requests";
+break;
case 200:


This function seems to handle both the literal status codes, like 429, and 
also AVERROR style error codes, as when called from handle_http_errors, so 
perhaps it would be good for consistency to add the AVERROR here too.
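
Roughly what I have in mind, as a standalone sketch. The DEMO_ERR_* values and demo_reply_text are made up for illustration and stand in for the AVERROR_HTTP_* codes and the switch in http_write_reply:

```c
#include <stddef.h>

/* Made-up negative error codes standing in for AVERROR_HTTP_* values. */
#define DEMO_ERR_NOT_FOUND         (-1404)
#define DEMO_ERR_TOO_MANY_REQUESTS (-1429)

/* Hypothetical helper: accepts either a literal HTTP status code or
 * the matching error code, stores the status in *reply_code and
 * returns the reason phrase, or NULL for an unknown code. */
const char *demo_reply_text(int status_code, int *reply_code)
{
    switch (status_code) {
    case DEMO_ERR_NOT_FOUND:
    case 404:
        *reply_code = 404;
        return "Not Found";
    case DEMO_ERR_TOO_MANY_REQUESTS: /* error-code form */
    case 429:                        /* literal form */
        *reply_code = 429;
        return "Too Many Requests";
    case 200:
        *reply_code = 200;
        return "OK";
    default:
        return NULL;
    }
}
```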


// Martin



Re: [FFmpeg-devel] [PATCH v2 2/9] avformat/http: Use AVERROR_HTTP_TOO_MANY_REQUESTS

2024-04-24 Thread Martin Storsjö

On Mon, 22 Apr 2024, Derek Buitenhuis wrote:


Added in thep previous commit.


Typo in the commit message

// Martin



Re: [FFmpeg-devel] [PATCH] avdevice/avfoundation: fix macOS/iOS/tvOS SDK conditional checks

2024-04-24 Thread Martin Storsjö

On Wed, 17 Apr 2024, Marvin Scholz wrote:


This fixes the checks to properly use runtime feature detection and
check the SDK version (*_MAX_ALLOWED) instead of the targeted version
for the relevant APIs.


As these things are pretty hard to think straight about, it would be good 
to have a more concrete example of what this achieves. I.e. if building with 
-mmacosx-version-min=10.13, we can still use the macOS 10.15 specific 
APIs, if they were available at build time, via the runtime check.




The target is still checked (*_MIN_REQUIRED) to avoid using deprecated
methods when targeting new enough versions.
---
libavdevice/avfoundation.m | 164 ++---
1 file changed, 116 insertions(+), 48 deletions(-)


The diff is pretty hard to read as is, but when applied and viewed with 
"git show -w", it becomes clearer.


The change from TARGET_OS_IPHONE to TARGET_OS_IOS is pretty subtle; iirc 
TARGET_OS_IPHONE was any non-desktop platform (ios/tvos/watchos etc), 
while TARGET_OS_IOS specifically is iOS. The change looks right, but it 
might be good to spell this out as well.


Specifically also, that TARGET_OS_IPHONE covers a whole class of OSes, 
while TARGET_OS_IOS is one OS - but the version defines for that OS are 
__IPHONE_OS_VERSION_MIN_REQUIRED and __IPHONE_OS_VERSION_MAX_ALLOWED.



+  /* If the targeted macOS is new enough, this fallback case can never be reached, so do not
+   * use a deprecated API to avoid compiler warnings.
+   */


This sentence gets somewhat garbled at some point, so I don't think it 
reads quite the way you meant it.


What about this:

If the targeted macOS is new enough, use of older APIs will cause
deprecation warnings. Due to the availability check, we actually
won't ever execute the code in such builds, but the compiler will
still warn about it, unless we actually ifdef out the reference.


Outside of what the patch does, I see the existing file uses this 
construct in a few places:


#if !TARGET_OS_IPHONE && __MAC_OS_X_VERSION_MIN_REQUIRED >= 1070

I think it would seem more consistent to update this to use TARGET_OS_OSX 
instead of negating TARGET_OS_IPHONE - or is there something I'm missing?



As for alternative ways of doing this, that would be less unwieldy - I 
have something like this in mind:


#define SDK_AT_LEAST(macos, ios, tvos) \
(TARGET_OS_OSX && MAC_OS_X_VERSION_MAX_ALLOWED >= macos) || \
(TARGET_OS_IOS && __IPHONE_OS_VERSION_MAX_ALLOWED >= ios) || \
(TARGET_OS_TV && __TV_OS_VERSION_MAX_ALLOWED >= tvos)

#if SDK_AT_LEAST(__MAC_10_15, __IPHONE_10_0, __TVOS_17_0)

We could add similar macros for both SDK_AT_LEAST and 
TARGET_VERSION_AT_LEAST, and variants for different combinations of 
macos/ios/tvos for when we don't want to specify all of them.



We can't use defined(macos) etc within this context though, so if we want 
to go this way, we'd need to start out with ifdefs for all the defines we 
use, like this:


#ifndef __MAC_10_15
#define __MAC_10_15 
#endif

There's of course a bit of fragility here; we need to make sure that we 
actually copypaste the exact right value here. But on the other hand, we 
could even make it intentionally something else, e.g. like this:


#ifndef __MAC_10_15
// If the SDK doesn't define this constant, the SDK doesn't support this
// version anyway, and we won't end up selecting it, so just use a dummy
// value instead.

#define __MAC_10_15 
#endif


What do you think, does any of that seem like it would make the code more 
manageable?
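
To show the dummy-value idea end to end, here is a sketch that compiles anywhere. All DEMO_* names and values are made up; they stand in for the TARGET_OS_*, *_MAX_ALLOWED and __MAC_* macros so that the result does not depend on the SDK actually installed. It assumes the dummy is chosen larger than any version number of an SDK that lacks the constant:

```c
/* Pretend build environment: targeting macOS, with an SDK that knows
 * about versions up to 10.14 only. */
#define DEMO_TARGET_OS_OSX   1
#define DEMO_TARGET_OS_IOS   0
#define DEMO_MAC_SDK_VERSION 101400 /* stand-in for *_MAX_ALLOWED */
#define DEMO_IOS_SDK_VERSION 0

/* Version constants this pretend SDK provides. */
#define DEMO_MAC_10_14   101400
#define DEMO_IPHONE_10_0 100000

/* The pretend SDK has no constant for 10.15, so supply a dummy that
 * can never compare as "supported". */
#ifndef DEMO_MAC_10_15
#define DEMO_MAC_10_15 999999
#endif

#define DEMO_SDK_AT_LEAST(macos, ios) \
    ((DEMO_TARGET_OS_OSX && DEMO_MAC_SDK_VERSION >= (macos)) || \
     (DEMO_TARGET_OS_IOS && DEMO_IOS_SDK_VERSION >= (ios)))

/* Returns 1 if code needing the 10.14 SDK would be compiled in. */
int demo_have_10_14_api(void)
{
#if DEMO_SDK_AT_LEAST(DEMO_MAC_10_14, DEMO_IPHONE_10_0)
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if code needing the 10.15 SDK would be compiled in. */
int demo_have_10_15_api(void)
{
#if DEMO_SDK_AT_LEAST(DEMO_MAC_10_15, DEMO_IPHONE_10_0)
    return 1;
#else
    return 0; /* the dummy value keeps this branch out */
#endif
}
```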


// Martin



Re: [FFmpeg-devel] [PATCH v3 0/2] lavc/aarch64/fdct: add neon-optimized fdct for aarch64

2024-04-17 Thread Martin Storsjö

On Wed, 17 Apr 2024, Ramiro Polla wrote:


This patch set adds fdct to checkasm and neon-optimized fdct for aarch64.

Ramiro Polla (2):
 checkasm: add test for fdct
 lavc/aarch64/fdct: add neon-optimized fdct for aarch64

libavcodec/aarch64/Makefile   |   2 +
libavcodec/aarch64/fdct.h |  26 ++
libavcodec/aarch64/fdctdsp_init_aarch64.c |  39 +++
libavcodec/aarch64/fdctdsp_neon.S | 368 ++
libavcodec/avcodec.h  |   1 +
libavcodec/fdctdsp.c  |   4 +-
libavcodec/fdctdsp.h  |   2 +
libavcodec/options_table.h|   1 +
libavcodec/tests/aarch64/dct.c|   2 +
tests/checkasm/Makefile   |   1 +
tests/checkasm/checkasm.c |   3 +
tests/checkasm/checkasm.h |   1 +
tests/checkasm/fdctdsp.c  |  68 
tests/fate/checkasm.mak   |   1 +
14 files changed, 518 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/aarch64/fdct.h
create mode 100644 libavcodec/aarch64/fdctdsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/fdctdsp_neon.S
create mode 100644 tests/checkasm/fdctdsp.c


LGTM, thanks!

// Martin



[FFmpeg-devel] [PATCH] Remove .travis.yml

2024-04-17 Thread Martin Storsjö
Travis is no longer relevant for attempting to run CI jobs in our
setup.
---
 .travis.yml | 30 --
 1 file changed, 30 deletions(-)
 delete mode 100644 .travis.yml

diff --git a/.travis.yml b/.travis.yml
deleted file mode 100644
index 784b7bdf73..00
--- a/.travis.yml
+++ /dev/null
@@ -1,30 +0,0 @@
-language: c
-sudo: false
-os:
-  - linux
-  - osx
-addons:
-  apt:
-packages:
-  - nasm
-  - diffutils
-compiler:
-  - clang
-  - gcc
-matrix:
-exclude:
-- os: osx
-  compiler: gcc
-cache:
-  directories:
-- ffmpeg-samples
-before_install:
-  - if [ "$TRAVIS_OS_NAME" == "osx" ]; then brew update; fi
-install:
-  - if [ "$TRAVIS_OS_NAME" == "osx" ]; then brew install nasm; fi
-script:
-  - mkdir -p ffmpeg-samples
-  - ./configure --samples=ffmpeg-samples --cc=$CC
-  - make -j 8
-  - make fate-rsync
-  - make check -j 8
-- 
2.39.3 (Apple Git-146)



Re: [FFmpeg-devel] [PATCH v2] lavc/aarch64/fdct: add neon-optimized fdct for aarch64

2024-04-17 Thread Martin Storsjö

On Wed, 17 Apr 2024, Ramiro Polla wrote:


The code is imported from libjpeg-turbo-3.0.1. The neon registers used
have been changed to avoid modifying v8-v15.
---
libavcodec/aarch64/Makefile   |   2 +
libavcodec/aarch64/fdct.h |  26 ++
libavcodec/aarch64/fdctdsp_init_aarch64.c |  39 +++
libavcodec/aarch64/fdctdsp_neon.S | 368 ++
libavcodec/avcodec.h  |   1 +
libavcodec/fdctdsp.c  |   4 +-
libavcodec/fdctdsp.h  |   2 +
libavcodec/options_table.h|   1 +
libavcodec/tests/aarch64/dct.c|   2 +
tests/checkasm/Makefile   |   1 +
tests/checkasm/checkasm.c |   3 +
tests/checkasm/checkasm.h |   1 +
tests/checkasm/fdctdsp.c  |  68 
tests/fate/checkasm.mak   |   1 +
14 files changed, 518 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/aarch64/fdct.h
create mode 100644 libavcodec/aarch64/fdctdsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/fdctdsp_neon.S
create mode 100644 tests/checkasm/fdctdsp.c


Overall LGTM, thanks!

You may wish to split adding the checkasm test into a separate patch, 
before adding the new implementation.


I was surprised by the header libavcodec/aarch64/fdct.h which seemed 
redundant at first glance, but I see that this is needed for the dct test 
executable in libavcodec/tests/aarch64/dct.c, so I guess this is 
reasonable. (In most other asm implementations, we just declare the 
functions at the start of the *_init.c files.)


// Martin



Re: [FFmpeg-devel] [PATCH v2] tests/checkasm: add exclude_guest for non-x86 linux perf

2024-04-10 Thread Martin Storsjö

On Wed, 10 Apr 2024, J. Dekker wrote:


The exclude_guest option only has an effect on x86. Omitting
'exclude_guest' defaults to zero which implies that you can count guest
events should you run one. Some non-x86 kernels just ignore it, while
others (e.g. the Asahi Linux kernels) require the user to explicitly set
the option to 1, i.e. the only behaviour that makes sense when counting
guest events isn't supported.

Signed-off-by: J. Dekker 
---

Made commit message clearer, no functional change since v1.

tests/checkasm/checkasm.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index dcd2fd6957..8be6cb0f55 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -742,6 +742,9 @@ static int bench_init_linux(void)
.disabled   = 1, // start counting only on demand
.exclude_kernel = 1,
.exclude_hv = 1,
+#if !ARCH_X86
+.exclude_guest  = 1,
+#endif
};

printf("benchmarking with Linux Perf Monitoring API\n");
--
2.44.0


Thanks, the updated commit message feels more readable to me at least.

I'm not familiar with the perf API, but I tested perf on an aarch64 
machine where perf benchmarking previously worked, and it still works 
after this change, so it seems ok.


// Martin



Re: [FFmpeg-devel] [PATCH] movenc: Allow writing timed ID3 metadata

2024-04-10 Thread Martin Storsjö

On Tue, 9 Apr 2024, James Almer wrote:


On 4/4/2024 7:29 AM, Martin Storsjö wrote:

This is based on a spec at https://aomediacodec.github.io/id3-emsg/,
further based on ISO/IEC 23009-1:2019.

Within libavformat, timed ID3 metadata (already supported by the
mpegts demuxer and muxer) is handled as a separate data AVStream
with codec type AV_CODEC_ID_TIMED_ID3. However, it doesn't
have a corresponding track in the mov file - instead, these events
are written as separate toplevel 'emsg' boxes.
---
  libavformat/movenc.c   | 49 -
  libavformat/tests/movenc.c | 55 +-
  tests/ref/fate/movenc  |  8 ++
  3 files changed, 104 insertions(+), 8 deletions(-)


Should be ok.


Thanks for the review, pushed now.

// Martin


Re: [FFmpeg-devel] [PATCH] tests/movenc: Validate that normal muxer usage doesn't print warnings

2024-04-10 Thread Martin Storsjö

On Thu, 4 Apr 2024, Martin Storsjö wrote:


We have tests to make sure that certain configurations do print
warnings. However, the normal operation of the muxer within this
test always printed a warning, so those tests to check for
extra warnings essentially didn't guard anything.

The warning that always was printed, "track 1: codec frame size is
not set", was not present in the libav fork where this testcase
originated; it was removed in f234e8a32e6c69d7b63f8627f278be7c2c987f43.

Set the frame size for the audio stream to silence the warning,
and use this frame size in a couple later calculations, and check
that one test configuration doesn't print warnings.

Setting the frame size apparently changes the rounding of a timestamp
in the ismv muxing testcase.
---
libavformat/tests/movenc.c | 10 --
tests/ref/fate/movenc  |  2 +-
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/libavformat/tests/movenc.c b/libavformat/tests/movenc.c
index 77f73abdfa..12a3632d4e 100644
--- a/libavformat/tests/movenc.c
+++ b/libavformat/tests/movenc.c
@@ -215,6 +215,7 @@ static void init_fps(int bf, int audio_preroll, int fps)
st->codecpar->codec_type = AVMEDIA_TYPE_AUDIO;
st->codecpar->codec_id = AV_CODEC_ID_AAC;
st->codecpar->sample_rate = 44100;
+st->codecpar->frame_size = 1024;
st->codecpar->ch_layout = (AVChannelLayout)AV_CHANNEL_LAYOUT_STEREO;
st->time_base.num = 1;
st->time_base.den = 44100;
@@ -232,9 +233,10 @@ static void init_fps(int bf, int audio_preroll, int fps)
frames = 0;
gop_size = 30;
duration = video_st->time_base.den / fps;
-audio_duration = 1024LL * audio_st->time_base.den / audio_st->codecpar->sample_rate;
+audio_duration = (long long)audio_st->codecpar->frame_size *
+ audio_st->time_base.den / audio_st->codecpar->sample_rate;
if (audio_preroll)
-audio_preroll = 2048LL * audio_st->time_base.den / audio_st->codecpar->sample_rate;
+audio_preroll = 2 * audio_duration;

bframes = bf;
video_dts = bframes ? -duration : 0;
@@ -442,6 +444,7 @@ int main(int argc, char **argv)
// Similar to the previous one, but with input that doesn't start at
// pts/dts 0. avoid_negative_ts behaves in the same way as
// in non-empty-moov-no-elst above.
+init_count_warnings();
init_out("empty-moov-no-elst");
av_dict_set(&opts, "movflags", "+frag_keyframe+empty_moov", 0);
init(1, 0);
@@ -449,6 +452,9 @@ int main(int argc, char **argv)
finish();
close_out();

+reset_count_warnings();
+check(num_warnings == 0, "Unexpected warnings printed");
+
// Same as the previous one, but disable avoid_negative_ts (which
// would require using an edit list, but with empty_moov, one can't
// write a sensible edit list, when the start timestamps aren't known).
diff --git a/tests/ref/fate/movenc b/tests/ref/fate/movenc
index 968a3d27f2..0c77f5187c 100644
--- a/tests/ref/fate/movenc
+++ b/tests/ref/fate/movenc
@@ -20,7 +20,7 @@ write_data len 828, time nopts, type unknown atom -
write_data len 728, time 99, type sync atom moof
write_data len 812, time nopts, type unknown atom -
write_data len 148, time nopts, type trailer atom -
-92ce825ff40505ec8676191705adb7e7 4439 ismv
+d2df24d323f4a8896441cd91203ac5f8 4439 ismv
write_data len 36, time nopts, type header atom ftyp
write_data len 1123, time nopts, type header atom -
write_data len 796, time 0, type sync atom moof
--
2.39.3 (Apple Git-146)


Will push within a few days if there are no objections.

// Martin


Re: [FFmpeg-devel] [PATCH] movenc: Remove a leftover commented out line

2024-04-10 Thread Martin Storsjö

On Thu, 4 Apr 2024, Martin Storsjö wrote:


This line originates from 6f69f7a8bf6a0d013985578df2ef42ee6b1c7994.
---
libavformat/movenc.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/libavformat/movenc.c b/libavformat/movenc.c
index 46a5b3a62f..ccdd2dbfc9 100644
--- a/libavformat/movenc.c
+++ b/libavformat/movenc.c
@@ -1173,8 +1173,6 @@ static int get_samples_per_packet(MOVTrack *track)
{
int i, first_duration;

-// return track->par->frame_size;
-
/* use 1 for raw PCM */
if (!track->audio_vbr)
return 1;
--
2.39.3 (Apple Git-146)


Will apply.

// Martin


Re: [FFmpeg-devel] [PATCH] aarch64: Factorize code for CPU feature detection on Apple platforms

2024-04-10 Thread Martin Storsjö

On Tue, 12 Mar 2024, Martin Storsjö wrote:


---
libavutil/aarch64/cpu.c | 25 +
1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/libavutil/aarch64/cpu.c b/libavutil/aarch64/cpu.c
index 7a05391343..196bdaf6b0 100644
--- a/libavutil/aarch64/cpu.c
+++ b/libavutil/aarch64/cpu.c
@@ -45,22 +45,23 @@ static int detect_flags(void)
#elif defined(__APPLE__) && HAVE_SYSCTLBYNAME
#include <sys/sysctl.h>

+static int have_feature(const char *feature) {
+uint32_t value = 0;
+size_t size = sizeof(value);
+if (!sysctlbyname(feature, &value, &size, NULL, 0))
+return value;
+return 0;
+}
+
static int detect_flags(void)
{
-uint32_t value = 0;
-size_t size;
int flags = 0;

-size = sizeof(value);
-if (!sysctlbyname("hw.optional.arm.FEAT_DotProd", &value, &size, NULL, 0)) {
-if (value)
-flags |= AV_CPU_FLAG_DOTPROD;
-}
-size = sizeof(value);
-if (!sysctlbyname("hw.optional.arm.FEAT_I8MM", &value, &size, NULL, 0)) {
-if (value)
-flags |= AV_CPU_FLAG_I8MM;
-}
+if (have_feature("hw.optional.arm.FEAT_DotProd"))
+flags |= AV_CPU_FLAG_DOTPROD;
+if (have_feature("hw.optional.arm.FEAT_I8MM"))
+flags |= AV_CPU_FLAG_I8MM;
+
return flags;
}


Will apply.
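
For readers skimming the archive, the same factorization can be taken one step further with a small table. This is only a sketch, not what the patch does: the flag values, `feature_probe`, `detect_flags_with` and `fake_probe` are hypothetical names made up for illustration, and the probe is passed in so the example builds on any platform (on Apple platforms the probe would be a sysctlbyname() wrapper like have_feature() above).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values, for illustration only. */
#define DEMO_FLAG_DOTPROD (1 << 0)
#define DEMO_FLAG_I8MM    (1 << 1)

/* A probe returns nonzero if the named feature is present. */
typedef int (*feature_probe)(const char *name);

/* Map each feature name to a flag and OR together the ones that are present. */
static int detect_flags_with(feature_probe probe)
{
    static const struct {
        const char *name;
        int flag;
    } features[] = {
        { "hw.optional.arm.FEAT_DotProd", DEMO_FLAG_DOTPROD },
        { "hw.optional.arm.FEAT_I8MM",    DEMO_FLAG_I8MM    },
    };
    int flags = 0;

    for (size_t i = 0; i < sizeof(features) / sizeof(features[0]); i++)
        if (probe(features[i].name))
            flags |= features[i].flag;
    return flags;
}

/* Stand-in probe that pretends only DotProd is available. */
static int fake_probe(const char *name)
{
    return strcmp(name, "hw.optional.arm.FEAT_DotProd") == 0;
}
```

With only two features the two explicit if statements in the patch are arguably clearer; a table starts to pay off once more features are added.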

// Martin


Re: [FFmpeg-devel] [PATCH v3 3/5] configure: switch to shebang without space

2024-04-09 Thread Martin Storsjö

On Tue, 9 Apr 2024, J. Dekker wrote:


Note that the config.sh file is left without a shebang, this file is
supposed to be sourced into the current environment.

This commit is purely cosmetic.

Signed-off-by: J. Dekker 
---
configure | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)


Thanks, this set seems fine to me - the explanations seem good now. (I'd 
consider merging patches 3-5 though, but keeping the full commit message 
from patch 3.)


// Martin



Re: [FFmpeg-devel] [PATCH v2 2/2] configure: simplify bigendian check

2024-04-09 Thread Martin Storsjö

On Mon, 8 Apr 2024, J. Dekker wrote:


The preferred way to use LTO is --enable-lto but often times packagers
still end up with -flto in cflags for various reasons. Using grep
on binary object files is brittle and relies on specific object
representation, which in the case of LLVM bitcode, debug-info or other
intermediary formats can fail silently.

This patch changes the check to a more commonly used define for
big-endian systems.


It's not common only for big-endian systems, but for GCC-style compilers 
on all endians.



More checks may need to be added in the future to cover legacy machines.


Don't use the word "legacy" here. This define is not standard, so it's 
perfectly plausible to have a modern, standards compliant compiler that 
just doesn't use this define.


With the commit message you added here, the change is ok, but please do 
reword the last sentence above.


I'd suggest changing the last paragraph into this:

---
This patch changes the check to a more commonly used define for
GCC style compilers. More checks may be needed to cover other potential 
compilers that don't use the __BYTE_ORDER__ define.

---
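
To make the point about the define concrete, here is a minimal sketch of a check that uses `__BYTE_ORDER__` when the compiler provides it and reports "unknown" otherwise. It is not the configure check itself; the function names are hypothetical, and the runtime probe is included only as a cross-check.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 for big-endian, 0 for little-endian, or -1 if the compiler
 * does not provide the GCC-style __BYTE_ORDER__ define. */
static int bigendian_from_define(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__)
    return __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__;
#else
    return -1;
#endif
}

/* Runtime cross-check: look at the first byte of a known 32 bit value. */
static int bigendian_at_runtime(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 0x01;
}
```

A compiler that lacks the define ends up in the -1 branch, which is the case where further checks would be needed.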

// Martin



Re: [FFmpeg-devel] [PATCH v2 1/2] configure, etc: unify shebang usage

2024-04-09 Thread Martin Storsjö

On Mon, 8 Apr 2024, J. Dekker wrote:


In some cases, these scripts can be called directly by packagers, and
some systems require the interpreter to be explicit.


It is unclear to me which of the changes are needed and for what reason, 
please elaborate much more in the commit message.


Is it possible to elaborate on "some systems require the interpreter to be 
explicit"? It'd be much nicer to reason about if there was a concrete 
example of such a case (even if it certainly is right to add the missing 
shebang line).


The changes I see fall into these categories:

- Change "#! " into "#!". Does this change have a functional 
effect for someone (where, and why?) or is it purely a cosmetic change?
- Add a shebang line in the generated ffbuild/config.sh. This script is 
highly unlikely to be useful to call on its own like that, so while this 
probably is good for consistency I don't see it ever making a difference.
- Add a shebang line in ffbuild/libversion.sh. I can see the value in 
calling this script directly, outside of our build system. I presume this 
is the actual change that makes a difference here?


I don't mind the changes, but I'd prefer to split them into two separate 
commits; add missing shebangs (with an example of the case where it really 
does make a difference), vs removing extra spaces in shebangs for 
consistency (with explicit clarification in the commit message whether 
this is only for stylistic consistency or whether it does make a 
difference somewhere, and if it does, where).


// Martin



[FFmpeg-devel] [PATCH] aarch64: ac3dsp: Simplify the end of ff_ac3_sum_square_butterfly_float_neon

2024-04-08 Thread Martin Storsjö
Before:   Cortex A53 A72 A78
ac3_sum_square_bufferfly_float_neon:  1005.7   516.5   224.5
After:
ac3_sum_square_bufferfly_float_neon:   981.7   504.5   223.2
---
 libavcodec/aarch64/ac3dsp_neon.S | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
index 20beb6cc50..7e97cc39f7 100644
--- a/libavcodec/aarch64/ac3dsp_neon.S
+++ b/libavcodec/aarch64/ac3dsp_neon.S
@@ -103,17 +103,9 @@ function ff_ac3_sum_square_butterfly_float_neon, export=1
 fmla    v3.4s, v17.4s, v17.4s
 subs    w3, w3, #4
 b.gt    1b
-faddp   v0.4s, v0.4s, v0.4s
-faddp   v0.2s, v0.2s, v0.2s
-st1 {v0.s}[0], [x0], #4
-faddp   v1.4s, v1.4s, v1.4s
-faddp   v1.2s, v1.2s, v1.2s
-st1 {v1.s}[0], [x0], #4
-faddp   v2.4s, v2.4s, v2.4s
-faddp   v2.2s, v2.2s, v2.2s
-st1 {v2.s}[0], [x0], #4
-faddp   v3.4s, v3.4s, v3.4s
-faddp   v3.2s, v3.2s, v3.2s
-st1 {v3.s}[0], [x0]
+faddp   v0.4s, v0.4s, v1.4s
+faddp   v2.4s, v2.4s, v3.4s
+faddp   v0.4s, v0.4s, v2.4s
+st1 {v0.4s}, [x0]
 ret
 endfunc
-- 
2.39.3 (Apple Git-146)
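
For anyone who wants to convince themselves that three faddp instructions are enough, the reduction can be modelled in plain C. This is only a sketch: `faddp4` is a hypothetical scalar model of "faddp Vd.4s, Vn.4s, Vm.4s" (pairwise adds over the lanes of n, then m), and `reduce4` mirrors the three-instruction sequence in the patch.

```c
/* Scalar model of AArch64 "faddp Vd.4s, Vn.4s, Vm.4s": adjacent lanes of n
 * are added pairwise into the low half of d, adjacent lanes of m into the
 * high half. A temporary is used so d may alias n or m. */
static void faddp4(float d[4], const float n[4], const float m[4])
{
    const float r[4] = { n[0] + n[1], n[2] + n[3], m[0] + m[1], m[2] + m[3] };

    for (int i = 0; i < 4; i++)
        d[i] = r[i];
}

/* Reduce four 4-lane accumulators to one vector of four horizontal sums. */
static void reduce4(float out[4], const float v0[4], const float v1[4],
                    const float v2[4], const float v3[4])
{
    float lo[4], hi[4];

    faddp4(lo, v0, v1);   /* { v0 pairs, v1 pairs } */
    faddp4(hi, v2, v3);   /* { v2 pairs, v3 pairs } */
    faddp4(out, lo, hi);  /* { sum(v0), sum(v1), sum(v2), sum(v3) } */
}
```

Each output lane is still computed as (a0 + a1) + (a2 + a3), the same pairing as in the removed per-register sequence, so the summation order per accumulator does not change.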



Re: [FFmpeg-devel] [PATCH v4 0/5] avcodec/ac3: Add aarch64 NEON DSP

2024-04-08 Thread Martin Storsjö

On Sat, 6 Apr 2024, Geoff Hill wrote:


Thanks Martin for your review and testing.

Here's v4 with the following changes:

 * Use fmla in sum_square_butterfly_float loop. Faster.

 * Removed redundant loop bound zero checks in extract_exponents,
   sum_square_butterfly_int32 and sum_square_butterfly_float.

 * Fixed randomize_int24() to also use negative values.

 * Carry copyright from arm implementation over to aarch64. I
   did use this version as reference.

 * Fix indentation to match existing aarch64 assembly style.

Tested once again on aarch64 and x86.


Thanks, this set looked good, so I pushed it.

I amended the commits a bit, moving the added copyright line from 
checkasm/ac3dsp.c from patch 1 to 2, where that file actually gets 
extended.


Actually, after pushing, I realized another thing that can be done better 
in ff_ac3_sum_square_butterfly_float_neon - I'll send a patch for that.



On AWS Graviton2 (t4g.medium), GCC 12.3:

$ tests/checkasm/checkasm --bench --test=ac3dsp
...
NEON:
- ac3dsp.ac3_exponent_min   [OK]
- ac3dsp.ac3_extract_exponents  [OK]
- ac3dsp.float_to_fixed24   [OK]
- ac3dsp.ac3_sum_square_butterfly_int32 [OK]
- ac3dsp.ac3_sum_square_butterfly_float [OK]
checkasm: all 20 tests passed
float_to_fixed24_c: 2460.5
float_to_fixed24_neon: 561.5


FWIW, it's usually neater to include such numbers in the commit message, 
so it gets brought along into the final git history (to show the benefit 
we got from the optimization at the time), quoting only those functions 
that are added/modified in each patch. I didn't amend that into the 
commit messages this time, but you can keep it in mind for the future.


Anyway, thanks for the patches!

// Martin




Re: [FFmpeg-devel] [PATCH v3 5/5] avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON

2024-04-04 Thread Martin Storsjö

On Tue, 2 Apr 2024, Geoff Hill wrote:


Signed-off-by: Geoff Hill 
---
libavcodec/aarch64/ac3dsp_init_aarch64.c |  5 
libavcodec/aarch64/ac3dsp_neon.S | 35 
tests/checkasm/ac3dsp.c  | 26 ++
3 files changed, 66 insertions(+)

diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
index fa8fcf2e47..4a78ec0b2a 100644
--- a/libavcodec/aarch64/ac3dsp_neon.S
+++ b/libavcodec/aarch64/ac3dsp_neon.S
@@ -88,3 +88,38 @@ function ff_ac3_sum_square_butterfly_int32_neon, export=1
st1 {v0.1d-v3.1d}, [x0]
1:  ret
endfunc
+
+function ff_ac3_sum_square_butterfly_float_neon, export=1
+cbz w3, 1f
+movi v0.4s, #0
+movi v1.4s, #0
+movi v2.4s, #0
+movi v3.4s, #0
+0:  ld1 {v30.4s}, [x1], #16
+ld1 {v31.4s}, [x2], #16
+fadd v16.4s, v30.4s, v31.4s
+fsub v17.4s, v30.4s, v31.4s
+fmul v30.4s, v30.4s, v30.4s
+fadd v0.4s, v0.4s, v30.4s


The arm version here used vmla instead of separate vmul+vadd - is there 
any reason why we can't use fmla here?
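
For context, this is roughly what the function computes, written as scalar C. It is a sketch based on my reading of the assembly above, not a copy of the C reference; the name `sum_square_butterfly_float_ref` is hypothetical.

```c
/* Accumulate the energy of the left, right, mid (l+r) and side (l-r)
 * signals over len coefficients. */
static void sum_square_butterfly_float_ref(float sum[4], const float *coef0,
                                           const float *coef1, int len)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;

    for (int i = 0; i < len; i++) {
        float lt = coef0[i];
        float rt = coef1[i];
        float md = lt + rt;
        float sd = lt - rt;

        /* Each of these is a multiply-accumulate, i.e. what a single
         * fmla does per lane instead of a separate fmul + fadd. */
        s0 += lt * lt;
        s1 += rt * rt;
        s2 += md * md;
        s3 += sd * sd;
    }
    sum[0] = s0;
    sum[1] = s1;
    sum[2] = s2;
    sum[3] = s3;
}
```

Note that fmla is a fused operation, so its rounding can differ slightly from a separate multiply and add; for a checkasm comparison with a float tolerance that should not matter.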


// Martin



Re: [FFmpeg-devel] [PATCH v3 4/5] avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON

2024-04-04 Thread Martin Storsjö

On Tue, 2 Apr 2024, Geoff Hill wrote:


Signed-off-by: Geoff Hill 
---
libavcodec/aarch64/ac3dsp_init_aarch64.c |  5 +
libavcodec/aarch64/ac3dsp_neon.S | 24 +
tests/checkasm/ac3dsp.c  | 27 
3 files changed, 56 insertions(+)

diff --git a/libavcodec/aarch64/ac3dsp_init_aarch64.c 
b/libavcodec/aarch64/ac3dsp_init_aarch64.c
index 1bdc215b51..e95436c651 100644
--- a/libavcodec/aarch64/ac3dsp_init_aarch64.c
+++ b/libavcodec/aarch64/ac3dsp_init_aarch64.c
@@ -28,6 +28,10 @@
void ff_ac3_exponent_min_neon(uint8_t *exp, int num_reuse_blocks, int nb_coefs);
void ff_ac3_extract_exponents_neon(uint8_t *exp, int32_t *coef, int nb_coefs);
void ff_float_to_fixed24_neon(int32_t *dst, const float *src, size_t len);
+void ff_ac3_sum_square_butterfly_int32_neon(int64_t sum[4],
+const int32_t *coef0,
+const int32_t *coef1,
+int len);

av_cold void ff_ac3dsp_init_aarch64(AC3DSPContext *c)
{
@@ -37,4 +41,5 @@ av_cold void ff_ac3dsp_init_aarch64(AC3DSPContext *c)
c->ac3_exponent_min = ff_ac3_exponent_min_neon;
c->extract_exponents = ff_ac3_extract_exponents_neon;
c->float_to_fixed24 = ff_float_to_fixed24_neon;
+c->sum_square_butterfly_int32 = ff_ac3_sum_square_butterfly_int32_neon;
}
diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
index b26f71a3f6..fa8fcf2e47 100644
--- a/libavcodec/aarch64/ac3dsp_neon.S
+++ b/libavcodec/aarch64/ac3dsp_neon.S
@@ -64,3 +64,27 @@ function ff_float_to_fixed24_neon, export=1
b.ne 0b
ret
endfunc
+
+function ff_ac3_sum_square_butterfly_int32_neon, export=1
+cbz w3, 1f


The arm version of this patch doesn't have any corresponding check for 
whether this parameter is zero, and the checkasm test doesn't test that 
behaviour either. Is that never feasible (and we could leave it out here) 
or should we test that and fix it in other assembly versions? In the 
latter case, it's of course ok to defer that to a separate later patch, 
not holding up this one.


// Martin




Re: [FFmpeg-devel] [PATCH v3 0/5] avcodec/ac3: Add aarch64 NEON DSP

2024-04-04 Thread Martin Storsjö

On Tue, 2 Apr 2024, Geoff Hill wrote:


Here's v3 to push the AC-3 ARMv8 NEON experiment a step further.

This version implements 5 of the AC-3 encoder DSP functions,
and adds checkasm tests where missing.

I've tested that the checkasm tests pass on aarch64 and x86.


Thanks, I've tested that checkasm also passes on 32 bit arm (where we also 
do have an ac3dsp implementation).


Overall the patches look mostly fine.

Are these implementations based on the existing 32 bit arm ones? The code 
is quite similar (although there aren't very many different ways to 
implement things, so this could be a coincidence). If based on the 
existing code, it would be good to retain the copyright statement from 
that file.


These functions have a different indentation than the rest of 
essentially all our aarch64 assembly (the code you're adding is aligned in 
two different ways) - please check other files (e.g. vp8dsp_neon.S) for 
example. The instructions should be aligned to 8 leading spaces, and 
operands to 24 leading characters.


Other than those generic points, I have two comments on the patches 
themselves.


// Martin



[FFmpeg-devel] [PATCH] movenc: Allow writing timed ID3 metadata

2024-04-04 Thread Martin Storsjö
This is based on a spec at https://aomediacodec.github.io/id3-emsg/,
further based on ISO/IEC 23009-1:2019.

Within libavformat, timed ID3 metadata (already supported by the
mpegts demuxer and muxer) is handled as a separate data AVStream
with codec type AV_CODEC_ID_TIMED_ID3. However, it doesn't
have a corresponding track in the mov file - instead, these events
are written as separate toplevel 'emsg' boxes.
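
As a rough illustration of the box layout, here is a standalone sketch that serializes a version 1 'emsg' box into a byte buffer, following the field order used in mov_write_emsg_tag() below. It does not use avio; `put_be` and `write_emsg_v1` are hypothetical helpers, and the event_duration value of 0xFFFFFFFF (meaning "unknown", as I read the DASH spec) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write the low nbytes of v in big-endian order; returns nbytes. */
static size_t put_be(uint8_t *p, uint64_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
    return (size_t)nbytes;
}

/* Serialize one version 1 emsg box carrying an ID3 payload.
 * Returns the box size in bytes, or 0 if it does not fit. */
static size_t write_emsg_v1(uint8_t *buf, size_t cap, uint32_t timescale,
                            uint64_t pts, const uint8_t *data, size_t size)
{
    static const char scheme[] = "https://aomedia.org/emsg/ID3";
    static const char value[]  = "";
    /* 32 fixed bytes: size, type, version+flags, timescale,
     * presentation_time, event_duration, id. */
    size_t total = 32 + sizeof(scheme) + sizeof(value) + size;
    size_t pos = 0;

    if (total > cap || total > UINT32_MAX)
        return 0;

    pos += put_be(buf + pos, total, 4);        /* size */
    memcpy(buf + pos, "emsg", 4);              /* type */
    pos += 4;
    pos += put_be(buf + pos, 1, 1);            /* version */
    pos += put_be(buf + pos, 0, 3);            /* flags */
    pos += put_be(buf + pos, timescale, 4);    /* timescale */
    pos += put_be(buf + pos, pts, 8);          /* presentation_time */
    pos += put_be(buf + pos, 0xFFFFFFFFu, 4);  /* event_duration (assumed: unknown) */
    pos += put_be(buf + pos, 0, 4);            /* id */
    memcpy(buf + pos, scheme, sizeof(scheme)); /* NUL terminated */
    pos += sizeof(scheme);
    memcpy(buf + pos, value, sizeof(value));   /* NUL terminated */
    pos += sizeof(value);
    if (size) {
        memcpy(buf + pos, data, size);         /* raw ID3 tag */
        pos += size;
    }
    return pos;
}
```

The muxer itself writes a zero size first and patches it afterwards with update_size(); computing the size up front is just simpler for a fixed buffer.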
---
 libavformat/movenc.c   | 49 -
 libavformat/tests/movenc.c | 55 +-
 tests/ref/fate/movenc  |  8 ++
 3 files changed, 104 insertions(+), 8 deletions(-)

diff --git a/libavformat/movenc.c b/libavformat/movenc.c
index ccdd2dbfc9..29b1e4bb0f 100644
--- a/libavformat/movenc.c
+++ b/libavformat/movenc.c
@@ -5515,7 +5515,7 @@ static int mov_write_ftyp_tag(AVIOContext *pb, 
AVFormatContext *s)
 {
 MOVMuxContext *mov = s->priv_data;
 int64_t pos = avio_tell(pb);
-int has_h264 = 0, has_av1 = 0, has_video = 0, has_dolby = 0;
+int has_h264 = 0, has_av1 = 0, has_video = 0, has_dolby = 0, has_id3 = 0;
 int has_iamf = 0;
 
 for (int i = 0; i < s->nb_stream_groups; i++) {
@@ -5544,6 +5544,8 @@ static int mov_write_ftyp_tag(AVIOContext *pb, 
AVFormatContext *s)
 st->codecpar->nb_coded_side_data,
 AV_PKT_DATA_DOVI_CONF))
 has_dolby = 1;
+if (st->codecpar->codec_id == AV_CODEC_ID_TIMED_ID3)
+has_id3 = 1;
 }
 
 avio_wb32(pb, 0); /* size */
@@ -5623,6 +5625,9 @@ static int mov_write_ftyp_tag(AVIOContext *pb, 
AVFormatContext *s)
 if (mov->flags & FF_MOV_FLAG_DASH && mov->flags & FF_MOV_FLAG_GLOBAL_SIDX)
 ffio_wfourcc(pb, "dash");
 
+if (has_id3)
+ffio_wfourcc(pb, "aid3");
+
 return update_size(pb, pos);
 }
 
@@ -6704,6 +6709,34 @@ static int mov_build_iamf_packet(AVFormatContext *s, 
MOVTrack *trk, AVPacket *pk
 return ret;
 }
 
+static int mov_write_emsg_tag(AVIOContext *pb, AVStream *st, AVPacket *pkt)
+{
+int64_t pos = avio_tell(pb);
+const char *scheme_id_uri = "https://aomedia.org/emsg/ID3";
+const char *value = "";
+
+av_assert0(st->time_base.num == 1);
+
+avio_write_marker(pb,
+  av_rescale_q(pkt->pts, st->time_base, AV_TIME_BASE_Q),
+  AVIO_DATA_MARKER_BOUNDARY_POINT);
+
+avio_wb32(pb, 0); /* size */
+ffio_wfourcc(pb, "emsg");
+avio_w8(pb, 1); /* version */
+avio_wb24(pb, 0);
+avio_wb32(pb, st->time_base.den); /* timescale */
+avio_wb64(pb, pkt->pts); /* presentation_time */
+avio_wb32(pb, 0xFFFFFFFFU); /* event_duration */
+avio_wb32(pb, 0); /* id */
+/* null terminated UTF8 strings */
+avio_write(pb, scheme_id_uri, strlen(scheme_id_uri) + 1);
+avio_write(pb, value, strlen(value) + 1);
+avio_write(pb, pkt->data, pkt->size);
+
+return update_size(pb, pos);
+}
+
 static int mov_write_packet(AVFormatContext *s, AVPacket *pkt)
 {
 MOVMuxContext *mov = s->priv_data;
@@ -6714,6 +6747,11 @@ static int mov_write_packet(AVFormatContext *s, AVPacket 
*pkt)
 return 1;
 }
 
+if (s->streams[pkt->stream_index]->codecpar->codec_id == 
AV_CODEC_ID_TIMED_ID3) {
+mov_write_emsg_tag(s->pb, s->streams[pkt->stream_index], pkt);
+return 0;
+}
+
 trk = s->streams[pkt->stream_index]->priv_data;
 
 if (trk->iamf) {
@@ -7365,6 +7403,12 @@ static int mov_init(AVFormatContext *s)
 AVStream *st = s->streams[i];
 if (st->priv_data)
 continue;
+// Don't produce a track in the output file for timed ID3 streams.
+if (st->codecpar->codec_id == AV_CODEC_ID_TIMED_ID3) {
+// Leave priv_data set to NULL for these AVStreams that don't
+// have a corresponding track.
+continue;
+}
 st->priv_data = st;
 mov->nb_tracks++;
 }
@@ -7462,6 +7506,9 @@ static int mov_init(AVFormatContext *s)
 MOVTrack *track = st->priv_data;
 AVDictionaryEntry *lang = av_dict_get(st->metadata, "language", 
NULL,0);
 
+if (!track)
+continue;
+
 if (!track->st) {
 track->st  = st;
 track->par = st->codecpar;
diff --git a/libavformat/tests/movenc.c b/libavformat/tests/movenc.c
index 12a3632d4e..2fd5c67e76 100644
--- a/libavformat/tests/movenc.c
+++ b/libavformat/tests/movenc.c
@@ -58,7 +58,7 @@ struct AVMD5* md5;
 uint8_t hash[HASH_SIZE];
 
 AVPacket *pkt;
-AVStream *video_st, *audio_st;
+AVStream *video_st, *audio_st, *id3_st;
 int64_t audio_dts, video_dts;
 
 int bframes;
@@ -177,7 +177,7 @@ static void check_func(int value, int line, const char 
*msg, ...)
 }
 #define check(value, ...) check_func(value, __LINE__, __VA_ARGS__)
 
-static void init_fps(int bf, int audio_preroll, int fps)
+static void init_fps(int bf, int audio_preroll, int fps, int id3)
 {
 AVStream *st;
 int iobuf_size = 

[FFmpeg-devel] [PATCH] tests/movenc: Validate that normal muxer usage doesn't print warnings

2024-04-04 Thread Martin Storsjö
We have tests to make sure that certain configurations do print
warnings. However, the normal operation of the muxer within this
test always printed a warning, so those tests checking for
extra warnings essentially didn't guard anything.

The warning that always was printed, "track 1: codec frame size is
not set" was not present in the libav fork where this testcase
originated, it was removed in f234e8a32e6c69d7b63f8627f278be7c2c987f43.

Set the frame size for the audio stream to silence the warning,
and use this frame size in a couple later calculations, and check
that one test configuration doesn't print warnings.

Setting the frame size apparently changes the rounding of a timestamp
in the ismv muxing testcase.
---
 libavformat/tests/movenc.c | 10 --
 tests/ref/fate/movenc  |  2 +-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/libavformat/tests/movenc.c b/libavformat/tests/movenc.c
index 77f73abdfa..12a3632d4e 100644
--- a/libavformat/tests/movenc.c
+++ b/libavformat/tests/movenc.c
@@ -215,6 +215,7 @@ static void init_fps(int bf, int audio_preroll, int fps)
 st->codecpar->codec_type = AVMEDIA_TYPE_AUDIO;
 st->codecpar->codec_id = AV_CODEC_ID_AAC;
 st->codecpar->sample_rate = 44100;
+st->codecpar->frame_size = 1024;
 st->codecpar->ch_layout = (AVChannelLayout)AV_CHANNEL_LAYOUT_STEREO;
 st->time_base.num = 1;
 st->time_base.den = 44100;
@@ -232,9 +233,10 @@ static void init_fps(int bf, int audio_preroll, int fps)
 frames = 0;
 gop_size = 30;
 duration = video_st->time_base.den / fps;
-audio_duration = 1024LL * audio_st->time_base.den / audio_st->codecpar->sample_rate;
+audio_duration = (long long)audio_st->codecpar->frame_size *
+ audio_st->time_base.den / audio_st->codecpar->sample_rate;
 if (audio_preroll)
-audio_preroll = 2048LL * audio_st->time_base.den / audio_st->codecpar->sample_rate;
+audio_preroll = 2 * audio_duration;
 
 bframes = bf;
 video_dts = bframes ? -duration : 0;
@@ -442,6 +444,7 @@ int main(int argc, char **argv)
 // Similar to the previous one, but with input that doesn't start at
 // pts/dts 0. avoid_negative_ts behaves in the same way as
 // in non-empty-moov-no-elst above.
+init_count_warnings();
 init_out("empty-moov-no-elst");
 av_dict_set(&opts, "movflags", "+frag_keyframe+empty_moov", 0);
 init(1, 0);
@@ -449,6 +452,9 @@ int main(int argc, char **argv)
 finish();
 close_out();
 
+reset_count_warnings();
+check(num_warnings == 0, "Unexpected warnings printed");
+
 // Same as the previous one, but disable avoid_negative_ts (which
 // would require using an edit list, but with empty_moov, one can't
 // write a sensible edit list, when the start timestamps aren't known).
diff --git a/tests/ref/fate/movenc b/tests/ref/fate/movenc
index 968a3d27f2..0c77f5187c 100644
--- a/tests/ref/fate/movenc
+++ b/tests/ref/fate/movenc
@@ -20,7 +20,7 @@ write_data len 828, time nopts, type unknown atom -
 write_data len 728, time 99, type sync atom moof
 write_data len 812, time nopts, type unknown atom -
 write_data len 148, time nopts, type trailer atom -
-92ce825ff40505ec8676191705adb7e7 4439 ismv
+d2df24d323f4a8896441cd91203ac5f8 4439 ismv
 write_data len 36, time nopts, type header atom ftyp
 write_data len 1123, time nopts, type header atom -
 write_data len 796, time 0, type sync atom moof
-- 
2.39.3 (Apple Git-146)
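
For reference, the duration arithmetic that the patch switches to can be sketched on its own. This is a minimal illustration, not the test's code: `frames_to_ticks` and `preroll_ticks` are hypothetical helper names, and the time base is assumed to be 1/time_base_den as in init_fps().

```c
#include <stdint.h>

/* Duration of one audio frame in time base ticks, for a time base of
 * 1/time_base_den. */
static int64_t frames_to_ticks(int frame_size, int time_base_den, int sample_rate)
{
    /* Widen before multiplying so the product cannot overflow int. */
    return (int64_t)frame_size * time_base_den / sample_rate;
}

/* Preroll as in the patch: two full frames. */
static int64_t preroll_ticks(int frame_size, int time_base_den, int sample_rate)
{
    return 2 * frames_to_ticks(frame_size, time_base_den, sample_rate);
}
```

With the values used in the test (frame size 1024, sample rate 44100, time base 1/44100) this gives 1024 and 2048 ticks, the same as the old hardcoded constants. In general, doubling a truncated quotient is not identical to truncating the doubled quotient, which only matters when the time base is not a multiple of the sample rate.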



[FFmpeg-devel] [PATCH] movenc: Remove a leftover commented out line

2024-04-04 Thread Martin Storsjö
This line originates from 6f69f7a8bf6a0d013985578df2ef42ee6b1c7994.
---
 libavformat/movenc.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/libavformat/movenc.c b/libavformat/movenc.c
index 46a5b3a62f..ccdd2dbfc9 100644
--- a/libavformat/movenc.c
+++ b/libavformat/movenc.c
@@ -1173,8 +1173,6 @@ static int get_samples_per_packet(MOVTrack *track)
 {
 int i, first_duration;
 
-// return track->par->frame_size;
-
 /* use 1 for raw PCM */
 if (!track->audio_vbr)
 return 1;
-- 
2.39.3 (Apple Git-146)



[FFmpeg-devel] [GASPP PATCH] Implicitly start out in the text section for armasm

2024-04-03 Thread Martin Storsjö
This fixes assembling files starting with bare symbol declarations,
without explicitly switching to .text first.
---
 gas-preprocessor.pl | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gas-preprocessor.pl b/gas-preprocessor.pl
index 2880858..b66181a 100755
--- a/gas-preprocessor.pl
+++ b/gas-preprocessor.pl
@@ -289,6 +289,9 @@ my %aarch64_req_alias;
 if ($force_thumb) {
 parse_line(".thumb\n");
 }
+if ($as_type eq "armasm") {
+parse_line(".text\n");
+}
 
 # pass 1: parse .macro
 # note that the handling of arguments is probably overly permissive vs. gas
-- 
2.34.1



Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions

2024-03-26 Thread Martin Storsjö

On Tue, 26 Mar 2024, Jean-Baptiste Kempf wrote:


On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:

On Mon, 25 Mar 2024, Martin Storsjö wrote:


Since some time, we have pretty complete AArch64 NEON coverage
for the hevc decoder.

However, some of these functions require the I8MM instruction set
extension, and many of them (but not all) lack a plain NEON
version.

This patchset fills in a regular NEON version of all functions
where we have an I8MM function.

For context; the I8MM instruction set extension is a mandatory
part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
but Apple M1 and Ampere Altra don't.

This patchset takes decoding of a 1080p HEVC clip from 402
fps to 649 fps on an Apple M1.

Patch #2 also fixes a subtle bug in the existing implementation;
two functions relied on the contents on the stack, below the
stack pointer, being untouched within a function. If a signal
gets delivered, those parts of the stack could be clobbered.


I know this is a bit short notice for a patchset of this size - but, would 
people be OK with merging this patchset before the impending 7.0 branch (which 
is made within the next 24h)?

The patches pass all my tricky build configurations, they give a very 
non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in 
the existing implementations. (A bug fix patch can of course be backported 
after the branch too, but performance optimizations aren't generally relevant 
for backporting.)

// Martin


Yes, please. I will tomorrow morning if you didn’t already push.


+1


Thanks, I pushed this set now.

// Martin


Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions

2024-03-25 Thread Martin Storsjö

On Mon, 25 Mar 2024, Martin Storsjö wrote:


Since some time, we have pretty complete AArch64 NEON coverage
for the hevc decoder.

However, some of these functions require the I8MM instruction set
extension, and many of them (but not all) lack a plain NEON
version.

This patchset fills in a regular NEON version of all functions
where we have an I8MM function.

For context; the I8MM instruction set extension is a mandatory
part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
but Apple M1 and Ampere Altra don't.

This patchset takes decoding of a 1080p HEVC clip from 402
fps to 649 fps on an Apple M1.

Patch #2 also fixes a subtle bug in the existing implementation;
two functions relied on the contents on the stack, below the
stack pointer, being untouched within a function. If a signal
gets delivered, those parts of the stack could be clobbered.


I know this is a bit short notice for a patchset of this size - but, would 
people be OK with merging this patchset before the impending 7.0 branch 
(which is made within the next 24h)?


The patches pass all my tricky build configurations, they give a very 
non-negligible speedup on many common CPUs, and patch #2 fixes a real bug 
in the existing implementations. (A bug fix patch can of course be 
backported after the branch too, but performance optimizations aren't 
generally relevant for backporting.)
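
To illustrate why the plain NEON versions matter on CPUs without I8MM, here is a sketch of the usual dispatch pattern: the baseline is assigned first and then overridden by more specific versions when the CPU flags allow. All names here (the DEMO_ flags, `select_qpel`, the three stub functions) are hypothetical and only stand in for the real init code.

```c
/* Hypothetical CPU flags and implementation IDs, for illustration only. */
#define DEMO_CPU_FLAG_NEON (1 << 0)
#define DEMO_CPU_FLAG_I8MM (1 << 1)

enum { IMPL_C, IMPL_NEON, IMPL_I8MM };

typedef int (*qpel_fn)(void);

static int qpel_c(void)    { return IMPL_C; }
static int qpel_neon(void) { return IMPL_NEON; }
static int qpel_i8mm(void) { return IMPL_I8MM; }

/* Pick the best available implementation: C by default, plain NEON if
 * available, and the I8MM version on top of that if the CPU has it. */
static qpel_fn select_qpel(int cpu_flags)
{
    qpel_fn fn = qpel_c;

    if (cpu_flags & DEMO_CPU_FLAG_NEON) {
        fn = qpel_neon;
        if (cpu_flags & DEMO_CPU_FLAG_I8MM)
            fn = qpel_i8mm;
    }
    return fn;
}
```

Without a plain NEON version, a CPU such as the Apple M1 would fall all the way back to the C function for those cases.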


// Martin


[FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv

2024-03-25 Thread Martin Storsjö
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.
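
The size of the temporary array can be worked out as follows. This is a sketch under assumptions: MAX_PB_SIZE is taken to be 64 and the intermediate samples to be int16_t, which matches the "lsl #7" (128 bytes per row) in the assembly; `tmp_array_bytes` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64 /* assumed; 64 * sizeof(int16_t) = 128 bytes per row */

/* Bytes needed for the intermediate rows of the horizontal pass:
 * height output rows plus extra_rows of filter margin. */
static size_t tmp_array_bytes(int height, int extra_rows)
{
    return (size_t)(height + extra_rows) * MAX_PB_SIZE * sizeof(int16_t);
}
```

For height 4 this is 1408 bytes with h+7 rows and 1536 bytes with h+8 rows. Since the amount subtracted from sp no longer matches what the shared tail code used to add back, restoring sp from the saved value avoids recomputing it.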

AWS Graviton 3:
put_hevc_qpel_bi_hv4_8_c: 385.7
put_hevc_qpel_bi_hv4_8_neon: 131.0
put_hevc_qpel_bi_hv4_8_i8mm: 92.2
put_hevc_qpel_bi_hv6_8_c: 701.0
put_hevc_qpel_bi_hv6_8_neon: 239.5
put_hevc_qpel_bi_hv6_8_i8mm: 191.0
put_hevc_qpel_bi_hv8_8_c: 1162.0
put_hevc_qpel_bi_hv8_8_neon: 228.0
put_hevc_qpel_bi_hv8_8_i8mm: 225.2
put_hevc_qpel_bi_hv12_8_c: 2305.0
put_hevc_qpel_bi_hv12_8_neon: 558.0
put_hevc_qpel_bi_hv12_8_i8mm: 483.2
put_hevc_qpel_bi_hv16_8_c: 3965.2
put_hevc_qpel_bi_hv16_8_neon: 732.7
put_hevc_qpel_bi_hv16_8_i8mm: 656.5
put_hevc_qpel_bi_hv24_8_c: 8709.7
put_hevc_qpel_bi_hv24_8_neon: 1555.2
put_hevc_qpel_bi_hv24_8_i8mm: 1448.7
put_hevc_qpel_bi_hv32_8_c: 14818.0
put_hevc_qpel_bi_hv32_8_neon: 2763.7
put_hevc_qpel_bi_hv32_8_i8mm: 2468.0
put_hevc_qpel_bi_hv48_8_c: 32855.5
put_hevc_qpel_bi_hv48_8_neon: 6107.2
put_hevc_qpel_bi_hv48_8_i8mm: 5452.7
put_hevc_qpel_bi_hv64_8_c: 57591.5
put_hevc_qpel_bi_hv64_8_neon: 10660.2
put_hevc_qpel_bi_hv64_8_i8mm: 9580.0
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 164 +-
 2 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index e9ee901322..e24dd0cbda 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -319,6 +319,10 @@ NEON8_FNPROTO(qpel_bi_v, (uint8_t *dst, ptrdiff_t 
dststride,
 const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
 int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
+const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
+int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
 const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
 int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -452,6 +456,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
 NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
 NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv,);
+NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv,);
 
 if (have_i8mm(cpu_flags)) {
 NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index df7032b692..8ddaa32b70 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4590,14 +4590,6 @@ endfunc
 
 qpel_uni_w_hv neon
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-qpel_uni_w_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
 function hevc_put_hevc_qpel_bi_hv4_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x7, x6
@@ -4620,7 +4612,8 @@ function hevc_put_hevc_qpel_bi_hv4_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_bi_hv6_8_end_neon
@@ -4650,7 +4643,8 @@ function hevc_put_hevc_qpel_bi_hv6_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_bi_hv8_8_end_neon
@@ -4678,7 +4672,8 @@ function hevc_put_hevc_qpel_bi_hv8_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_bi_hv16_8_end_neon
@@ -4723,83 +4718,87 @@ function hevc_put_hevc_qpel_bi_hv16_8_end_neon
 subs x10, x10, #16
 add x4, x4, #32
 b.ne 0b
-add w10, w5, #7
-lsl x10, x10, #7
-sub x10, x10, x6, lsl #1 // part of first line
-add sp, sp, x10 // tmp_array without first line
+mov sp, x14
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
-add w10, w5, #7
+.macro qpel_bi_hv suffix
+function ff_hevc_put_hevc_qpel_bi_hv4_8_\suffix, export=1
+add w10, w5, #8
 lsl x10, x10, #7
+mov x14, sp
 sub sp, sp, x10 // tmp_array
-stp x7, x30, [sp, #-48]!
+stp x7, x30, [sp, #-64]!
 stp x4, x5, [sp, #16]
 stp x0, x1, [sp, #32]
+str x14,[sp, 

[FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv

2024-03-25 Thread Martin Storsjö
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

AWS Graviton 3:
put_hevc_qpel_uni_w_hv4_8_c: 422.2
put_hevc_qpel_uni_w_hv4_8_neon: 140.7
put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7
put_hevc_qpel_uni_w_hv8_8_c: 1208.0
put_hevc_qpel_uni_w_hv8_8_neon: 268.2
put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5
put_hevc_qpel_uni_w_hv16_8_c: 4297.2
put_hevc_qpel_uni_w_hv16_8_neon: 802.2
put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2
put_hevc_qpel_uni_w_hv32_8_c: 15518.5
put_hevc_qpel_uni_w_hv32_8_neon: 3085.2
put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2
put_hevc_qpel_uni_w_hv64_8_c: 57254.5
put_hevc_qpel_uni_w_hv64_8_neon: 11787.5
put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  6 +++
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 47 +++
 2 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 0531db027b..e9ee901322 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -305,6 +305,11 @@ NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t 
_dststride,
 int height, int denom, int wx, int ox,
 intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO_PARTIAL_5(qpel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t _dststride,
+const uint8_t *_src, ptrdiff_t _srcstride,
+int height, int denom, int wx, int ox,
+intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO_PARTIAL_5(qpel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t _dststride,
 const uint8_t *_src, ptrdiff_t _srcstride,
 int height, int denom, int wx, int ox,
@@ -446,6 +451,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 
 NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
 NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
+NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv,);
 
 if (have_i8mm(cpu_flags)) {
 NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index f285ab7461..df7032b692 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4164,7 +4164,7 @@ qpel_hv neon_i8mm
 DISABLE_I8MM
 #endif
 
-.macro QPEL_UNI_W_HV_HEADER width
+.macro QPEL_UNI_W_HV_HEADER width, suffix
 ldp x14, x15, [sp]  // mx, my
 ldr w13, [sp, #16]  // width
 stp x19, x30, [sp, #-80]!
@@ -4173,7 +4173,7 @@ DISABLE_I8MM
 stp x24, x25, [sp, #48]
 stp x26, x27, [sp, #64]
 mov x19, sp
-mov x11, #9088
+mov x11, #(MAX_PB_SIZE*(MAX_PB_SIZE+8)*2)
 sub sp, sp, x11
 mov x20, x0
 mov x21, x1
@@ -4190,7 +4190,16 @@ DISABLE_I8MM
 mov w26, #-6
 sub w26, w26, w5 // -shift
 mov w27, w13 // width
-bl  X(ff_hevc_put_hevc_qpel_h\width\()_8_neon_i8mm)
+.ifc \suffix, neon
+.if \width >= 32
+mov w6,  #\width
+bl  X(ff_hevc_put_hevc_qpel_h32_8_neon)
+.else
+bl  X(ff_hevc_put_hevc_qpel_h\width\()_8_\suffix)
+.endif
+.else
+bl  X(ff_hevc_put_hevc_qpel_h\width\()_8_\suffix)
+.endif
 movrel  x9, qpel_filters
 add x9, x9, x23, lsl #3
 ld1 {v0.8b}, [x9]
@@ -4552,33 +4561,39 @@ function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
-QPEL_UNI_W_HV_HEADER 4
+.macro qpel_uni_w_hv suffix
+function ff_hevc_put_hevc_qpel_uni_w_hv4_8_\suffix, export=1
+QPEL_UNI_W_HV_HEADER 4, \suffix
 b   hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
-QPEL_UNI_W_HV_HEADER 8
+function ff_hevc_put_hevc_qpel_uni_w_hv8_8_\suffix, export=1
+QPEL_UNI_W_HV_HEADER 8, \suffix
 b   hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
-QPEL_UNI_W_HV_HEADER 16
+function ff_hevc_put_hevc_qpel_uni_w_hv16_8_\suffix, export=1
+QPEL_UNI_W_HV_HEADER 16, \suffix
 b   hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
-QPEL_UNI_W_HV_HEADER 32
+function ff_hevc_put_hevc_qpel_uni_w_hv32_8_\suffix, export=1
+QPEL_UNI_W_HV_HEADER 32, \suffix
 b   hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 

[FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv

2024-03-25 Thread Martin Storsjö
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

With storage allocated for h+8 rows, incrementing the stack
pointer row by row no longer ends up at the right spot. Keep the
intended final stack pointer value in register x14, which we also
store on the stack.

AWS Graviton 3:
put_hevc_qpel_uni_hv4_8_c: 384.2
put_hevc_qpel_uni_hv4_8_neon: 127.5
put_hevc_qpel_uni_hv4_8_i8mm: 85.5
put_hevc_qpel_uni_hv6_8_c: 705.5
put_hevc_qpel_uni_hv6_8_neon: 224.5
put_hevc_qpel_uni_hv6_8_i8mm: 176.2
put_hevc_qpel_uni_hv8_8_c: 1136.5
put_hevc_qpel_uni_hv8_8_neon: 216.5
put_hevc_qpel_uni_hv8_8_i8mm: 214.0
put_hevc_qpel_uni_hv12_8_c: 2259.5
put_hevc_qpel_uni_hv12_8_neon: 498.5
put_hevc_qpel_uni_hv12_8_i8mm: 410.7
put_hevc_qpel_uni_hv16_8_c: 3824.7
put_hevc_qpel_uni_hv16_8_neon: 670.0
put_hevc_qpel_uni_hv16_8_i8mm: 603.7
put_hevc_qpel_uni_hv24_8_c: 8113.5
put_hevc_qpel_uni_hv24_8_neon: 1474.7
put_hevc_qpel_uni_hv24_8_i8mm: 1351.5
put_hevc_qpel_uni_hv32_8_c: 14744.5
put_hevc_qpel_uni_hv32_8_neon: 2599.7
put_hevc_qpel_uni_hv32_8_i8mm: 2266.0
put_hevc_qpel_uni_hv48_8_c: 32800.0
put_hevc_qpel_uni_hv48_8_neon: 5650.0
put_hevc_qpel_uni_hv48_8_i8mm: 5011.7
put_hevc_qpel_uni_hv64_8_c: 57856.2
put_hevc_qpel_uni_hv64_8_neon: 9863.5
put_hevc_qpel_uni_hv64_8_i8mm: 8767.7
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 156 ++
 2 files changed, 102 insertions(+), 59 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 105c26017b..0531db027b 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -277,6 +277,10 @@ NEON8_FNPROTO(qpel_uni_v, (uint8_t *dst,  ptrdiff_t 
dststride,
 const uint8_t *src, ptrdiff_t srcstride,
 int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst,  ptrdiff_t dststride,
+const uint8_t *src, ptrdiff_t srcstride,
+int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst,  ptrdiff_t dststride,
 const uint8_t *src, ptrdiff_t srcstride,
 int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -441,6 +445,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
 
 NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
+NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
 
 if (have_i8mm(cpu_flags)) {
 NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 7bffb991a7..f285ab7461 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2169,7 +2169,8 @@ function hevc_put_hevc_qpel_uni_hv4_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv6_8_end_neon
@@ -2198,7 +2199,8 @@ function hevc_put_hevc_qpel_uni_hv6_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv8_8_end_neon
@@ -2225,7 +2227,8 @@ function hevc_put_hevc_qpel_uni_hv8_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv12_8_end_neon
@@ -2252,7 +2255,8 @@ function hevc_put_hevc_qpel_uni_hv12_8_end_neon
 .endm
 1:  calc_all2
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv16_8_end_neon
@@ -2286,21 +2290,17 @@ function hevc_put_hevc_qpel_uni_hv16_8_end_neon
 add sp, sp, #32
 subs w7, w7, #16
 b.ne 0b
-add w10, w4, #6
-add sp, sp, x12 // discard rest of first line
-lsl x10, x10, #7
-add sp, sp, x10 // tmp_array without first line
+mov sp, x14
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
-add w10, w4, #7
+.macro qpel_uni_hv suffix
+function ff_hevc_put_hevc_qpel_uni_hv4_8_\suffix, export=1
+add w10, w4, #8
 lsl x10, x10, #7
+mov x14, sp
 sub sp, sp, x10 // tmp_array
-str x30, [sp, #-48]!
+stp x30, x14,[sp, #-48]!
 stp x4, x6, [sp, #16]
 stp x0, x1, [sp, #32]
 sub x1, x2, x3, lsl #1
@@ -2309,18 +2309,19 @@ function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, 
export=1
 mov  

[FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv

2024-03-25 Thread Martin Storsjö
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

With storage allocated for h+8 rows, incrementing the stack
pointer row by row no longer ends up at the right spot. Keep the
intended final stack pointer value in register x14, which we also
store on the stack.

AWS Graviton 3:
put_hevc_qpel_hv4_8_c: 386.0
put_hevc_qpel_hv4_8_neon: 125.7
put_hevc_qpel_hv4_8_i8mm: 83.2
put_hevc_qpel_hv6_8_c: 749.0
put_hevc_qpel_hv6_8_neon: 207.0
put_hevc_qpel_hv6_8_i8mm: 166.0
put_hevc_qpel_hv8_8_c: 1305.2
put_hevc_qpel_hv8_8_neon: 216.5
put_hevc_qpel_hv8_8_i8mm: 213.0
put_hevc_qpel_hv12_8_c: 2570.5
put_hevc_qpel_hv12_8_neon: 480.0
put_hevc_qpel_hv12_8_i8mm: 398.2
put_hevc_qpel_hv16_8_c: 4158.7
put_hevc_qpel_hv16_8_neon: 659.7
put_hevc_qpel_hv16_8_i8mm: 593.5
put_hevc_qpel_hv24_8_c: 8626.7
put_hevc_qpel_hv24_8_neon: 1653.5
put_hevc_qpel_hv24_8_i8mm: 1398.7
put_hevc_qpel_hv32_8_c: 14646.0
put_hevc_qpel_hv32_8_neon: 2566.2
put_hevc_qpel_hv32_8_i8mm: 2287.5
put_hevc_qpel_hv48_8_c: 31072.5
put_hevc_qpel_hv48_8_neon: 6228.5
put_hevc_qpel_hv48_8_i8mm: 5291.0
put_hevc_qpel_hv64_8_c: 53847.2
put_hevc_qpel_hv64_8_neon: 9856.7
put_hevc_qpel_hv64_8_i8mm: 8831.0
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 166 +-
 2 files changed, 104 insertions(+), 68 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index ea0d26c019..105c26017b 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -265,6 +265,10 @@ NEON8_FNPROTO(qpel_v, (int16_t *dst,
 const uint8_t *src, ptrdiff_t srcstride,
 int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(qpel_hv, (int16_t *dst,
+const uint8_t *src, ptrdiff_t srcstride,
+int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_hv, (int16_t *dst,
 const uint8_t *src, ptrdiff_t srcstride,
 int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -436,6 +440,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 
 NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
 
+NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
+
 if (have_i8mm(cpu_flags)) {
 NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index ad568e415b..7bffb991a7 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -3804,7 +3804,8 @@ function hevc_put_hevc_qpel_hv4_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_hv6_8_end_neon
@@ -3831,7 +3832,8 @@ function hevc_put_hevc_qpel_hv6_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_hv8_8_end_neon
@@ -3857,7 +3859,8 @@ function hevc_put_hevc_qpel_hv8_8_end_neon
 .endm
 1:  calc_all
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_hv12_8_end_neon
@@ -3882,7 +3885,8 @@ function hevc_put_hevc_qpel_hv12_8_end_neon
 .endm
 1:  calc_all2
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_hv16_8_end_neon
@@ -3906,7 +3910,8 @@ function hevc_put_hevc_qpel_hv16_8_end_neon
 .endm
 1:  calc_all2
 .purgem calc
-2:  ret
+2:  mov sp, x14
+ret
 endfunc
 
 function hevc_put_hevc_qpel_hv32_8_end_neon
@@ -3937,162 +3942,187 @@ function hevc_put_hevc_qpel_hv32_8_end_neon
 add sp, sp, #32
 subs w6, w6, #16
 b.hi 0b
-add w10, w3, #6
-add sp, sp, #64  // discard rest of first line
-lsl x10, x10, #7
-add sp, sp, x10 // tmp_array without first line
+mov sp, x14
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
-add w10, w3, #7
+.macro qpel_hv suffix
+function ff_hevc_put_hevc_qpel_hv4_8_\suffix, export=1
+add w10, w3, #8
 mov x7, #128
 lsl x10, x10, #7
+mov x14, sp
 sub sp, sp, x10 // tmp_array
-stp x5, x30, [sp, #-32]!
-stp x0, x3, [sp, #16]
-add x0, sp, #32
+stp x5,  x30, [sp, #-48]!
+stp x0,  x3,  [sp, #16]
+str x14,  [sp, #32]
+add x0, sp, #48
 

[FFmpeg-devel] [PATCH 16/21] aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions

2024-03-25 Thread Martin Storsjö
The hv32 and hv64 functions were identical - both loop and
process 16 pixels at a time.

The hv16 function was near identical, except for the outer loop
(and using sp instead of a separate register).

Given the size of these functions, the extra cost of the outer
loop is negligible, so use the same function for hv16 as well.

This removes over 200 lines of duplicated assembly, and over 4 KB
of binary size.
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 220 +
 1 file changed, 3 insertions(+), 217 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index c04e8dbea8..06832603d9 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4381,231 +4381,17 @@ function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, 
export=1
 b   hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
-function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-ldp q16, q1, [sp]
-add sp, sp, x10
-ldp q17, q2, [sp]
-add sp, sp, x10
-ldp q18, q3, [sp]
-add sp, sp, x10
-ldp q19, q4, [sp]
-add sp, sp, x10
-ldp q20, q5, [sp]
-add sp, sp, x10
-ldp q21, q6, [sp]
-add sp, sp, x10
-ldp q22, q7, [sp]
-add sp, sp, x10
-1:
-ldp q23, q31, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v16, v17, v18, v19, v20, v21, v22, v23
-QPEL_FILTER_H2  v25, v16, v17, v18, v19, v20, v21, v22, v23
-QPEL_FILTER_H   v26,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
-QPEL_FILTER_H2  v27,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
-QPEL_UNI_W_HV_16
-subs w22, w22, #1
-b.eq 2f
-
-ldp q16, q1, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v17, v18, v19, v20, v21, v22, v23, v16
-QPEL_FILTER_H2  v25, v17, v18, v19, v20, v21, v22, v23, v16
-QPEL_FILTER_H   v26,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
-QPEL_FILTER_H2  v27,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
-QPEL_UNI_W_HV_16
-subs w22, w22, #1
-b.eq 2f
-
-ldp q17, q2, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v18, v19, v20, v21, v22, v23, v16, v17
-QPEL_FILTER_H2  v25, v18, v19, v20, v21, v22, v23, v16, v17
-QPEL_FILTER_H   v26,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
-QPEL_FILTER_H2  v27,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
-QPEL_UNI_W_HV_16
-subs w22, w22, #1
-b.eq 2f
-
-ldp q18, q3, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v19, v20, v21, v22, v23, v16, v17, v18
-QPEL_FILTER_H2  v25, v19, v20, v21, v22, v23, v16, v17, v18
-QPEL_FILTER_H   v26,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
-QPEL_FILTER_H2  v27,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
-QPEL_UNI_W_HV_16
-subs w22, w22, #1
-b.eq 2f
-
-ldp q19, q4, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v20, v21, v22, v23, v16, v17, v18, v19
-QPEL_FILTER_H2  v25, v20, v21, v22, v23, v16, v17, v18, v19
-QPEL_FILTER_H   v26,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
-QPEL_FILTER_H2  v27,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
-QPEL_UNI_W_HV_16
-subs w22, w22, #1
-b.eq 2f
-
-ldp q20, q5, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v21, v22, v23, v16, v17, v18, v19, v20
-QPEL_FILTER_H2  v25, v21, v22, v23, v16, v17, v18, v19, v20
-QPEL_FILTER_H   v26,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
-QPEL_FILTER_H2  v27,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
-QPEL_UNI_W_HV_16
-subs w22, w22, #1
-b.eq 2f
-
-ldp q21, q6, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v22, v23, v16, v17, v18, v19, v20, v21
-QPEL_FILTER_H2  v25, v22, v23, v16, v17, v18, v19, v20, v21
-QPEL_FILTER_H   v26,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
-QPEL_FILTER_H2  v27,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
-QPEL_UNI_W_HV_16
-subs w22, w22, #1
-b.eq 2f
-
-ldp q22, q7, [sp]
-add sp, sp, x10
-QPEL_FILTER_H   v24, v23, v16, v17, v18, v19, v20, v21, v22
-QPEL_FILTER_H2  v25, v23, v16, v17, v18, v19, v20, v21, v22
-QPEL_FILTER_H   v26, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
-QPEL_FILTER_H2  v27, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
-

[FFmpeg-devel] [PATCH 17/21] aarch64: hevc: Reorder qpel_hv functions to prepare for templating

2024-03-25 Thread Martin Storsjö
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 695 +
 1 file changed, 355 insertions(+), 340 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 06832603d9..ad568e415b 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2146,29 +2146,6 @@ function ff_hevc_put_hevc_qpel_uni_w_v64_8_neon, export=1
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
-add w10, w4, #7
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-str x30, [sp, #-48]!
-stp x4, x6, [sp, #16]
-stp x0, x1, [sp, #32]
-sub x1, x2, x3, lsl #1
-sub x1, x1, x3
-add x0, sp, #48
-mov x2, x3
-add x3, x4, #7
-mov x4, x5
-bl  X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
-ldp x4, x6, [sp, #16]
-ldp x0, x1, [sp, #32]
-ldr x30, [sp], #48
-b   hevc_put_hevc_qpel_uni_hv4_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv4_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
@@ -2195,26 +2172,6 @@ function hevc_put_hevc_qpel_uni_hv4_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
-add w10, w4, #7
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-str x30, [sp, #-48]!
-stp x4, x6, [sp, #16]
-stp x0, x1, [sp, #32]
-sub x1, x2, x3, lsl #1
-sub x1, x1, x3
-add x0, sp, #48
-mov x2, x3
-add w3, w4, #7
-mov x4, x5
-bl  X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
-ldp x4, x6, [sp, #16]
-ldp x0, x1, [sp, #32]
-ldr x30, [sp], #48
-b   hevc_put_hevc_qpel_uni_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv6_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
@@ -2244,26 +2201,6 @@ function hevc_put_hevc_qpel_uni_hv6_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
-add w10, w4, #7
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-str x30, [sp, #-48]!
-stp x4, x6, [sp, #16]
-stp x0, x1, [sp, #32]
-sub x1, x2, x3, lsl #1
-sub x1, x1, x3
-add x0, sp, #48
-mov x2, x3
-add w3, w4, #7
-mov x4, x5
-bl  X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
-ldp x4, x6, [sp, #16]
-ldp x0, x1, [sp, #32]
-ldr x30, [sp], #48
-b   hevc_put_hevc_qpel_uni_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv8_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
@@ -2291,26 +2228,6 @@ function hevc_put_hevc_qpel_uni_hv8_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
-add w10, w4, #7
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x7, x30, [sp, #-48]!
-stp x4, x6, [sp, #16]
-stp x0, x1, [sp, #32]
-sub x1, x2, x3, lsl #1
-sub x1, x1, x3
-mov x2, x3
-add x0, sp, #48
-add w3, w4, #7
-mov x4, x5
-bl  X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
-ldp x4, x6, [sp, #16]
-ldp x0, x1, [sp, #32]
-ldp x7, x30, [sp], #48
-b   hevc_put_hevc_qpel_uni_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv12_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
@@ -2338,26 +2255,6 @@ function hevc_put_hevc_qpel_uni_hv12_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
-add w10, w4, #7
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x7, x30, [sp, #-48]!
-stp x4, x6, [sp, #16]
-stp x0, x1, [sp, #32]
-add x0, sp, #48
-sub

[FFmpeg-devel] [PATCH 15/21] aarch64: hevc: Split the qpel_*_hv functions into two parts

2024-03-25 Thread Martin Storsjö
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 94 +++---
 1 file changed, 86 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index fba063186c..c04e8dbea8 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2166,6 +2166,10 @@ function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
+b   hevc_put_hevc_qpel_uni_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv4_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
 ldr d16, [sp]
@@ -2208,6 +2212,10 @@ function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
+b   hevc_put_hevc_qpel_uni_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv6_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
 sub x1, x1, #4
@@ -2253,6 +2261,10 @@ function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
+b   hevc_put_hevc_qpel_uni_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv8_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
 ldr q16, [sp]
@@ -2296,6 +2308,10 @@ function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
+b   hevc_put_hevc_qpel_uni_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv12_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
 sub x1, x1, #8
@@ -2339,7 +2355,10 @@ function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
-.Lqpel_uni_hv16_loop:
+b   hevc_put_hevc_qpel_uni_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv16_8_end_neon
 mov x9, #(MAX_PB_SIZE * 2)
 load_qpel_filterh x6, x5
 sub w12, w9, w7, lsl #1
@@ -2414,7 +2433,7 @@ function ff_hevc_put_hevc_qpel_uni_hv32_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
-b   .Lqpel_uni_hv16_loop
+b   hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 
 function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, export=1
@@ -2434,7 +2453,7 @@ function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
-b   .Lqpel_uni_hv16_loop
+b   hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 
 function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
@@ -2454,7 +2473,7 @@ function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
-b   .Lqpel_uni_hv16_loop
+b   hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 DISABLE_I8MM
 #endif
@@ -3776,6 +3795,10 @@ function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_qpel_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv4_8_end_neon
 load_qpel_filterh x5, x4
 ldr d16, [sp]
 ldr d17, [sp, x7]
@@ -3813,6 +3836,10 @@ function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_qpel_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv6_8_end_neon
 mov x8, #120
 load_qpel_filterh x5, x4
 ldr q16, [sp]
@@ -3852,6 +3879,10 @@ function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_qpel_hv8_8_end_neon

[FFmpeg-devel] [PATCH 13/21] aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm

2024-03-25 Thread Martin Storsjö
In addition to just templating, this contains one change to
ff_hevc_put_hevc_epel_bi_hv32_8, by setting the w6 register
which ff_hevc_put_hevc_epel_h32_8_neon requires.

AWS Graviton 3:
put_hevc_epel_bi_hv4_8_c: 176.5
put_hevc_epel_bi_hv4_8_neon: 62.0
put_hevc_epel_bi_hv4_8_i8mm: 58.0
put_hevc_epel_bi_hv6_8_c: 343.7
put_hevc_epel_bi_hv6_8_neon: 109.7
put_hevc_epel_bi_hv6_8_i8mm: 105.7
put_hevc_epel_bi_hv8_8_c: 536.0
put_hevc_epel_bi_hv8_8_neon: 112.7
put_hevc_epel_bi_hv8_8_i8mm: 111.7
put_hevc_epel_bi_hv12_8_c: 1107.7
put_hevc_epel_bi_hv12_8_neon: 254.7
put_hevc_epel_bi_hv12_8_i8mm: 239.0
put_hevc_epel_bi_hv16_8_c: 1927.7
put_hevc_epel_bi_hv16_8_neon: 356.2
put_hevc_epel_bi_hv16_8_i8mm: 334.2
put_hevc_epel_bi_hv24_8_c: 4195.2
put_hevc_epel_bi_hv24_8_neon: 736.7
put_hevc_epel_bi_hv24_8_i8mm: 715.5
put_hevc_epel_bi_hv32_8_c: 7280.5
put_hevc_epel_bi_hv32_8_neon: 1287.7
put_hevc_epel_bi_hv32_8_i8mm: 1162.2
put_hevc_epel_bi_hv48_8_c: 16857.7
put_hevc_epel_bi_hv48_8_neon: 2836.2
put_hevc_epel_bi_hv48_8_i8mm: 2908.5
put_hevc_epel_bi_hv64_8_c: 29248.2
put_hevc_epel_bi_hv64_8_neon: 5051.7
put_hevc_epel_bi_hv64_8_i8mm: 4491.5
---
 libavcodec/aarch64/hevcdsp_epel_neon.S| 62 +++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  5 ++
 2 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index d0c6205e1c..cb17758a72 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -3792,14 +3792,6 @@ endfunc
 
 epel_uni_w_hv neon
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-epel_uni_w_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
 function hevc_put_hevc_epel_bi_hv4_8_end_neon
 load_epel_filterh x7, x6
 mov x10, #(MAX_PB_SIZE * 2)
@@ -3978,10 +3970,8 @@ function hevc_put_hevc_epel_bi_hv32_8_end_neon
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
+.macro epel_bi_hv suffix
+function ff_hevc_put_hevc_epel_bi_hv4_8_\suffix, export=1
 add w10, w5, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -3994,14 +3984,14 @@ function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, 
export=1
 add w3, w5, #3
 mov x4, x6
 mov x5, x7
-bl  X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h4_8_\suffix)
 ldp x4, x5, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
 b   hevc_put_hevc_epel_bi_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv6_8_\suffix, export=1
 add w10, w5, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -4014,14 +4004,14 @@ function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, 
export=1
 add w3, w5, #3
 mov x4, x6
 mov x5, x7
-bl  X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h6_8_\suffix)
 ldp x4, x5, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
 b   hevc_put_hevc_epel_bi_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv8_8_\suffix, export=1
 add w10, w5, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -4034,14 +4024,14 @@ function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, 
export=1
 add w3, w5, #3
 mov x4, x6
 mov x5, x7
-bl  X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h8_8_\suffix)
 ldp x4, x5, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
 b   hevc_put_hevc_epel_bi_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv12_8_\suffix, export=1
 add w10, w5, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -4054,14 +4044,14 @@ function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, 
export=1
 add w3, w5, #3
 mov x4, x6
 mov x5, x7
-bl  X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h12_8_\suffix)
 ldp x4, x5, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldp x7, x30, [sp], #48
 b   hevc_put_hevc_epel_bi_hv12_8_end_neon
 endfunc
 

[FFmpeg-devel] [PATCH 14/21] aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8

2024-03-25 Thread Martin Storsjö
AWS Graviton 3:
put_hevc_qpel_uni_w_h4_8_c: 159.0
put_hevc_qpel_uni_w_h4_8_neon: 64.2
put_hevc_qpel_uni_w_h4_8_i8mm: 40.0
put_hevc_qpel_uni_w_h6_8_c: 344.7
put_hevc_qpel_uni_w_h6_8_neon: 114.5
put_hevc_qpel_uni_w_h6_8_i8mm: 82.0
put_hevc_qpel_uni_w_h8_8_c: 596.2
put_hevc_qpel_uni_w_h8_8_neon: 132.2
put_hevc_qpel_uni_w_h8_8_i8mm: 106.0
put_hevc_qpel_uni_w_h12_8_c: 1325.0
put_hevc_qpel_uni_w_h12_8_neon: 299.0
put_hevc_qpel_uni_w_h12_8_i8mm: 211.5
put_hevc_qpel_uni_w_h16_8_c: 2300.0
put_hevc_qpel_uni_w_h16_8_neon: 422.0
put_hevc_qpel_uni_w_h16_8_i8mm: 286.2
put_hevc_qpel_uni_w_h24_8_c: 5059.0
put_hevc_qpel_uni_w_h24_8_neon: 912.2
put_hevc_qpel_uni_w_h24_8_i8mm: 664.2
put_hevc_qpel_uni_w_h32_8_c: 9198.2
put_hevc_qpel_uni_w_h32_8_neon: 1638.2
put_hevc_qpel_uni_w_h32_8_i8mm: 1033.7
put_hevc_qpel_uni_w_h48_8_c: 20754.7
put_hevc_qpel_uni_w_h48_8_neon: 3633.7
put_hevc_qpel_uni_w_h48_8_i8mm: 2300.7
put_hevc_qpel_uni_w_h64_8_c: 36854.7
put_hevc_qpel_uni_w_h64_8_neon: 6435.7
put_hevc_qpel_uni_w_h64_8_i8mm: 4039.2
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   7 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 405 +-
 2 files changed, 410 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 6110a360d8..ea0d26c019 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -277,6 +277,11 @@ NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst,  ptrdiff_t 
dststride,
 const uint8_t *src, ptrdiff_t srcstride,
 int height, intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
+const uint8_t *_src, ptrdiff_t _srcstride,
+int height, int denom, int wx, int ox,
+intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
 const uint8_t *_src, ptrdiff_t _srcstride,
 int height, int denom, int wx, int ox,
@@ -429,6 +434,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv,);
 NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv,);
 
+NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
+
 if (have_i8mm(cpu_flags)) {
 NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 062b7d4d0f..fba063186c 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2456,8 +2456,10 @@ function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, 
export=1
 ldp x7, x30, [sp], #48
 b   .Lqpel_uni_hv16_loop
 endfunc
+DISABLE_I8MM
+#endif
 
-.macro QPEL_UNI_W_H_HEADER
+.macro QPEL_UNI_W_H_HEADER elems=4s
 ldr x12, [sp]
 sub x2, x2, #3
 movrel  x9, qpel_filters
@@ -2465,11 +2467,410 @@ endfunc
 ld1r    {v28.2d}, [x9]
 mov w10, #-6
 sub w10, w10, w5
-dup v30.4s, w6  // wx
+dup v30.\elems, w6  // wx
 dup v31.4s, w10 // shift
 dup v29.4s, w7  // ox
 .endm
 
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_neon, export=1
+QPEL_UNI_W_H_HEADER 4h
+sxtl    v0.8h,   v28.8b
+1:
+ld1 {v1.8b, v2.8b}, [x2], x3
+subs    w4,  w4,  #1
+uxtl    v1.8h,   v1.8b
+uxtl    v2.8h,   v2.8b
+ext v3.16b,  v1.16b,  v2.16b,  #2
+ext v4.16b,  v1.16b,  v2.16b,  #4
+ext v5.16b,  v1.16b,  v2.16b,  #6
+ext v6.16b,  v1.16b,  v2.16b,  #8
+ext v7.16b,  v1.16b,  v2.16b,  #10
+ext v16.16b, v1.16b,  v2.16b,  #12
+ext v17.16b, v1.16b,  v2.16b,  #14
+mul v18.4h,  v1.4h,   v0.h[0]
+mla v18.4h,  v3.4h,   v0.h[1]
+mla v18.4h,  v4.4h,   v0.h[2]
+mla v18.4h,  v5.4h,   v0.h[3]
+mla v18.4h,  v6.4h,   v0.h[4]
+mla v18.4h,  v7.4h,   v0.h[5]
+mla v18.4h,  v16.4h,  v0.h[6]
+mla v18.4h,  v17.4h,  v0.h[7]
+smull   v16.4s,  v18.4h,  v30.4h
+sqrshl  v16.4s,  v16.4s,  v31.4s
+sqadd   v16.4s,  v16.4s,  v29.4s
+sqxtn   v16.4h,  v16.4s
+sqxtun  v16.8b,  v16.8h
+str s16, [x0]
+add x0,  x0,  x1
+b.hi    1b
+ret
+endfunc
+
+function 

[FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts

2024-03-25 Thread Martin Storsjö
The first horizontal filter can use either i8mm or plain neon
versions, while the second part is a pure neon implementation.
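
To illustrate the split described above, here is a minimal C sketch of the same structure. It is not the FFmpeg code: TMP_STRIDE, h_filter_fn, h_filter_ref, hv_end and put_epel_hv are hypothetical names, the 4-tap arithmetic and the final shift by 6 are assumptions modelled on the usual 8-bit epel path, and the function pointer merely stands in for choosing between an i8mm and a plain neon horizontal pass.

```c
#include <stdint.h>

#define TMP_STRIDE 64               /* assumed stand-in for MAX_PB_SIZE */

/* First part: 4-tap horizontal filter into a 16-bit temporary.
 * Any implementation with this signature can be plugged in. */
typedef void (*h_filter_fn)(int16_t *tmp, const uint8_t *src, int src_stride,
                            int rows, int width, const int8_t taps[4]);

static void h_filter_ref(int16_t *tmp, const uint8_t *src, int src_stride,
                         int rows, int width, const int8_t taps[4])
{
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 4; k++)
                sum += taps[k] * src[x + k - 1];  /* one pixel of left context */
            tmp[x] = (int16_t)sum;
        }
        tmp += TMP_STRIDE;
        src += src_stride;
    }
}

/* Second part: the shared vertical tail, independent of how tmp was made. */
static void hv_end(int16_t *dst, const int16_t *tmp, int height, int width,
                   const int8_t taps[4])
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 4; k++)
                sum += taps[k] * tmp[(y + k) * TMP_STRIDE + x];
            dst[y * TMP_STRIDE + x] = (int16_t)(sum >> 6);
        }
    }
}

/* Entry point: run the chosen horizontal pass over height + 3 rows
 * (starting one row above the block), then branch to the shared tail. */
static void put_epel_hv(int16_t *dst, const uint8_t *src, int src_stride,
                        int height, int width, const int8_t taps_h[4],
                        const int8_t taps_v[4], h_filter_fn h_filter)
{
    int16_t tmp[(64 + 3) * TMP_STRIDE];
    h_filter(tmp, src - src_stride, src_stride, height + 3, width, taps_h);
    hv_end(dst, tmp, height, width, taps_v);
}
```

Only the horizontal pass differs between variants in this sketch; the tail is written once, which is the point of splitting the functions.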
---
 libavcodec/aarch64/hevcdsp_epel_neon.S | 100 +
 1 file changed, 100 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 0e49491a81..6be171ece1 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2186,6 +2186,10 @@ function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_epel_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv4_8_end_neon
 load_epel_filterh x5, x4
 mov x10, #(MAX_PB_SIZE * 2)
 ldr d16, [sp]
@@ -2215,6 +2219,10 @@ function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_epel_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv6_8_end_neon
 load_epel_filterh x5, x4
 mov x5, #120
 mov x10, #(MAX_PB_SIZE * 2)
@@ -2247,6 +2255,10 @@ function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_epel_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv8_8_end_neon
 load_epel_filterh x5, x4
 mov x10, #(MAX_PB_SIZE * 2)
 ldr q16, [sp]
@@ -2277,6 +2289,10 @@ function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_epel_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv12_8_end_neon
 load_epel_filterh x5, x4
 mov x5, #112
 mov x10, #(MAX_PB_SIZE * 2)
@@ -2309,6 +2325,10 @@ function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_epel_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv16_8_end_neon
 load_epel_filterh x5, x4
 mov x10, #(MAX_PB_SIZE * 2)
 ld1 {v16.8h, v17.8h}, [sp], x10
@@ -2340,6 +2360,10 @@ function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
 bl  X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
+b   hevc_put_hevc_epel_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv24_8_end_neon
 load_epel_filterh x5, x4
 mov x10, #(MAX_PB_SIZE * 2)
 ld1 {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -2445,6 +2469,10 @@ function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
+b   hevc_put_hevc_epel_uni_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv4_8_end_neon
 load_epel_filterh x6, x5
 mov x10, #(MAX_PB_SIZE * 2)
 ld1 {v16.4h}, [sp], x10
@@ -2478,6 +2506,10 @@ function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
+b   hevc_put_hevc_epel_uni_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv6_8_end_neon
 load_epel_filterh x6, x5
 sub x1, x1, #4
 mov x10, #(MAX_PB_SIZE * 2)
@@ -2514,6 +2546,10 @@ function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
+b   hevc_put_hevc_epel_uni_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv8_8_end_neon
 load_epel_filterh x6, x5
 mov x10, #(MAX_PB_SIZE * 2)
 ld1 {v16.8h}, [sp], x10
@@ -2548,6 +2584,10 @@ function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, 
export=1
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
+b   hevc_put_hevc_epel_uni_hv12_8_end_neon
+endfunc
+
+function 

[FFmpeg-devel] [PATCH 07/21] aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8

2024-03-25 Thread Martin Storsjö
AWS Graviton 3:
put_hevc_epel_uni_w_h4_8_c: 97.2
put_hevc_epel_uni_w_h4_8_neon: 41.2
put_hevc_epel_uni_w_h4_8_i8mm: 35.2
put_hevc_epel_uni_w_h6_8_c: 203.7
put_hevc_epel_uni_w_h6_8_neon: 84.7
put_hevc_epel_uni_w_h6_8_i8mm: 74.7
put_hevc_epel_uni_w_h8_8_c: 345.7
put_hevc_epel_uni_w_h8_8_neon: 94.0
put_hevc_epel_uni_w_h8_8_i8mm: 80.7
put_hevc_epel_uni_w_h12_8_c: 768.7
put_hevc_epel_uni_w_h12_8_neon: 196.7
put_hevc_epel_uni_w_h12_8_i8mm: 169.7
put_hevc_epel_uni_w_h16_8_c: 1313.0
put_hevc_epel_uni_w_h16_8_neon: 290.7
put_hevc_epel_uni_w_h16_8_i8mm: 238.0
put_hevc_epel_uni_w_h24_8_c: 2877.5
put_hevc_epel_uni_w_h24_8_neon: 650.0
put_hevc_epel_uni_w_h24_8_i8mm: 512.0
put_hevc_epel_uni_w_h32_8_c: 5113.5
put_hevc_epel_uni_w_h32_8_neon: 1129.5
put_hevc_epel_uni_w_h32_8_i8mm: 739.2
put_hevc_epel_uni_w_h48_8_c: 11757.0
put_hevc_epel_uni_w_h48_8_neon: 2518.7
put_hevc_epel_uni_w_h48_8_i8mm: 1688.5
put_hevc_epel_uni_w_h64_8_c: 20478.0
put_hevc_epel_uni_w_h64_8_neon: 4411.7
put_hevc_epel_uni_w_h64_8_i8mm: 2884.0
---
 libavcodec/aarch64/hevcdsp_epel_neon.S| 326 +-
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 +
 2 files changed, 319 insertions(+), 13 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 419e83529a..0e49491a81 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1520,6 +1520,319 @@ function ff_hevc_put_hevc_epel_h32_8_neon, export=1
 ret
 endfunc
 
+.macro EPEL_UNI_W_H_HEADER elems=4s
+ldr x12, [sp]
+sub x2, x2, #1
+movrel  x9, epel_filters
+add x9, x9, x12, lsl #2
+ld1r    {v28.4s}, [x9]
+mov w10, #-6
+sub w10, w10, w5
+dup v30.\elems, w6
+dup v31.4s, w10
+dup v29.4s, w7
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_h4_8_neon, export=1
+EPEL_UNI_W_H_HEADER 4h
+sxtl    v0.8h,   v28.8b
+1:
+ld1 {v4.8b}, [x2], x3
+subs    w4,  w4,  #1
+uxtl    v4.8h,   v4.8b
+ext v5.16b,  v4.16b,  v4.16b,  #2
+ext v6.16b,  v4.16b,  v4.16b,  #4
+ext v7.16b,  v4.16b,  v4.16b,  #6
+mul v16.4h,  v4.4h,   v0.h[0]
+mla v16.4h,  v5.4h,   v0.h[1]
+mla v16.4h,  v6.4h,   v0.h[2]
+mla v16.4h,  v7.4h,   v0.h[3]
+smull   v16.4s,  v16.4h,  v30.4h
+sqrshl  v16.4s,  v16.4s,  v31.4s
+sqadd   v16.4s,  v16.4s,  v29.4s
+sqxtn   v16.4h,  v16.4s
+sqxtun  v16.8b,  v16.8h
+str s16, [x0]
+add x0,  x0,  x1
+b.hi    1b
+ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_neon, export=1
+EPEL_UNI_W_H_HEADER 8h
+sub x1,  x1,  #4
+sxtl    v0.8h,   v28.8b
+1:
+ld1 {v3.8b, v4.8b}, [x2], x3
+subs    w4,  w4,  #1
+uxtl    v3.8h,   v3.8b
+uxtl    v4.8h,   v4.8b
+ext v5.16b,  v3.16b,  v4.16b,  #2
+ext v6.16b,  v3.16b,  v4.16b,  #4
+ext v7.16b,  v3.16b,  v4.16b,  #6
+mul v16.8h,  v3.8h,   v0.h[0]
+mla v16.8h,  v5.8h,   v0.h[1]
+mla v16.8h,  v6.8h,   v0.h[2]
+mla v16.8h,  v7.8h,   v0.h[3]
+smull   v17.4s,  v16.4h,  v30.4h
+smull2  v18.4s,  v16.8h,  v30.8h
+sqrshl  v17.4s,  v17.4s,  v31.4s
+sqrshl  v18.4s,  v18.4s,  v31.4s
+sqadd   v17.4s,  v17.4s,  v29.4s
+sqadd   v18.4s,  v18.4s,  v29.4s
+sqxtn   v16.4h,  v17.4s
+sqxtn2  v16.8h,  v18.4s
+sqxtun  v16.8b,  v16.8h
+str s16, [x0], #4
+st1 {v16.h}[2], [x0], x1
+b.hi    1b
+ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_neon, export=1
+EPEL_UNI_W_H_HEADER 8h
+sxtl    v0.8h,   v28.8b
+1:
+ld1 {v3.8b, v4.8b}, [x2], x3
+subs    w4,  w4,  #1
+uxtl    v3.8h,   v3.8b
+uxtl    v4.8h,   v4.8b
+ext v5.16b,  v3.16b,  v4.16b,  #2
+ext v6.16b,  v3.16b,  v4.16b,  #4
+ext v7.16b,  v3.16b,  v4.16b,  #6
+mul v16.8h,  v3.8h,   v0.h[0]
+mla v16.8h,  v5.8h,   v0.h[1]
+mla v16.8h,  v6.8h,   v0.h[2]
+mla v16.8h,  v7.8h,   v0.h[3]
+smull   v17.4s,  v16.4h,  v30.4h
+smull2  v18.4s,  v16.8h,  v30.8h
+sqrshl  v17.4s,  

[FFmpeg-devel] [PATCH 06/21] aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8

2024-03-25 Thread Martin Storsjö
AWS Graviton 3:
put_hevc_epel_h4_8_c: 64.7
put_hevc_epel_h4_8_neon: 25.0
put_hevc_epel_h4_8_i8mm: 21.2
put_hevc_epel_h6_8_c: 130.0
put_hevc_epel_h6_8_neon: 40.7
put_hevc_epel_h6_8_i8mm: 36.5
put_hevc_epel_h8_8_c: 209.0
put_hevc_epel_h8_8_neon: 45.2
put_hevc_epel_h8_8_i8mm: 41.2
put_hevc_epel_h12_8_c: 465.5
put_hevc_epel_h12_8_neon: 104.5
put_hevc_epel_h12_8_i8mm: 86.5
put_hevc_epel_h16_8_c: 830.7
put_hevc_epel_h16_8_neon: 134.2
put_hevc_epel_h16_8_i8mm: 114.0
put_hevc_epel_h24_8_c: 1844.7
put_hevc_epel_h24_8_neon: 282.2
put_hevc_epel_h24_8_i8mm: 277.2
put_hevc_epel_h32_8_c: 3227.5
put_hevc_epel_h32_8_neon: 501.5
put_hevc_epel_h32_8_i8mm: 396.0
put_hevc_epel_h48_8_c: 7229.2
put_hevc_epel_h48_8_neon: 1120.2
put_hevc_epel_h48_8_i8mm: 901.2
put_hevc_epel_h64_8_c: 12869.0
put_hevc_epel_h64_8_neon: 1999.2
put_hevc_epel_h64_8_i8mm: 1610.5
---
 libavcodec/aarch64/hevcdsp_epel_neon.S| 194 +-
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  17 ++
 2 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index d3f0a26f79..419e83529a 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1321,8 +1321,6 @@ function ff_hevc_put_hevc_epel_uni_v64_8_neon, export=1
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
 
 .macro EPEL_H_HEADER
 movrel  x5, epel_filters
@@ -1332,6 +1330,198 @@ ENABLE_I8MM
 mov x10, #(MAX_PB_SIZE * 2)
 .endm
 
+function ff_hevc_put_hevc_epel_h4_8_neon, export=1
+EPEL_H_HEADER
+sxtl    v0.8h,   v30.8b
+1:  ld1 {v4.8b}, [x1], x2
+subs    w3,  w3,  #1   // height
+uxtl    v4.8h,   v4.8b
+ext v5.16b,  v4.16b,  v4.16b,  #2
+ext v6.16b,  v4.16b,  v4.16b,  #4
+ext v7.16b,  v4.16b,  v4.16b,  #6
+mul v16.4h,  v4.4h,   v0.h[0]
+mla v16.4h,  v5.4h,   v0.h[1]
+mla v16.4h,  v6.4h,   v0.h[2]
+mla v16.4h,  v7.4h,   v0.h[3]
+st1 {v16.4h}, [x0], x10
+b.ne    1b
+ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h6_8_neon, export=1
+EPEL_H_HEADER
+sxtl    v0.8h,   v30.8b
+add x6,  x0,  #8
+1:  ld1 {v3.16b},  [x1], x2
+subs    w3,  w3,  #1   // height
+uxtl2   v4.8h,   v3.16b
+uxtl    v3.8h,   v3.8b
+ext v5.16b,  v3.16b,  v4.16b,  #2
+ext v6.16b,  v3.16b,  v4.16b,  #4
+ext v7.16b,  v3.16b,  v4.16b,  #6
+mul v16.8h,  v3.8h,   v0.h[0]
+mla v16.8h,  v5.8h,   v0.h[1]
+mla v16.8h,  v6.8h,   v0.h[2]
+mla v16.8h,  v7.8h,   v0.h[3]
+st1 {v16.4h},   [x0], x10
+st1 {v16.s}[2], [x6], x10
+b.ne    1b
+ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h8_8_neon, export=1
+EPEL_H_HEADER
+sxtl    v0.8h,   v30.8b
+1:  ld1 {v3.16b},  [x1], x2
+subs    w3,  w3,  #1   // height
+uxtl2   v4.8h,   v3.16b
+uxtl    v3.8h,   v3.8b
+ext v5.16b,  v3.16b,  v4.16b,  #2
+ext v6.16b,  v3.16b,  v4.16b,  #4
+ext v7.16b,  v3.16b,  v4.16b,  #6
+mul v16.8h,  v3.8h,   v0.h[0]
+mla v16.8h,  v5.8h,   v0.h[1]
+mla v16.8h,  v6.8h,   v0.h[2]
+mla v16.8h,  v7.8h,   v0.h[3]
+st1 {v16.8h},   [x0], x10
+b.ne    1b
+ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h12_8_neon, export=1
+EPEL_H_HEADER
+add x6,  x0,  #16
+sxtl    v0.8h,   v30.8b
+1:  ld1 {v3.16b}, [x1], x2
+subs    w3,  w3,  #1   // height
+uxtl2   v4.8h,   v3.16b
+uxtl    v3.8h,   v3.8b
+ext v5.16b,  v3.16b,  v4.16b,  #2
+ext v6.16b,  v3.16b,  v4.16b,  #4
+ext v7.16b,  v3.16b,  v4.16b,  #6
+ext v20.16b, v4.16b,  v4.16b,  #2
+ext v21.16b, v4.16b,  v4.16b,  #4
+ext v22.16b, v4.16b,  v4.16b,  #6
+mul v16.8h,  v3.8h,   v0.h[0]
+mla v16.8h,  v5.8h,   v0.h[1]
+mla v16.8h,  v6.8h,   v0.h[2]
+mla v16.8h,  v7.8h,   v0.h[3]
+mul v17.4h,  v4.4h,   v0.h[0]
+mla v17.4h,  v20.4h,  v0.h[1]
+mla v17.4h,  v21.4h,  v0.h[2]
+mla v17.4h,  v22.4h,  v0.h[3]
+st1 {v16.8h}, [x0], x10
+st1 

[FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm

2024-03-25 Thread Martin Storsjö
AWS Graviton 3:
put_hevc_epel_uni_w_hv4_8_c: 191.2
put_hevc_epel_uni_w_hv4_8_neon: 87.7
put_hevc_epel_uni_w_hv4_8_i8mm: 83.2
put_hevc_epel_uni_w_hv6_8_c: 349.5
put_hevc_epel_uni_w_hv6_8_neon: 153.0
put_hevc_epel_uni_w_hv6_8_i8mm: 148.5
put_hevc_epel_uni_w_hv8_8_c: 581.2
put_hevc_epel_uni_w_hv8_8_neon: 166.7
put_hevc_epel_uni_w_hv8_8_i8mm: 163.5
put_hevc_epel_uni_w_hv12_8_c: 1230.0
put_hevc_epel_uni_w_hv12_8_neon: 387.7
put_hevc_epel_uni_w_hv12_8_i8mm: 370.2
put_hevc_epel_uni_w_hv16_8_c: 2003.2
put_hevc_epel_uni_w_hv16_8_neon: 501.5
put_hevc_epel_uni_w_hv16_8_i8mm: 490.2
put_hevc_epel_uni_w_hv24_8_c: 4448.7
put_hevc_epel_uni_w_hv24_8_neon: 1092.2
put_hevc_epel_uni_w_hv24_8_i8mm: 1069.7
put_hevc_epel_uni_w_hv32_8_c: 7817.2
put_hevc_epel_uni_w_hv32_8_neon: 1916.2
put_hevc_epel_uni_w_hv32_8_i8mm: 1829.5
put_hevc_epel_uni_w_hv48_8_c: 16728.2
put_hevc_epel_uni_w_hv48_8_neon: 4263.7
put_hevc_epel_uni_w_hv48_8_i8mm: 4342.7
put_hevc_epel_uni_w_hv64_8_c: 29563.2
put_hevc_epel_uni_w_hv64_8_neon: 7474.2
put_hevc_epel_uni_w_hv64_8_i8mm: 7128.5
---
 libavcodec/aarch64/hevcdsp_epel_neon.S| 55 ---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  6 +++
 2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 876db9d449..d0c6205e1c 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -3573,10 +3573,8 @@ function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
 ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
+.macro epel_uni_w_hv suffix
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_\suffix, export=1
 epel_uni_w_hv_start
 sxtw    x4, w4
 
@@ -3591,14 +3589,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, 
export=1
 mov x2, x3
 add x3, x4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h4_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_w_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_\suffix, export=1
 epel_uni_w_hv_start
 sxtw    x4, w4
 
@@ -3613,14 +3611,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, 
export=1
 mov x2, x3
 add x3, x4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h6_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_w_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_\suffix, export=1
 epel_uni_w_hv_start
 sxtw    x4, w4
 
@@ -3635,14 +3633,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, 
export=1
 mov x2, x3
 add x3, x4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h8_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_w_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_\suffix, export=1
 epel_uni_w_hv_start
 sxtw    x4, w4
 
@@ -3657,14 +3655,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, 
export=1
 mov x2, x3
 add x3, x4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h12_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_w_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_\suffix, export=1
 epel_uni_w_hv_start
 sxtw    x4, w4
 
@@ -3679,14 +3677,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, 
export=1
 mov x2, x3
 add x3, x4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h16_8_\suffix)
 ldp x4, x6, [sp, #16]

[FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm

2024-03-25 Thread Martin Storsjö
AWS Graviton 3:
put_hevc_epel_uni_hv4_8_c: 163.5
put_hevc_epel_uni_hv4_8_neon: 59.7
put_hevc_epel_uni_hv4_8_i8mm: 57.5
put_hevc_epel_uni_hv6_8_c: 344.7
put_hevc_epel_uni_hv6_8_neon: 105.0
put_hevc_epel_uni_hv6_8_i8mm: 102.7
put_hevc_epel_uni_hv8_8_c: 552.2
put_hevc_epel_uni_hv8_8_neon: 111.2
put_hevc_epel_uni_hv8_8_i8mm: 104.0
put_hevc_epel_uni_hv12_8_c: 1195.0
put_hevc_epel_uni_hv12_8_neon: 248.7
put_hevc_epel_uni_hv12_8_i8mm: 229.5
put_hevc_epel_uni_hv16_8_c: 1910.2
put_hevc_epel_uni_hv16_8_neon: 339.5
put_hevc_epel_uni_hv16_8_i8mm: 323.2
put_hevc_epel_uni_hv24_8_c: 4048.2
put_hevc_epel_uni_hv24_8_neon: 737.7
put_hevc_epel_uni_hv24_8_i8mm: 713.7
put_hevc_epel_uni_hv32_8_c: 6865.7
put_hevc_epel_uni_hv32_8_neon: 1285.0
put_hevc_epel_uni_hv32_8_i8mm: 1206.0
put_hevc_epel_uni_hv48_8_c: 15830.5
put_hevc_epel_uni_hv48_8_neon: 2844.7
put_hevc_epel_uni_hv48_8_i8mm: 2914.0
put_hevc_epel_uni_hv64_8_c: 27912.7
put_hevc_epel_uni_hv64_8_neon: 4970.5
put_hevc_epel_uni_hv64_8_i8mm: 4653.7
---
 libavcodec/aarch64/hevcdsp_epel_neon.S| 67 +++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  5 ++
 2 files changed, 38 insertions(+), 34 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 024464723b..876db9d449 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2460,14 +2460,6 @@ endfunc
 
 epel_hv neon
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-epel_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
 function hevc_put_hevc_epel_uni_hv4_8_end_neon
 load_epel_filterh x6, x5
 mov x10, #(MAX_PB_SIZE * 2)
@@ -2596,10 +2588,8 @@ function hevc_put_hevc_epel_uni_hv24_8_end_neon
 2:  ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
+.macro epel_uni_hv suffix
+function ff_hevc_put_hevc_epel_uni_hv4_8_\suffix, export=1
 add w10, w4, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2611,14 +2601,14 @@ function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, 
export=1
 mov x2, x3
 add w3, w4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h4_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv6_8_\suffix, export=1
 add w10, w4, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2630,14 +2620,14 @@ function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, 
export=1
 mov x2, x3
 add w3, w4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h6_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv8_8_\suffix, export=1
 add w10, w4, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2649,14 +2639,14 @@ function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, 
export=1
 mov x2, x3
 add w3, w4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h8_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv12_8_\suffix, export=1
 add w10, w4, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2668,14 +2658,14 @@ function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, 
export=1
 mov x2, x3
 add w3, w4, #3
 mov x4, x5
-bl  X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h12_8_\suffix)
 ldp x4, x6, [sp, #16]
 ldp x0, x1, [sp, #32]
 ldr x30, [sp], #48
 b   hevc_put_hevc_epel_uni_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv16_8_\suffix, export=1
 add 

[FFmpeg-devel] [PATCH 05/21] aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h

2024-03-25 Thread Martin Storsjö
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 0fcded344b..062b7d4d0f 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2462,8 +2462,7 @@ endfunc
 sub x2, x2, #3
 movrel  x9, qpel_filters
 add x9, x9, x12, lsl #3
-ldr x11, [x9]
-dup v28.2d, x11
+ld1r    {v28.2d}, [x9]
 mov w10, #-6
 sub w10, w10, w5
 dup v30.4s, w6  // wx
-- 
2.39.3 (Apple Git-146)


[FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm

2024-03-25 Thread Martin Storsjö
AWS Graviton 3:
put_hevc_epel_hv4_8_c: 163.7
put_hevc_epel_hv4_8_neon: 52.5
put_hevc_epel_hv4_8_i8mm: 49.5
put_hevc_epel_hv6_8_c: 292.2
put_hevc_epel_hv6_8_neon: 97.7
put_hevc_epel_hv6_8_i8mm: 101.2
put_hevc_epel_hv8_8_c: 471.0
put_hevc_epel_hv8_8_neon: 106.7
put_hevc_epel_hv8_8_i8mm: 102.5
put_hevc_epel_hv12_8_c: 1030.2
put_hevc_epel_hv12_8_neon: 240.5
put_hevc_epel_hv12_8_i8mm: 215.0
put_hevc_epel_hv16_8_c: 1711.5
put_hevc_epel_hv16_8_neon: 340.2
put_hevc_epel_hv16_8_i8mm: 319.2
put_hevc_epel_hv24_8_c: 3670.0
put_hevc_epel_hv24_8_neon: 702.0
put_hevc_epel_hv24_8_i8mm: 666.5
put_hevc_epel_hv32_8_c: 6785.5
put_hevc_epel_hv32_8_neon: 1247.0
put_hevc_epel_hv32_8_i8mm: 1169.0
put_hevc_epel_hv48_8_c: 14689.7
put_hevc_epel_hv48_8_neon: 2665.2
put_hevc_epel_hv48_8_i8mm: 2740.0
put_hevc_epel_hv64_8_c: 25899.2
put_hevc_epel_hv64_8_neon: 4801.2
put_hevc_epel_hv64_8_i8mm: 4487.7
---
 libavcodec/aarch64/hevcdsp_epel_neon.S| 58 +--
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  6 +++
 2 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 2088630da1..024464723b 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2298,10 +2298,8 @@ function hevc_put_hevc_epel_hv24_8_end_neon
 2:  ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
+.macro epel_hv suffix
+function ff_hevc_put_hevc_epel_hv4_8_\suffix, export=1
 add w10, w3, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2310,13 +2308,13 @@ function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
 add x0, sp, #32
 sub x1, x1, x2
 add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h4_8_\suffix)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
 b   hevc_put_hevc_epel_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv6_8_\suffix, export=1
 add w10, w3, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2325,13 +2323,13 @@ function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
 add x0, sp, #32
 sub x1, x1, x2
 add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h6_8_\suffix)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
 b   hevc_put_hevc_epel_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv8_8_\suffix, export=1
 add w10, w3, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2340,13 +2338,13 @@ function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
 add x0, sp, #32
 sub x1, x1, x2
 add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h8_8_\suffix)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
 b   hevc_put_hevc_epel_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv12_8_\suffix, export=1
 add w10, w3, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2355,13 +2353,13 @@ function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, 
export=1
 add x0, sp, #32
 sub x1, x1, x2
 add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h12_8_\suffix)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
 b   hevc_put_hevc_epel_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv16_8_\suffix, export=1
 add w10, w3, #3
 lsl x10, x10, #7
 sub sp, sp, x10 // tmp_array
@@ -2370,13 +2368,13 @@ function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, 
export=1
 add x0, sp, #32
 sub x1, x1, x2
 add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+bl  X(ff_hevc_put_hevc_epel_h16_8_\suffix)
 ldp x0, x3, [sp, #16]
 ldp x5, x30, [sp], #32
 b   

[FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating

2024-03-25 Thread Martin Storsjö
This is a pure reordering of code without changing anything in
the individual functions.
---
 libavcodec/aarch64/hevcdsp_epel_neon.S | 971 +
 1 file changed, 497 insertions(+), 474 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 6be171ece1..2088630da1 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2173,21 +2173,9 @@ function ff_hevc_put_hevc_epel_h64_8_neon_i8mm, export=1
 ret
 endfunc
 
+DISABLE_I8MM
+#endif
 
-function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
-add w10, w3, #3
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x5, x30, [sp, #-32]!
-stp x0, x3, [sp, #16]
-add x0, sp, #32
-sub x1, x1, x2
-add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
-ldp x0, x3, [sp, #16]
-ldp x5, x30, [sp], #32
-b   hevc_put_hevc_epel_hv4_8_end_neon
-endfunc
 
 function hevc_put_hevc_epel_hv4_8_end_neon
 load_epel_filterh x5, x4
@@ -2207,21 +2195,6 @@ function hevc_put_hevc_epel_hv4_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
-add w10, w3, #3
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x5, x30, [sp, #-32]!
-stp x0, x3, [sp, #16]
-add x0, sp, #32
-sub x1, x1, x2
-add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
-ldp x0, x3, [sp, #16]
-ldp x5, x30, [sp], #32
-b   hevc_put_hevc_epel_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv6_8_end_neon
 load_epel_filterh x5, x4
 mov x5, #120
@@ -2243,21 +2216,6 @@ function hevc_put_hevc_epel_hv6_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
-add w10, w3, #3
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x5, x30, [sp, #-32]!
-stp x0, x3, [sp, #16]
-add x0, sp, #32
-sub x1, x1, x2
-add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
-ldp x0, x3, [sp, #16]
-ldp x5, x30, [sp], #32
-b   hevc_put_hevc_epel_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv8_8_end_neon
 load_epel_filterh x5, x4
 mov x10, #(MAX_PB_SIZE * 2)
@@ -2277,21 +2235,6 @@ function hevc_put_hevc_epel_hv8_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
-add w10, w3, #3
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x5, x30, [sp, #-32]!
-stp x0, x3, [sp, #16]
-add x0, sp, #32
-sub x1, x1, x2
-add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
-ldp x0, x3, [sp, #16]
-ldp x5, x30, [sp], #32
-b   hevc_put_hevc_epel_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv12_8_end_neon
 load_epel_filterh x5, x4
 mov x5, #112
@@ -2313,21 +2256,6 @@ function hevc_put_hevc_epel_hv12_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
-add w10, w3, #3
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x5, x30, [sp, #-32]!
-stp x0, x3, [sp, #16]
-add x0, sp, #32
-sub x1, x1, x2
-add w3, w3, #3
-bl  X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
-ldp x0, x3, [sp, #16]
-ldp x5, x30, [sp], #32
-b   hevc_put_hevc_epel_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv16_8_end_neon
 load_epel_filterh x5, x4
 mov x10, #(MAX_PB_SIZE * 2)
@@ -2348,21 +2276,6 @@ function hevc_put_hevc_epel_hv16_8_end_neon
 2:  ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
-add w10, w3, #3
-lsl x10, x10, #7
-sub sp, sp, x10 // tmp_array
-stp x5, x30, [sp, #-32]!
-stp x0, x3, [sp, #16]
-add x0, sp, #32
-sub x1, x1, x2
-add w3, w3, #3
-bl

[FFmpeg-devel] [PATCH 04/21] aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping

2024-03-25 Thread Martin Storsjö
For widths of 32 pixels and more, loop first horizontally,
then vertically.

Previously, this function would process a 16 pixel wide slice
of the block, looping vertically. After processing the whole
height, it would backtrack and process the next 16 pixel wide
slice.

When doing 8tap filtering horizontally, the function must load
7 more pixels (in practice, 8) following the actual inputs, and
this was done for each slice.

By iterating horizontally across each line first, then
vertically, we access data in a more cache-friendly order, and
we no longer reload the extra trailing pixels for every slice.

Keep the original order in put_hevc_\type\()_h12_8_neon; the
only suboptimal case there is for width=24. But specializing
an optimal variant for that would require more code, which
might not be worth it.

For the h16 case, this implementation would give a slowdown,
as it now loads the first 8 pixels separately from the rest, but
for larger widths, it is a gain. Therefore, keep the h16 case
as it was (but remove the outer loop), and create a new specialized
version for horizontal looping with 16 pixels at a time.
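
As a rough illustration of the two traversal orders described above, the following C sketch counts source bytes loaded per block. It is a simplified model rather than the actual assembly: loads_slice_order and loads_row_order are hypothetical helpers, and the model assumes every step loads exactly 16 new pixels plus 8 pixels of extra context where the text says they are needed.

```c
/* Hypothetical model, not FFmpeg code: count the source bytes loaded for
 * one block by an 8-tap horizontal filter that loads 8 extra pixels of
 * context beyond the pixels it produces. */

/* Old order: walk each 16 pixel wide slice top to bottom, then backtrack
 * and process the next slice. */
static long loads_slice_order(int width, int height)
{
    long loads = 0;
    for (int x = 0; x < width; x += 16)      /* one slice at a time */
        for (int y = 0; y < height; y++)     /* full height per slice */
            loads += 16 + 8;                 /* slice plus its own context */
    return loads;
}

/* New order: finish each line horizontally before moving down. */
static long loads_row_order(int width, int height)
{
    long loads = 0;
    for (int y = 0; y < height; y++) {
        loads += 8;                          /* first 8 pixels, loaded separately */
        for (int x = 0; x < width; x += 16)
            loads += 16;                     /* each step adds 16 new pixels */
    }
    return loads;
}
```

Under this model a 16x16 block loads 384 bytes either way, while a 64x64 block loads 6144 bytes slice by slice and 4608 bytes line by line. The model ignores the cost of the separate first load, which the text above names as the reason for the h16 slowdown.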

Before:                    Cortex A53      A72      A73  Graviton 3
put_hevc_qpel_h16_8_neon:       710.5    667.7    692.5       211.0
put_hevc_qpel_h32_8_neon:      2791.5   2643.5   2732.0       883.5
put_hevc_qpel_h64_8_neon:     10954.0  10657.0  10874.2      3241.5
After:
put_hevc_qpel_h16_8_neon:       697.5    663.5    705.7       212.5
put_hevc_qpel_h32_8_neon:      2767.2   2684.5   2791.2       920.5
put_hevc_qpel_h64_8_neon:     10559.2  10471.5  10932.2      3051.7
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  20 +++--
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 103 +-
 2 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index d2f2a3681f..1e9f5e32db 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -109,6 +109,8 @@ void ff_hevc_put_hevc_qpel_h12_8_neon(int16_t *dst, const 
uint8_t *_src, ptrdiff
   intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_hevc_qpel_h16_8_neon(int16_t *dst, const uint8_t *_src, 
ptrdiff_t _srcstride, int height,
   intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h32_8_neon(int16_t *dst, const uint8_t *_src, 
ptrdiff_t _srcstride, int height,
+  intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_hevc_qpel_uni_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, 
const uint8_t *_src,
  ptrdiff_t _srcstride, int height, 
intptr_t mx, intptr_t my,
  int width);
@@ -124,6 +126,9 @@ void ff_hevc_put_hevc_qpel_uni_h12_8_neon(uint8_t *_dst, 
ptrdiff_t _dststride, c
 void ff_hevc_put_hevc_qpel_uni_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, 
const uint8_t *_src,
   ptrdiff_t _srcstride, int height, 
intptr_t mx, intptr_t
   my, int width);
+void ff_hevc_put_hevc_qpel_uni_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, 
const uint8_t *_src,
+  ptrdiff_t _srcstride, int height, 
intptr_t mx, intptr_t
+  my, int width);
 void ff_hevc_put_hevc_qpel_bi_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, 
const uint8_t *_src,
 ptrdiff_t _srcstride, const int16_t 
*src2, int height, intptr_t
 mx, intptr_t my, int width);
@@ -139,6 +144,9 @@ void ff_hevc_put_hevc_qpel_bi_h12_8_neon(uint8_t *_dst, 
ptrdiff_t _dststride, co
 void ff_hevc_put_hevc_qpel_bi_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, 
const uint8_t *_src,
  ptrdiff_t _srcstride, const int16_t 
*src2, int height, intptr_t
  mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, 
const uint8_t *_src,
+ ptrdiff_t _srcstride, const int16_t 
*src2, int height, intptr_t
+ mx, intptr_t my, int width);
 
 #define NEON8_FNPROTO(fn, args, ext) \
 void ff_hevc_put_hevc_##fn##4_8_neon##ext args; \
@@ -335,28 +343,28 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 c->put_hevc_qpel[3][0][1]  = ff_hevc_put_hevc_qpel_h8_8_neon;
 c->put_hevc_qpel[4][0][1]  =
 c->put_hevc_qpel[6][0][1]  = ff_hevc_put_hevc_qpel_h12_8_neon;
-c->put_hevc_qpel[5][0][1]  =
+c->put_hevc_qpel[5][0][1]  = ff_hevc_put_hevc_qpel_h16_8_neon;
 c->put_hevc_qpel[7][0][1]  =
 c->put_hevc_qpel[8][0][1]  =
-c->put_hevc_qpel[9][0][1]  = 

[FFmpeg-devel] [PATCH 03/21] aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon

2024-03-25 Thread Martin Storsjö
This gets rid of a couple instructions, but the actual performance
is almost identical on Cortex A72/A73. On Cortex A53, it is a
handful of cycles faster.
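
The pointer arithmetic behind the merged stores can be checked with a small C sketch. This is only a sanity check of the qpel path in the diff below, under the assumption that MAX_PB_SIZE is 64; the names are hypothetical.

```c
#include <assert.h>

enum { TOY_MAX_PB_SIZE = 64 }; /* assumed value of MAX_PB_SIZE */

/* Old qpel path: "st1 ..., #16" followed by "st1 ..., x14"
 * with x14 = (MAX_PB_SIZE << 2) - 16. */
static long toy_advance_split(void)
{
    return 16 + ((TOY_MAX_PB_SIZE << 2) - 16);
}

/* New qpel path: one st1 of two registers, post-incremented by
 * x14 = MAX_PB_SIZE << 2. */
static long toy_advance_merged(void)
{
    return TOY_MAX_PB_SIZE << 2;
}

/* Both variants must move the destination pointer by the same amount. */
static void toy_check_advance(void)
{
    assert(toy_advance_split() == toy_advance_merged());
}
```

Either way the destination pointer moves by two output rows per iteration, so dropping the "- 16" is what keeps the merged store equivalent.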
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 815d897094..432558bb95 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -512,11 +512,10 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 .ifc \type, qpel
 mov dststride, #(MAX_PB_SIZE << 1)
 lsl x13, srcstride, #1 // srcstridel
-mov x14, #((MAX_PB_SIZE << 2) - 16)
+mov x14, #(MAX_PB_SIZE << 2)
 .else
 lsl x14, dststride, #1 // dststridel
 lsl x13, srcstride, #1 // srcstridel
-sub x14, x14, #8
 .endif
 add x10, dst, dststride // dstb
 add x12, src, srcstride // srcb
@@ -527,10 +526,8 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 bl  ff_hevc_put_hevc_h16_8_neon
 
 .ifc \type, qpel
-st1 {v26.8h}, [dst], #16
-st1 {v28.8h}, [x10], #16
-st1 {v27.8h}, [dst], x14
-st1 {v29.8h}, [x10], x14
+st1 {v26.8h, v27.8h}, [dst], x14
+st1 {v28.8h, v29.8h}, [x10], x14
 .else
 .ifc \type, qpel_bi
 ld1 {v16.8h, v17.8h}, [ x4], x16
@@ -549,10 +546,8 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 sqrshrun v28.8b, v28.8h, #6
 sqrshrun v29.8b, v29.8h, #6
 .endif
-st1 {v26.8b}, [dst], #8
-st1 {v28.8b}, [x10], #8
-st1 {v27.8b}, [dst], x14
-st1 {v29.8b}, [x10], x14
+st1 {v26.8b, v27.8b}, [dst], x14
+st1 {v28.8b, v29.8b}, [x10], x14
 .endif
 b.gt 1b // double line
 subs width, width, #16
-- 
2.39.3 (Apple Git-146)



[FFmpeg-devel] [PATCH 02/21] aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm

2024-03-25 Thread Martin Storsjö
Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon
store temporary buffers on the stack. When consuming it,
many of these functions use the stack pointer as incremental pointer
for reading the data (instead of storing it in another register),
which is rather unusual.

Technically, this is fine as long as the pointer remains properly
aligned.

However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm,
after incrementing sp when reading data (within each 16 pixel
wide stripe) it would then reset the stack pointer back to a lower
value, for reading the next 16 pixel wide stripe, expecting the
data to remain untouched.

This can't be assumed; data on the stack below the stack pointer
can be clobbered (e.g. by a signal handler). Some OS ABIs
allow for a little margin that won't be touched, aka a red zone,
but not all do. The ones that do, guarantee 16 or 128 bytes, not
9 KB.

Convert this function to use a separate pointer register to
iterate through the data, retaining the stack pointer to point
at the bottom of the data we require to remain untouched.
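
The hazard can be modelled in plain C. The sketch below is a toy simulation, not the real code: the "stack" is an array, sp is an index, and toy_signal() stands in for a signal handler that is allowed to overwrite everything below sp. All names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_STACK_SIZE = 64, TOY_BUF_SIZE = 32 };

typedef struct ToyStack {
    unsigned char mem[TOY_STACK_SIZE];
    size_t sp; /* index of the lowest live byte; the stack grows downwards */
} ToyStack;

/* Models an asynchronous signal: anything below sp may be overwritten. */
static void toy_signal(ToyStack *s)
{
    memset(s->mem, 0xAA, s->sp);
}

/* Allocate a temporary buffer on the toy stack and fill it. */
static void toy_alloc_buffer(ToyStack *s)
{
    s->sp = TOY_STACK_SIZE - TOY_BUF_SIZE;
    memset(s->mem + s->sp, 0x11, TOY_BUF_SIZE);
}

/* Old scheme: sp itself walks through the buffer, then gets reset. */
static int toy_intact_when_iterating_with_sp(void)
{
    ToyStack s;

    toy_alloc_buffer(&s);
    s.sp += TOY_BUF_SIZE;   /* first pass consumed the buffer via sp */
    toy_signal(&s);         /* signal arrives before the second pass */
    s.sp -= TOY_BUF_SIZE;   /* reset sp for the next stripe */
    return s.mem[s.sp] == 0x11;
}

/* New scheme: sp stays at the bottom, a separate index does the walking. */
static int toy_intact_with_separate_pointer(void)
{
    ToyStack s;
    size_t ptr;

    toy_alloc_buffer(&s);
    ptr = s.sp + TOY_BUF_SIZE;  /* first pass walks with ptr only */
    toy_signal(&s);             /* may only touch bytes below sp */
    ptr = s.sp;                 /* rewind ptr for the next stripe */
    assert(ptr < TOY_STACK_SIZE);
    return s.mem[ptr] == 0x11;
}
```

In the first variant the buffer is gone by the time the second pass starts; in the second it survives, because sp never moved above the data that still had to be read.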
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 130 +
 1 file changed, 66 insertions(+), 64 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S 
b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 9be29cafe2..815d897094 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -3981,24 +3981,25 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, 
export=1
 mov x11, sp
 mov w12, w22
 mov x13, x20
+mov x14, sp
 3:
-ldp q16, q1, [sp]
-add sp, sp, x10
-ldp q17, q2, [sp]
-add sp, sp, x10
-ldp q18, q3, [sp]
-add sp, sp, x10
-ldp q19, q4, [sp]
-add sp, sp, x10
-ldp q20, q5, [sp]
-add sp, sp, x10
-ldp q21, q6, [sp]
-add sp, sp, x10
-ldp q22, q7, [sp]
-add sp, sp, x10
+ldp q16, q1, [x11]
+add x11, x11, x10
+ldp q17, q2, [x11]
+add x11, x11, x10
+ldp q18, q3, [x11]
+add x11, x11, x10
+ldp q19, q4, [x11]
+add x11, x11, x10
+ldp q20, q5, [x11]
+add x11, x11, x10
+ldp q21, q6, [x11]
+add x11, x11, x10
+ldp q22, q7, [x11]
+add x11, x11, x10
 1:
-ldp q23, q31, [sp]
-add sp, sp, x10
+ldp q23, q31, [x11]
+add x11, x11, x10
 QPEL_FILTER_H   v24, v16, v17, v18, v19, v20, v21, v22, v23
 QPEL_FILTER_H2  v25, v16, v17, v18, v19, v20, v21, v22, v23
 QPEL_FILTER_H   v26,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
@@ -4007,8 +4008,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, 
export=1
 subs w22, w22, #1
 b.eq 2f
 
-ldp q16, q1, [sp]
-add sp, sp, x10
+ldp q16, q1, [x11]
+add x11, x11, x10
 QPEL_FILTER_H   v24, v17, v18, v19, v20, v21, v22, v23, v16
 QPEL_FILTER_H2  v25, v17, v18, v19, v20, v21, v22, v23, v16
 QPEL_FILTER_H   v26,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
@@ -4017,8 +4018,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, 
export=1
 subs w22, w22, #1
 b.eq 2f
 
-ldp q17, q2, [sp]
-add sp, sp, x10
+ldp q17, q2, [x11]
+add x11, x11, x10
 QPEL_FILTER_H   v24, v18, v19, v20, v21, v22, v23, v16, v17
 QPEL_FILTER_H2  v25, v18, v19, v20, v21, v22, v23, v16, v17
 QPEL_FILTER_H   v26,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
@@ -4027,8 +4028,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, 
export=1
 subs w22, w22, #1
 b.eq 2f
 
-ldp q18, q3, [sp]
-add sp, sp, x10
+ldp q18, q3, [x11]
+add x11, x11, x10
 QPEL_FILTER_H   v24, v19, v20, v21, v22, v23, v16, v17, v18
 QPEL_FILTER_H2  v25, v19, v20, v21, v22, v23, v16, v17, v18
 QPEL_FILTER_H   v26,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
@@ -4037,8 +4038,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, 
export=1
 subs w22, w22, #1
 b.eq 2f
 
-ldp q19, q4, [sp]
-add sp, sp, x10
+ldp q19, q4, [x11]
+add x11, x11, x10
 

[FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line

2024-03-25 Thread Martin Storsjö
Group the epel and qpel functions together.
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 04692aa98e..d2f2a3681f 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -381,12 +381,12 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, 
const int bit_depth)
 NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h ,_i8mm);
+NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_qpel, 0, 1, qpel_h, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h, _i8mm);
-NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv, _i8mm);
 NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, 
qpel_uni_w_hv, _i8mm);
 NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv, _i8mm);
 }
-- 
2.39.3 (Apple Git-146)



[FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions

2024-03-25 Thread Martin Storsjö
Hi,

For some time now, we have had pretty complete AArch64 NEON coverage
for the hevc decoder.

However, some of these functions require the I8MM instruction set
extension, and many of them (but not all) lack a plain NEON
version.

This patchset fills in a regular NEON version of all functions
where we have an I8MM function.

For context; the I8MM instruction set extension is a mandatory
part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
but Apple M1 and Ampere Altra don't.
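
To show why a plain NEON version matters on CPUs without I8MM, here is a minimal sketch of flag-based function selection. It only mirrors the general pattern; the flag values, the context struct and the function names are hypothetical and do not correspond to the real HEVCDSPContext setup.

```c
#include <assert.h>

#define TOY_CPU_FLAG_NEON (1 << 0)
#define TOY_CPU_FLAG_I8MM (1 << 1)

typedef int (*toy_filter_fn)(void);

/* Stand-ins for the C, NEON and I8MM implementations of one function. */
static int toy_filter_c(void)    { return 0; }
static int toy_filter_neon(void) { return 1; }
static int toy_filter_i8mm(void) { return 2; }

typedef struct ToyDSPContext {
    toy_filter_fn filter;
} ToyDSPContext;

/* Later assignments override earlier ones, so the most specific
 * implementation supported by the CPU wins. */
static void toy_dsp_init(ToyDSPContext *c, int cpu_flags)
{
    assert(c);
    c->filter = toy_filter_c;
    if (cpu_flags & TOY_CPU_FLAG_NEON)
        c->filter = toy_filter_neon;
    if (cpu_flags & TOY_CPU_FLAG_I8MM)
        c->filter = toy_filter_i8mm;
}

/* Returns the identifier of the implementation picked for these flags. */
static int toy_selected(int cpu_flags)
{
    ToyDSPContext c;

    toy_dsp_init(&c, cpu_flags);
    return c.filter();
}
```

Without the middle assignment, a NEON-only CPU such as the Apple M1 would be left with the C fallback, which is the gap the patchset closes.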

This patchset takes decoding of a 1080p HEVC clip from 402
fps to 649 fps on an Apple M1.

Patch #2 also fixes a subtle bug in the existing implementation;
two functions relied on the contents on the stack, below the
stack pointer, being untouched within a function. If a signal
gets delivered, those parts of the stack could be clobbered.

// Martin

Martin Storsjö (21):
  aarch64: hevc: Reorder a misplaced function init line
  aarch64: hevc: Don't iterate with sp in
ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm
  aarch64: hevc: Merge consecutive stores in
put_hevc_\type\()_h16_8_neon
  aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal
looping
  aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h
  aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8
  aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8
  aarch64: hevc: Split the epel_*_hv functions into two parts
  aarch64: hevc: Reorder epel_hv functions to prepare for templating
  aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm
  aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm
  aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm
  aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm
  aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8
  aarch64: hevc: Split the qpel_*_hv functions into two parts
  aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon
functions
  aarch64: hevc: Reorder qpel_hv functions to prepare for templating
  aarch64: hevc: Produce plain neon versions of qpel_hv
  aarch64: hevc: Produce plain neon versions of qpel_uni_hv
  aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
  aarch64: hevc: Produce plain neon versions of qpel_bi_hv

 libavcodec/aarch64/hevcdsp_epel_neon.S| 1529 +++--
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   96 +-
 libavcodec/aarch64/hevcdsp_qpel_neon.S| 1804 +
 3 files changed, 2291 insertions(+), 1138 deletions(-)

-- 
2.39.3 (Apple Git-146)



Re: [FFmpeg-devel] [PATCH v2] configure: Explicitly check for static_assert

2024-03-22 Thread Martin Storsjö

On Fri, 22 Mar 2024, Andreas Rheinhardt wrote:


Martin Storsjö:


Both patches seem to work fine with MSVC 19.27 - I vaguely prefer the v2
version, which is simpler.


But to me, we could also just revert the change to
libavcodec/ccaption_dec.c, and declare that we require MSVC 19.28
instead. When MSVC 19.27 is executed with -std:c11 without -nologo, it
prints this:

    /std:c11 is a preview implementation of the ISO C11 standard, and
    we're eager to hear about bugs and suggestions for improvements.
    However, note that these features are provided as-is without support.

And I don't have any specific reasons for wanting to use this compiler -
I just tested the lowest version that was supposed to be supported
earlier and noted that it had broken recently. So to me, reverting to
requiring _Static_assert would be quite ok as well.



We can actually do both: Test for static_assert and for _Static_assert
(to exclude MSVC 19.27; is 19.28 still supposed to be a preview
implementation?).


19.28 no longer has that preview implementation banner, so from there on, 
it should be fine.



The reason I prefer static_assert in the codebase is that _Static_assert
is actually deprecated with C23 (although I don't think it will be
removed any time soon).


Ah, I see. Right, with that in mind, unifying usage to static_assert 
sounds good.


No strong opinion either way about the configure checks still (or whether 
we should require _Static_assert to be supported), except that strictly 
requiring static_assert seems less kludgy than trying to define it 
ourselves.


// Martin


Re: [FFmpeg-devel] [PATCH v2] configure: Explicitly check for static_assert

2024-03-21 Thread Martin Storsjö

On Thu, 21 Mar 2024, Andreas Rheinhardt wrote:


Andreas Rheinhardt:

C11 provides static assertions via _Static_assert and
provides static_assert as a convenience define for this
in assert.h. MSVC 19.27 declares support for C11, but does
not support _Static_assert, but somehow supports
static_assert. That's therefore what we use.

But apparently there are some old GCC toolchains where
_Static_assert is supported, but assert.h does not provide
the fallback define. Some fate boxes are affected by this
[1].

This commit therefore checks whether static_assert works
with assert.h included; if not, it errors out. Users like
the above can still add -Dstatic_assert=_Static_assert
to cflags as a workaround.
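
For illustration, a small C11 translation unit in the spirit of this check could look as follows. It is a sketch and not the configure test itself; the struct and function names are made up, and the fallback define merely imitates what -Dstatic_assert=_Static_assert would do on a toolchain whose assert.h lacks the define.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback for headers that lack the C11 convenience define; same effect
 * as passing -Dstatic_assert=_Static_assert. Skipped for C23, where
 * static_assert is a keyword. */
#if !defined(static_assert) && \
    (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 202311L)
#define static_assert _Static_assert
#endif

struct ToyFoo {
    int   a;
    void *ptr;
};

/* Evaluated at compile time; a false condition stops the build. */
static_assert(offsetof(struct ToyFoo, a) == 0,
              "First element of struct does not have offset 0");
static_assert(offsetof(struct ToyFoo, ptr) >= sizeof(int),
              "elements not properly ordered in struct");

/* Runtime mirror of the second assertion, handy for a quick test. */
static int toy_layout_ok(void)
{
    return offsetof(struct ToyFoo, ptr) >= sizeof(int);
}
```

If the file compiles, static_assert is usable with assert.h included, which is the property the check is after.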

[1]: 
https://fate.ffmpeg.org/report.cgi?time=20240321123620=sh4-debian-qemu-gcc-4.7

Signed-off-by: Andreas Rheinhardt 
---
This is what a test without fallback looks like.
Posted to gather opinions on what people prefer.

 configure | 13 +
 1 file changed, 13 insertions(+)

diff --git a/configure b/configure
index 6d7b33b0ff..c2d2c70c20 100755
--- a/configure
+++ b/configure
@@ -5589,6 +5589,19 @@ check_cxxflags_cc -std=$stdcxx ctype.h "__cplusplus >= 
201103L" ||
 check_cflags_cc -std=$stdc ctype.h "__STDC_VERSION__ >= 201112L" ||
 { check_cflags_cc -std=c11 ctype.h "__STDC_VERSION__ >= 201112L" && stdc="c11" || 
die "Compiler lacks C11 support"; }

+test_cc <
+#include 
+struct Foo {
+int a;
+void *ptr;
+} obj;
+static_assert(offsetof(struct Foo, a) == 0,
+  "First element of struct does not have offset 0");
+static_assert(offsetof(struct Foo, ptr) >= offsetof(struct Foo, a) + 
sizeof(obj.a),
+  "elements not properly ordered in struct");
+EOF
+
 check_cppflags -D_FILE_OFFSET_BITS=64
 check_cppflags -D_LARGEFILE_SOURCE



Jan has tested old toolchains and found out that his GCC 4.7 has proper
C11 headers; so this seems to be unique to Michael's setup. This makes
me prefer this patch instead of the version with the fallback. (Michael
can simply add -Dstatic_assert=_Static_assert to his cflags.)
Of course others are still invited to share their opinions.


Both patches seem to work fine with MSVC 19.27 - I vaguely prefer the v2 
version, which is simpler.



But to me, we could also just revert the change to 
libavcodec/ccaption_dec.c, and declare that we require MSVC 19.28 instead. 
When executed with -std:c11 without -nologo, MSVC 19.27 prints this:


/std:c11 is a preview implementation of the ISO C11 standard, and
we're eager to hear about bugs and suggestions for improvements.
However, note that these features are provided as-is without support.

And I don't have any specific reasons for wanting to use this compiler - I 
just tested the lowest version that was supposed to be supported earlier 
and noted that it had broken recently. So to me, reverting to requiring 
_Static_assert would be quite ok as well.


// Martin



Re: [FFmpeg-devel] duplicate symbol '_dec_init' in: fftools/ffmpeg_dec.o

2024-03-18 Thread Martin Storsjö

On Sun, 17 Mar 2024, Rémi Denis-Courmont wrote:

Obviously not. Imported libraries are only there to resolve missing 
symbols.


Sure - but if resolving the missing symbols brings in those conflicting 
object files, there's not much to do about it. If the static library 
contains dec_init in a standalone object file that nothing references, 
then sure, it won't be an issue. But if linking libbr brings in the object 
file that defines that symbol, we can't get around it.


Example:

$ cat mylib.h
void mylib_func(void);
$ cat mylib.c
#include "mylib.h"
void mylib_func(void) { }
void dec_init(void) { }
$ cat main.c
#include "mylib.h"

void dec_init(void) { }

int main(int argc, char **argv) {
mylib_func();
return 0;
}
$ gcc -c mylib.c
$ ar rcs libmylib.a mylib.o
$ gcc -c main.c
$ gcc main.o -o main -L. -lmylib
/usr/bin/ld: ./libmylib.a(mylib.o): in function `dec_init':
mylib.c:(.text+0xb): multiple definition of `dec_init'; 
main.o:main.c:(.text+0x0): first defined here

collect2: error: ld returned 1 exit status

I don't see what you propose that the FFmpeg build system should do 
differently to get around this issue, other than libbr not exposing global 
symbols outside of their namespace.


// Martin


Re: [FFmpeg-devel] [PATCH] configure: Remove av_restrict

2024-03-15 Thread Martin Storsjö

On Sun, 10 Mar 2024, Andreas Rheinhardt wrote:


All versions of MSVC that support C11 (namely >= v19.27)
also support the restrict keyword, therefore av_restrict
is no longer necessary since 75697836b1db3e0f0a3b7061be6be28d00c675a0.

Signed-off-by: Andreas Rheinhardt 
---
Untested except via godbolt.
MSVC actually uses it for optimizations: https://godbolt.org/z/3EzPnff9T
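
As a reminder of what the plain keyword buys, here is a tiny C99 sketch with restrict-qualified parameters. The function is invented for this example and is not taken from FFmpeg or from the godbolt link.

```c
#include <assert.h>

/* restrict promises the compiler that dst and src do not overlap, so it
 * may keep values in registers or vectorize without reloading src. */
static void toy_add_scaled(float *restrict dst, const float *restrict src,
                           float scale, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] += scale * src[i];
}

/* Small driver with non-overlapping buffers, as the contract requires. */
static float toy_add_scaled_demo(void)
{
    float dst[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float src[4] = { 1.0f, 1.0f, 1.0f, 1.0f };

    toy_add_scaled(dst, src, 2.0f, 4);
    return dst[3];
}
```

Any compiler that accepts this spelling directly has no need for a wrapper macro such as av_restrict.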


This change looks good overall, thanks! Fate runs successfully both with 
an old version of MSVC targeting x86_64 and a new one targeting aarch64.


However, MSVC 19.27 (aka 2019 16.7) can't successfully build ffmpeg at the 
moment - it regressed in ec1b6e0cd404b2f7f4d202802b1c0a40d52fc9b0. Now 
building fails with this error:


src/libavcodec/ccaption_dec.c(186): error C2143: syntax error: missing ')' 
before 'sizeof'
src/libavcodec/ccaption_dec.c(186): error C2143: syntax error: missing '{' 
before 'sizeof'
src/libavcodec/ccaption_dec.c(186): error C2059: syntax error: 'sizeof'

This issue is not present with the following version, MSVC 2019 16.8 
(aka 19.28) though.



Btw: The block about __declspec(restrict) was always unneeded
for FFmpeg due to 17fad33f81c7e9787fcdc17934fc1eee6c6aa4bf.
It came from Libav commit 17fad33f81c7e9787fcdc17934fc1eee6c6aa4bf.


This looks like a copypaste typo, I presume the latter should have been 
0cff125200ab53fa3ae70d85b4f614f269fe3426. (The code it changed originated 
in dfa559bcbd41397b3408c59d016631c7c65e320f in libav.)


// Martin



Re: [FFmpeg-devel] [PATCH 1/4] aarch64: Fix ff_hevc_put_hevc_epel_h48_8_neon_i8mm

2024-03-14 Thread Martin Storsjö

On Thu, 14 Mar 2024, J. Dekker wrote:



Martin Storsjö  writes:


The first 32 elements of each row were correct, while the
last 16 were scrambled.

This hasn't been noticed, because the checkasm test erroneously
only checked half of the output (for 8 bit functions), and
apparently none of the samples as part of "fate-hevc" seem to
trigger this specific function.
---
 libavcodec/aarch64/hevcdsp_epel_neon.S | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)


Thanks for the fixes, wonder if we should use checkasm_check()
exclusively in checkasm rather than memcmp(), would probably be useful.


Wherever it makes sense and works, then yes, using checkasm_check() 
probably is useful. (Within dav1d, we use it in most tests except for a 
few.)


FWIW, many checkasm tests seem to have pretty naive setups, where e.g. all 
rows are tightly packed. If they'd use a bigger stride with more padding 
between rows, one can also detect some other cases of potential asm bugs.
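
A toy C example of the padding point: a function that writes one element too many per row goes unnoticed with tightly packed rows, but is caught once the stride leaves padding that gets compared as well. The names and sizes are made up and this is not checkasm code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_W = 8, TOY_H = 4, TOY_PAD = 8 };

/* Fills a w x h block; "overwrite" extra bytes per row model an asm bug. */
static void toy_fill(uint8_t *dst, ptrdiff_t stride, int w, int h,
                     int overwrite)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w + overwrite; x++)
            dst[y * stride + x] = (uint8_t)(x + 1);
}

/* Returns 1 if comparing stride * h bytes exposes the overwrite. */
static int toy_detects_overwrite(ptrdiff_t stride)
{
    uint8_t ref[TOY_H * (TOY_W + TOY_PAD) + 1] = { 0 };
    uint8_t out[TOY_H * (TOY_W + TOY_PAD) + 1] = { 0 };

    assert(stride >= TOY_W && stride <= TOY_W + TOY_PAD);
    toy_fill(ref, stride, TOY_W, TOY_H, 0); /* reference: exact width */
    toy_fill(out, stride, TOY_W, TOY_H, 1); /* buggy: one byte too many */
    return memcmp(ref, out, (size_t)stride * TOY_H) != 0;
}
```

With stride equal to the width, each stray byte is overwritten by the start of the next row (and the last one falls outside the compared area), so the buffers match; with a padded stride the stray bytes land in the padding and the comparison fails.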



Pushed set


Thanks!

// Martin


[FFmpeg-devel] [PATCH 4/4] checkasm: hevc_pel: Use checkasm_check for printing failing output

2024-03-12 Thread Martin Storsjö
This simplifies the code for checking the output, and can print
the failing output (including a map of matching/mismatching
elements) if checkasm is run with the -v/--verbose option.
---
 tests/checkasm/hevc_pel.c | 71 ++-
 1 file changed, 41 insertions(+), 30 deletions(-)

diff --git a/tests/checkasm/hevc_pel.c b/tests/checkasm/hevc_pel.c
index 73a4619978..ed22ec4f9d 100644
--- a/tests/checkasm/hevc_pel.c
+++ b/tests/checkasm/hevc_pel.c
@@ -36,6 +36,15 @@ static const int offsets[] = {0, 255, -1 };
 #define SIZEOF_PIXEL ((bit_depth + 7) / 8)
 #define BUF_SIZE (2 * MAX_PB_SIZE * (2 * 4 + MAX_PB_SIZE))
 
+#define checkasm_check_pixel(buf1, stride1, buf2, stride2, ...) \
+((bit_depth > 8) ?  \
+ checkasm_check(uint16_t, (const uint16_t*)buf1, stride1,   \
+  (const uint16_t*)buf2, stride2,   \
+  __VA_ARGS__) :\
+ checkasm_check(uint8_t,  (const uint8_t*) buf1, stride1,   \
+  (const uint8_t*) buf2, stride2,   \
+  __VA_ARGS__))
+
 #define randomize_buffers()  \
 do { \
 uint32_t mask = pixel_mask[bit_depth - 8];   \
@@ -78,7 +87,7 @@ static void checkasm_check_hevc_qpel(void)
 LOCAL_ALIGNED_32(uint8_t, dst1, [BUF_SIZE]);
 
 HEVCDSPContext h;
-int size, bit_depth, i, j, row;
+int size, bit_depth, i, j;
 declare_func(void, int16_t *dst, uint8_t *src, ptrdiff_t srcstride,
  int height, intptr_t mx, intptr_t my, int width);
 
@@ -102,12 +111,9 @@ static void checkasm_check_hevc_qpel(void)
 randomize_buffers();
 call_ref(dstw0, src0, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 call_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
-for (row = 0; row < size[sizes]; row++) {
-if (memcmp(dstw0 + row * MAX_PB_SIZE,
-   dstw1 + row * MAX_PB_SIZE,
-   sizes[size] * sizeof(int16_t)))
-fail();
-}
+checkasm_check(int16_t, dstw0, MAX_PB_SIZE * 
sizeof(int16_t),
+dstw1, MAX_PB_SIZE * 
sizeof(int16_t),
+size[sizes], size[sizes], 
"dst");
 bench_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 }
 }
@@ -152,8 +158,9 @@ static void checkasm_check_hevc_qpel_uni(void)
 call_new(dst1, sizes[size] * SIZEOF_PIXEL,
  src1, sizes[size] * SIZEOF_PIXEL,
  sizes[size], i, j, sizes[size]);
-if (memcmp(dst0, dst1, sizes[size] * sizes[size] * 
SIZEOF_PIXEL))
-fail();
+checkasm_check_pixel(dst0, sizes[size] * SIZEOF_PIXEL,
+ dst1, sizes[size] * SIZEOF_PIXEL,
+ size[sizes], size[sizes], "dst");
 bench_new(dst1, sizes[size] * SIZEOF_PIXEL,
   src1, sizes[size] * SIZEOF_PIXEL,
   sizes[size], i, j, sizes[size]);
@@ -204,8 +211,9 @@ static void checkasm_check_hevc_qpel_uni_w(void)
 call_new(dst1, sizes[size] * SIZEOF_PIXEL,
  src1, sizes[size] * SIZEOF_PIXEL,
  sizes[size], *denom, *wx, *ox, i, 
j, sizes[size]);
-if (memcmp(dst0, dst1, sizes[size] * 
sizes[size] * SIZEOF_PIXEL))
-fail();
+checkasm_check_pixel(dst0, sizes[size] * 
SIZEOF_PIXEL,
+ dst1, sizes[size] * 
SIZEOF_PIXEL,
+ size[sizes], 
size[sizes], "dst");
 bench_new(dst1, sizes[size] * SIZEOF_PIXEL,
   src1, sizes[size] * SIZEOF_PIXEL,
   sizes[size], *denom, *wx, *ox, 
i, j, sizes[size]);
@@ -258,8 +266,9 @@ static void checkasm_check_hevc_qpel_bi(void)
 call_new(dst1, sizes[size] * SIZEOF_PIXEL,
  src1, sizes[size] * SIZEOF_PIXEL,
  ref1, sizes[size], i, j, sizes[size]);
-if (memcmp(dst0, dst1, sizes[size] * sizes[size] * 
SIZEOF_PIXEL))
-  

[FFmpeg-devel] [PATCH 3/4] checkasm: hevc_pel: Split a couple excessively long lines

2024-03-12 Thread Martin Storsjö
---
 tests/checkasm/hevc_pel.c | 134 --
 1 file changed, 98 insertions(+), 36 deletions(-)

diff --git a/tests/checkasm/hevc_pel.c b/tests/checkasm/hevc_pel.c
index 065da87622..73a4619978 100644
--- a/tests/checkasm/hevc_pel.c
+++ b/tests/checkasm/hevc_pel.c
@@ -96,13 +96,16 @@ static void checkasm_check_hevc_qpel(void)
 case 3: type = "qpel_hv"; break; // 1 1
 }
 
-if (check_func(h.put_hevc_qpel[size][j][i], 
"put_hevc_%s%d_%d", type, sizes[size], bit_depth)) {
+if (check_func(h.put_hevc_qpel[size][j][i],
+   "put_hevc_%s%d_%d", type, sizes[size], 
bit_depth)) {
 int16_t *dstw0 = (int16_t *) dst0, *dstw1 = (int16_t 
*) dst1;
 randomize_buffers();
 call_ref(dstw0, src0, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 call_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 for (row = 0; row < size[sizes]; row++) {
-if (memcmp(dstw0 + row * MAX_PB_SIZE, dstw1 + row 
* MAX_PB_SIZE, sizes[size] * sizeof(int16_t)))
+if (memcmp(dstw0 + row * MAX_PB_SIZE,
+   dstw1 + row * MAX_PB_SIZE,
+   sizes[size] * sizeof(int16_t)))
 fail();
 }
 bench_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
@@ -140,13 +143,20 @@ static void checkasm_check_hevc_qpel_uni(void)
 case 3: type = "qpel_uni_hv"; break; // 1 1
 }
 
-if (check_func(h.put_hevc_qpel_uni[size][j][i], 
"put_hevc_%s%d_%d", type, sizes[size], bit_depth)) {
+if (check_func(h.put_hevc_qpel_uni[size][j][i],
+   "put_hevc_%s%d_%d", type, sizes[size], 
bit_depth)) {
 randomize_buffers();
-call_ref(dst0, sizes[size] * SIZEOF_PIXEL, src0, 
sizes[size] * SIZEOF_PIXEL, sizes[size], i, j, sizes[size]);
-call_new(dst1, sizes[size] * SIZEOF_PIXEL, src1, 
sizes[size] * SIZEOF_PIXEL, sizes[size], i, j, sizes[size]);
+call_ref(dst0, sizes[size] * SIZEOF_PIXEL,
+ src0, sizes[size] * SIZEOF_PIXEL,
+ sizes[size], i, j, sizes[size]);
+call_new(dst1, sizes[size] * SIZEOF_PIXEL,
+ src1, sizes[size] * SIZEOF_PIXEL,
+ sizes[size], i, j, sizes[size]);
 if (memcmp(dst0, dst1, sizes[size] * sizes[size] * 
SIZEOF_PIXEL))
 fail();
-bench_new(dst1, sizes[size] * SIZEOF_PIXEL, src1, 
sizes[size] * SIZEOF_PIXEL, sizes[size], i, j, sizes[size]);
+bench_new(dst1, sizes[size] * SIZEOF_PIXEL,
+  src1, sizes[size] * SIZEOF_PIXEL,
+  sizes[size], i, j, sizes[size]);
 }
 }
 }
@@ -182,16 +192,23 @@ static void checkasm_check_hevc_qpel_uni_w(void)
 case 3: type = "qpel_uni_w_hv"; break; // 1 1
 }
 
-if (check_func(h.put_hevc_qpel_uni_w[size][j][i], 
"put_hevc_%s%d_%d", type, sizes[size], bit_depth)) {
+if (check_func(h.put_hevc_qpel_uni_w[size][j][i],
+   "put_hevc_%s%d_%d", type, sizes[size], 
bit_depth)) {
 for (denom = denoms; *denom >= 0; denom++) {
 for (wx = weights; *wx >= 0; wx++) {
 for (ox = offsets; *ox >= 0; ox++) {
 randomize_buffers();
-call_ref(dst0, sizes[size] * SIZEOF_PIXEL, 
src0, sizes[size] * SIZEOF_PIXEL, sizes[size], *denom, *wx, *ox, i, j, 
sizes[size]);
-call_new(dst1, sizes[size] * SIZEOF_PIXEL, 
src1, sizes[size] * SIZEOF_PIXEL, sizes[size], *denom, *wx, *ox, i, j, 
sizes[size]);
+call_ref(dst0, sizes[size] * SIZEOF_PIXEL,
+ src0, sizes[size] * SIZEOF_PIXEL,
+ sizes[size], *denom, *wx, *ox, i, 
j, sizes[size]);
+call_new(dst1, sizes[size] * SIZEOF_PIXEL,
+ src1, sizes[size] * SIZEOF_PIXEL,
+ sizes[size], *denom, *wx, *ox, i, 
j, sizes[size]);
 if (memcmp(dst0, dst1, sizes[size] * 
sizes[size] * 

[FFmpeg-devel] [PATCH 2/4] checkasm: hevc_pel: Check the full output in hevc_epel/hevc_qpel

2024-03-12 Thread Martin Storsjö
Previously it only checked half the output in 8 bit per pixel mode,
as the output actually is 16 bit elements here.
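
The effect of the wrong element size can be reproduced with a few lines of C. This is a simplified sketch with hypothetical names; only the SIZEOF_PIXEL formula is taken from the test file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same formula as SIZEOF_PIXEL in the test, with bit_depth as a parameter. */
#define TOY_SIZEOF_PIXEL(bit_depth) (((bit_depth) + 7) / 8)

enum { TOY_WIDTH = 16 };

/* Returns 1 if a mismatch in the last element of a row of int16_t
 * outputs is noticed when comparing TOY_WIDTH * elem_size bytes. */
static int toy_detects_mismatch(size_t elem_size)
{
    int16_t ref[TOY_WIDTH] = { 0 };
    int16_t out[TOY_WIDTH] = { 0 };

    assert(elem_size <= sizeof(int16_t));
    out[TOY_WIDTH - 1] = 1; /* error in the second half of the row */
    return memcmp(ref, out, TOY_WIDTH * elem_size) != 0;
}
```

With 8 bit pixels, TOY_SIZEOF_PIXEL(8) is 1, so only the first 16 bytes (8 of the 16 elements) are compared and the error slips through; with sizeof(int16_t) the whole row is covered.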
---
 tests/checkasm/hevc_pel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/checkasm/hevc_pel.c b/tests/checkasm/hevc_pel.c
index f9a7a7717c..065da87622 100644
--- a/tests/checkasm/hevc_pel.c
+++ b/tests/checkasm/hevc_pel.c
@@ -102,7 +102,7 @@ static void checkasm_check_hevc_qpel(void)
 call_ref(dstw0, src0, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 call_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 for (row = 0; row < size[sizes]; row++) {
-if (memcmp(dstw0 + row * MAX_PB_SIZE, dstw1 + row 
* MAX_PB_SIZE, sizes[size] * SIZEOF_PIXEL))
+if (memcmp(dstw0 + row * MAX_PB_SIZE, dstw1 + row 
* MAX_PB_SIZE, sizes[size] * sizeof(int16_t)))
 fail();
 }
 bench_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
@@ -334,7 +334,7 @@ static void checkasm_check_hevc_epel(void)
 call_ref(dstw0, src0, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 call_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
 for (row = 0; row < size[sizes]; row++) {
-if (memcmp(dstw0 + row * MAX_PB_SIZE, dstw1 + row 
* MAX_PB_SIZE, sizes[size] * SIZEOF_PIXEL))
+if (memcmp(dstw0 + row * MAX_PB_SIZE, dstw1 + row 
* MAX_PB_SIZE, sizes[size] * sizeof(int16_t)))
 fail();
 }
 bench_new(dstw1, src1, sizes[size] * SIZEOF_PIXEL, 
sizes[size], i, j, sizes[size]);
-- 
2.39.3 (Apple Git-146)



[FFmpeg-devel] [PATCH 1/4] aarch64: Fix ff_hevc_put_hevc_epel_h48_8_neon_i8mm

2024-03-12 Thread Martin Storsjö
The first 32 elements of each row were correct, while the
last 16 were scrambled.

This hasn't been noticed, because the checkasm test erroneously
only checked half of the output (for 8 bit functions), and
apparently none of the samples as part of "fate-hevc" seem to
trigger this specific function.
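
The scrambling can be reproduced with a scalar C model of the instructions involved. This is a sketch under one assumption: that each of the four usdot result registers holds every fourth output (v20 the outputs 0, 4, 8, 12, v21 the outputs 1, 5, 9, 13, and so on), which is what the ext #1..#3 inputs and the st4 store of the first 32 elements suggest. The toy_ helpers are invented stand-ins for zip1/zip2, xtn+xtn2 and st2.

```c
#include <assert.h>
#include <stdint.h>

/* zip1/zip2 on 4 x 32 bit lanes: interleave the low or high halves. */
static void toy_zip1(int32_t d[4], const int32_t a[4], const int32_t b[4])
{
    d[0] = a[0]; d[1] = b[0]; d[2] = a[1]; d[3] = b[1];
}

static void toy_zip2(int32_t d[4], const int32_t a[4], const int32_t b[4])
{
    d[0] = a[2]; d[1] = b[2]; d[2] = a[3]; d[3] = b[3];
}

/* xtn + xtn2: narrow lo into lanes 0..3 and hi into lanes 4..7. */
static void toy_xtn_pair(int16_t d[8], const int32_t lo[4], const int32_t hi[4])
{
    for (int i = 0; i < 4; i++) {
        d[i]     = (int16_t)lo[i];
        d[i + 4] = (int16_t)hi[i];
    }
}

/* st2: store two registers with their elements interleaved. */
static void toy_st2(int16_t out[16], const int16_t r0[8], const int16_t r1[8])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = r0[i];
        out[2 * i + 1] = r1[i];
    }
}

/* Returns 1 if the last 16 outputs end up in order 0..15. */
static int toy_last16_in_order(int use_zip)
{
    int32_t v20[4], v21[4], v22[4], v23[4];
    int32_t v24[4], v25[4], v26[4], v27[4];
    int16_t r0[8], r1[8], out[16];

    assert(use_zip == 0 || use_zip == 1);
    for (int i = 0; i < 4; i++) {   /* label each lane with its output index */
        v20[i] = 4 * i;
        v21[i] = 4 * i + 1;
        v22[i] = 4 * i + 2;
        v23[i] = 4 * i + 3;
    }
    if (use_zip) {                  /* fixed sequence */
        toy_zip1(v24, v20, v22);
        toy_zip2(v25, v20, v22);
        toy_zip1(v26, v21, v23);
        toy_zip2(v27, v21, v23);
        toy_xtn_pair(r0, v24, v25);
        toy_xtn_pair(r1, v26, v27);
    } else {                        /* old sequence: narrow without zipping */
        toy_xtn_pair(r0, v20, v22);
        toy_xtn_pair(r1, v21, v23);
    }
    toy_st2(out, r0, r1);
    for (int i = 0; i < 16; i++)
        if (out[i] != i)
            return 0;
    return 1;
}
```

In this model the old sequence hands st2 a first register ordered 0, 4, 8, 12, 2, 6, 10, 14, whereas the zipped variant provides 0, 2, 4, ..., 14 and 1, 3, 5, ..., 15, which st2 then interleaves into the right order.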
---
 libavcodec/aarch64/hevcdsp_epel_neon.S | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 2dafa09337..d3f0a26f79 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1572,6 +1572,7 @@ function ff_hevc_put_hevc_epel_h48_8_neon_i8mm, export=1
 xtn2 v22.8h, v26.4s
 xtn v23.4h, v23.4s
 xtn2 v23.8h, v27.4s
+add x7, x0, #64
 st4 {v20.8h, v21.8h, v22.8h, v23.8h}, [x0], x10
 ext v4.16b, v2.16b, v3.16b, #1
 ext v5.16b, v2.16b, v3.16b, #2
@@ -1584,11 +1585,14 @@ function ff_hevc_put_hevc_epel_h48_8_neon_i8mm, export=1
 usdot   v21.4s, v4.16b, v30.16b
 usdot   v22.4s, v5.16b, v30.16b
 usdot   v23.4s, v6.16b, v30.16b
-xtn v20.4h, v20.4s
-xtn2 v20.8h, v22.4s
-xtn v21.4h, v21.4s
-xtn2 v21.8h, v23.4s
-add x7, x0, #64
+zip1 v24.4s, v20.4s, v22.4s
+zip2 v25.4s, v20.4s, v22.4s
+zip1 v26.4s, v21.4s, v23.4s
+zip2 v27.4s, v21.4s, v23.4s
+xtn v20.4h, v24.4s
+xtn2 v20.8h, v25.4s
+xtn v21.4h, v26.4s
+xtn2 v21.8h, v27.4s
 st2 {v20.8h, v21.8h}, [x7]
 b.ne 1b
 ret
-- 
2.39.3 (Apple Git-146)
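
The commit message above notes that the checkasm test only compared half of each output row for 8 bit functions. A minimal sketch of that effect, assuming int16_t output rows of 48 elements where the first 32 match and the last 16 are scrambled (the helper names and the scrambling pattern are made up for illustration and are not taken from checkasm):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROW_WIDTH 48 /* elements per row, as for the h48 function */

/* Returns nonzero if the first `bytes` bytes of the two rows differ. */
static int rows_differ(const int16_t *ref, const int16_t *tst, size_t bytes)
{
    return memcmp(ref, tst, bytes) != 0;
}

/* Builds a reference row and a test row that agree in the first 32
 * elements and differ in the last 16, then compares `bytes` bytes. */
static int compare_scrambled_tail(size_t bytes)
{
    int16_t ref[ROW_WIDTH], tst[ROW_WIDTH];

    for (int i = 0; i < ROW_WIDTH; i++) {
        ref[i] = (int16_t)i;
        tst[i] = (int16_t)(i < 32 ? i : ROW_WIDTH - i);
    }
    return rows_differ(ref, tst, bytes);
}
```

With a byte count of ROW_WIDTH * 1 (one byte per pixel for 8 bit) only the first 24 int16_t elements are compared, so the scrambled tail goes unnoticed; with ROW_WIDTH * sizeof(int16_t) the whole row is compared and the mismatch is reported.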



[FFmpeg-devel] [PATCH] aarch64: Factorize code for CPU feature detection on Apple platforms

2024-03-12 Thread Martin Storsjö
---
 libavutil/aarch64/cpu.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/libavutil/aarch64/cpu.c b/libavutil/aarch64/cpu.c
index 7a05391343..196bdaf6b0 100644
--- a/libavutil/aarch64/cpu.c
+++ b/libavutil/aarch64/cpu.c
@@ -45,22 +45,23 @@ static int detect_flags(void)
 #elif defined(__APPLE__) && HAVE_SYSCTLBYNAME
 #include <sys/sysctl.h>
 
+static int have_feature(const char *feature) {
+uint32_t value = 0;
+size_t size = sizeof(value);
+if (!sysctlbyname(feature, &value, &size, NULL, 0))
+return value;
+return 0;
+}
+
 static int detect_flags(void)
 {
-uint32_t value = 0;
-size_t size;
 int flags = 0;
 
-size = sizeof(value);
-if (!sysctlbyname("hw.optional.arm.FEAT_DotProd", &value, &size, NULL, 0)) {
-if (value)
-flags |= AV_CPU_FLAG_DOTPROD;
-}
-size = sizeof(value);
-if (!sysctlbyname("hw.optional.arm.FEAT_I8MM", &value, &size, NULL, 0)) {
-if (value)
-flags |= AV_CPU_FLAG_I8MM;
-}
+if (have_feature("hw.optional.arm.FEAT_DotProd"))
+flags |= AV_CPU_FLAG_DOTPROD;
+if (have_feature("hw.optional.arm.FEAT_I8MM"))
+flags |= AV_CPU_FLAG_I8MM;
+
 return flags;
 }
 
-- 
2.34.1



Re: [FFmpeg-devel] [PATCH 02/18] fftools/ffmpeg_filter: refactor setting input timebase

2024-03-11 Thread Martin Storsjö

On Mon, 11 Mar 2024, Anton Khirnov wrote:


Well it IS obsolete. AFAIK it was never a particularly popular codec,
and was only really used by the anime and ripping scenes in early 2000s,
and even they dropped it very quickly once x264 appeared.


Within the scene of mobile HW, they commonly had HW codecs for H263 and 
MPEG4 (or SW codecs), with many but not all also supporting H264. So for 
one specific generation of mobile devices, MPEG4 was the same level of 
lingua franca that H264 is today.


Obviously not a big use case today in nontrivial numbers of course, but 
it is an example of a "scene" where the codec did have a pretty broad 
adoption.


And again - that does not mean the capability should be removed, but it 
does mean that we shouldn't insist on tuning it for the smoothest user 
experience, since this time is then NOT spent doing something actually 
useful.


I guess that's true.

// Martin




Re: [FFmpeg-devel] [PATCH 02/18] fftools/ffmpeg_filter: refactor setting input timebase

2024-03-11 Thread Martin Storsjö via ffmpeg-devel

On Mon, 11 Mar 2024, Anton Khirnov wrote:

I think the point is, that one can't just dismiss that anybody would want 
to encode mpeg4 video any longer, even if it is obsolete. I also would 
like to keep being able to do that.


That capability is not going away though, and I'm not arguing that it
should.


Ok, good. The generally dismissive arguments about mpeg4 encoding being 
obsolete and something that nobody should be doing, could be interpreted 
in such a way.


That said, I haven't followed the discussion closely enough about what to 
do with the time bases.


The only change is that in some rare cases the automatically selected
timebase no longer fits into mpeg4 constraints, so the user has to
specify either the framerate or the timebase explicitly.


Right, I see.


Specifically, the commandline used by Michael involves the extremely
obscure case of converting subtitles to video (NOT hardsubbing, but
really 1 sub -> 1 video). Since subtitle encoding API is hardcoded to
AV_TIME_BASE_Q, that timebase gets used for encoding, and the mpeg4
encoder rejects it. If it was hardsubbing (i.e. 1 video + 1 sub -> 1
video), the input video timebase should be used, which would probably
work.

I don't think it's that big of a deal to require users to specify the
timebase or framerate explicitly in such a situation.
Inventing new APIs to cover it automagically seems like a waste of time,
unless somebody has actual (not potential) uses for this.


Right, I would agree with this. (If someone else would volunteer to add 
said API I would consider accepting it though.)


Is this a usecase that currently works, but would go away by getting 
rid of codec specific code in the tools, or is it a nice-to-have new extra 
feature that is being requested?


// Martin
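
The constraint discussed above can be sketched in a few lines. This is only an illustration, assuming the denominator limit of 65535 quoted in the encoder's error message is the only check; the Rational type and the function name are hypothetical stand-ins, and the real encoder may apply further conditions:

```c
#include <stdbool.h>

/* Hypothetical stand-in for AVRational, to keep the sketch self-contained. */
typedef struct {
    int num;
    int den;
} Rational;

/* Limit quoted in the mpeg4 encoder's error message. */
#define MPEG4_MAX_TB_DEN 65535

/* Returns true if the timebase denominator fits the quoted limit. */
static bool mpeg4_timebase_ok(Rational tb)
{
    return tb.num > 0 && tb.den > 0 && tb.den <= MPEG4_MAX_TB_DEN;
}
```

Under this check a timebase of 1/1000000 (the value of AV_TIME_BASE_Q) is rejected, while 1/25 or 1001/30000 pass, which is why specifying the framerate or timebase explicitly works around the failure.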



Re: [FFmpeg-devel] [PATCH 02/18] fftools/ffmpeg_filter: refactor setting input timebase

2024-03-11 Thread Martin Storsjö

On Mon, 11 Mar 2024, Anton Khirnov wrote:


Quoting Tobias Rapp (2024-03-11 11:12:38)

On 10/03/2024 23:49, Anton Khirnov wrote:


Quoting James Almer (2024-03-10 23:29:27)

On 3/10/2024 7:24 PM, Anton Khirnov wrote:

Quoting Michael Niedermayer (2024-03-10 20:21:47)

On Sun, Mar 10, 2024 at 07:13:18AM +0100, Anton Khirnov wrote:

Quoting Michael Niedermayer (2024-03-10 04:36:29)

why not automatically choose a supported timebase ?
"[mpeg4 @ 0x55973c869f00] timebase 1/100 not supported by MPEG 4 standard, the 
maximum admitted value for the timebase denominator is 65535"

Because I don't want ffmpeg CLI to have codec-specific code for a codec
that's been obsolete for 15+ years. One could also potentially do it
inside the encoder itself, but it is nontrivial since the computations
are spread across a number of places in mpeg4videoenc.c and
mpegvideo_enc.c. And again, it seems like a waste of time - there is no
reason to encode mpeg4 today.

This is not mpeg4 specific, it's just a new additional case that fails

The case you reported is mpeg4 specific.


./ffmpeg -i mm-small.mpg test.dv
[dvvideo @ 0x7f868800f100] Found no DV profile for 80x60 yuv420p video. Valid 
DV profiles are:

There is no mechanism for an encoder to export supported time bases.

Could it be added as an extension to AVProfile, or AVCodec?

The two cases are actually pretty different:
* mpeg4 has a constraint on the range of timebases, and actually does
   some perverted computations with the timestamps
* DV just needs your video to be CFR, with a list of supported
   framerates; dvenc should probably read AVCodecContext.framerate
   instead of time_base

But most importantly, is there an actual current use case for either of
those encoders? They have both been obsolete for close to two decades.
It seems silly to add new API that won't actually be useful to anyone.


Hardware doesn't get outdated as quickly as software. And there are
people that do not switch their full environment to a new codec every
decade just to be "in line".


And your point is...?


I think the point is, that one can't just dismiss that anybody would want 
to encode mpeg4 video any longer, even if it is obsolete. I also would 
like to keep being able to do that.


That said, I haven't followed the discussion closely enough about what to 
do with the time bases.


// Martin



[FFmpeg-devel] [PATCH 2/2] libavcodec: Don't include libavcodec/x86/vvc/Makefile on any architecture

2024-03-08 Thread Martin Storsjö
This currently builds files in the libavcodec/x86/{vvc,h26x}
subdirectories, which is somewhat unexpected when building for
another architecture than x86.

The regular arch subdirectories are handled with

-include $(SRC_PATH)/$(1)/$(ARCH)/Makefile

in the toplevel Makefile. Switch this to a similar optional
inclusion, using $(ARCH).
---
 libavcodec/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index 5d99120aa9..708434ac76 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -64,7 +64,7 @@ OBJS = ac3_parser.o   
  \
 
 # subsystems
 include $(SRC_PATH)/libavcodec/vvc/Makefile
-include $(SRC_PATH)/libavcodec/x86/vvc/Makefile
+-include $(SRC_PATH)/libavcodec/$(ARCH)/vvc/Makefile
 OBJS-$(CONFIG_AANDCTTABLES)+= aandcttab.o
 OBJS-$(CONFIG_AC3DSP)  += ac3dsp.o ac3.o ac3tab.o
 OBJS-$(CONFIG_ADTS_HEADER) += adts_header.o 
mpeg4audio_sample_rates.o
-- 
2.39.3 (Apple Git-145)



[FFmpeg-devel] [PATCH 1/2] makefile: Clean up missed object files with "make clean"

2024-03-08 Thread Martin Storsjö
In some builds, the following object files could be left behind
after make clean:

./libavfilter/metal/utils.o
./libavfilter/metal/vf_yadif_videotoolbox.metallib.o
./libavcodec/x86/h26x/h2656dsp.o
./libavcodec/neon/mpegvideo.o
./ffbuild/bin2c_host.o
---
 ffbuild/common.mak  | 2 +-
 libavcodec/neon/Makefile| 3 +++
 libavcodec/x86/vvc/Makefile | 2 +-
 libavfilter/Makefile| 1 +
 4 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/ffbuild/common.mak b/ffbuild/common.mak
index ac54ac0681..87a3ffd2b0 100644
--- a/ffbuild/common.mak
+++ b/ffbuild/common.mak
@@ -140,7 +140,7 @@ else
 endif
 
 clean::
-   $(RM) $(BIN2CEXE)
+   $(RM) $(BIN2CEXE) $(CLEANSUFFIXES:%=ffbuild/%)
 
 %.c %.h %.pc %.ver %.version: TAG = GEN
 
diff --git a/libavcodec/neon/Makefile b/libavcodec/neon/Makefile
index 607f116a77..83c2f0051c 100644
--- a/libavcodec/neon/Makefile
+++ b/libavcodec/neon/Makefile
@@ -1 +1,4 @@
+clean::
+   $(RM) $(CLEANSUFFIXES:%=libavcodec/neon/%)
+
 OBJS-$(CONFIG_MPEGVIDEO)  += neon/mpegvideo.o
diff --git a/libavcodec/x86/vvc/Makefile b/libavcodec/x86/vvc/Makefile
index 82f281d1c7..d1623bd46a 100644
--- a/libavcodec/x86/vvc/Makefile
+++ b/libavcodec/x86/vvc/Makefile
@@ -1,5 +1,5 @@
 clean::
-   $(RM) $(CLEANSUFFIXES:%=libavcodec/x86/vvc/%)
+   $(RM) $(CLEANSUFFIXES:%=libavcodec/x86/vvc/%) 
$(CLEANSUFFIXES:%=libavcodec/x86/h26x/%)
 
 OBJS-$(CONFIG_VVC_DECODER) += x86/vvc/vvcdsp_init.o \
   x86/h26x/h2656dsp.o
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index f6c1d641d6..994d9773ba 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -666,6 +666,7 @@ TOOLS-$(CONFIG_LIBZMQ) += zmqsend
 
 clean::
$(RM) $(CLEANSUFFIXES:%=libavfilter/dnn/%) 
$(CLEANSUFFIXES:%=libavfilter/opencl/%) \
+  $(CLEANSUFFIXES:%=libavfilter/metal/%) \
   $(CLEANSUFFIXES:%=libavfilter/vulkan/%)
 
 OPENCL = $(subst $(SRC_PATH)/,,$(wildcard $(SRC_PATH)/libavfilter/opencl/*.cl))
-- 
2.39.3 (Apple Git-145)



[FFmpeg-devel] [PATCH] libavdevice: Fix the avfoundation device after switching to FFInputFormat

2024-03-08 Thread Martin Storsjö
This was missed in b800327f4c7233d09baca958121722a04c2035ff.
---
 libavdevice/avfoundation.m | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/libavdevice/avfoundation.m b/libavdevice/avfoundation.m
index a0ef87edff..d9b17ccdae 100644
--- a/libavdevice/avfoundation.m
+++ b/libavdevice/avfoundation.m
@@ -32,6 +32,7 @@
 #include "libavutil/pixdesc.h"
 #include "libavutil/opt.h"
 #include "libavutil/avstring.h"
+#include "libavformat/demux.h"
 #include "libavformat/internal.h"
 #include "libavutil/internal.h"
 #include "libavutil/parseutils.h"
@@ -1292,13 +1293,13 @@ static int avf_close(AVFormatContext *s)
 .category   = AV_CLASS_CATEGORY_DEVICE_VIDEO_INPUT,
 };
 
-const AVInputFormat ff_avfoundation_demuxer = {
-.name   = "avfoundation",
-.long_name  = NULL_IF_CONFIG_SMALL("AVFoundation input device"),
+const FFInputFormat ff_avfoundation_demuxer = {
+.p.name = "avfoundation",
+.p.long_name= NULL_IF_CONFIG_SMALL("AVFoundation input device"),
+.p.flags= AVFMT_NOFILE,
+.p.priv_class   = _class,
 .priv_data_size = sizeof(AVFContext),
 .read_header= avf_read_header,
 .read_packet= avf_read_packet,
 .read_close = avf_close,
-.flags  = AVFMT_NOFILE,
-.priv_class = _class,
 };
-- 
2.39.3 (Apple Git-145)



Re: [FFmpeg-devel] [PATCH] lavc/aarch64/fdct: add neon-optimized fdct for aarch64

2024-03-06 Thread Martin Storsjö

On Wed, 6 Mar 2024, Ramiro Polla wrote:


ping



Did you miss my response here? 
https://ffmpeg.org/pipermail/ffmpeg-devel/2024-February/321448.html


// Martin



Re: [FFmpeg-devel] [PATCH] aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux

2024-03-02 Thread Martin Storsjö

On Wed, 28 Feb 2024, Martin Storsjö wrote:


The CPU feature detection was added in
493fcde50a84cb23854335bcb0e55c6f383d55db, using HWCAP_CPUID.

The argument for using that, was that HWCAP_CPUID was added much
earlier in the kernel (in Linux v4.11), while the HWCAP flags for
individual features were added much later. And if compiling with
older userland headers that lack the bits for e.g. HWCAP_I8MM, we
wouldn't be able to detect that feature.

(In practice, e.g. Ubuntu 20.04 lacks HWCAP_I8MM in userland
headers, but the toolchain does support assembling such
instructions).

However, while the flag HWCAP_I8MM was added only in Linux v5.10,
any CPU with that feature is most likely running a kernel that is
newer than that as well. So by using HWCAP_CPUID, we could detect
that feature on kernels between v4.11 and v5.10, but that is a
quite unlikely case in practice.

By using regular hwcaps flags, the code is much simplified, and
doesn't rely on inline assembly to read the cpu id registers.

And instead of requiring the userland headers to provide the
definitions of the hwcap flags, provide our own definitions of the
constants (they are fixed constants anyway), with names not conflicting
with the ones from system headers. This avoids a number of ifdefs, and
allows detecting these features even if building with userland headers
that don't contain these definitions yet.

Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
do expose these features via HWCAP flags, but the emulated cpuid
registers are missing the bits for exposing e.g. I8MM.
---
libavutil/aarch64/cpu.c | 30 --
1 file changed, 8 insertions(+), 22 deletions(-)


Will apply on Monday, if there are no objections.

// Martin


Re: [FFmpeg-devel] [PATCH v4] avcodec/aarch64/hevc: add luma deblock NEON

2024-02-28 Thread Martin Storsjö

On Wed, 28 Feb 2024, J. Dekker wrote:



Martin Storsjö  writes:


On Wed, 28 Feb 2024, J. Dekker wrote:



Martin Storsjö  writes:


On Tue, 27 Feb 2024, J. Dekker wrote:


Benched using single-threaded full decode on an Ampere Altra.

Bpp Before  After  Speedup
8   73,3s   65,2s  1.124x
10  114,2s  104,0s 1.098x
12  125,8s  115,7s 1.087x

Signed-off-by: J. Dekker 
---

Slightly improved 12bit version.

libavcodec/aarch64/hevcdsp_deblock_neon.S | 417 ++
libavcodec/aarch64/hevcdsp_init_aarch64.c |  18 +
2 files changed, 435 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_deblock_neon.S 
b/libavcodec/aarch64/hevcdsp_deblock_neon.S
index 8227f65649..581056a91e 100644
--- a/libavcodec/aarch64/hevcdsp_deblock_neon.S
+++ b/libavcodec/aarch64/hevcdsp_deblock_neon.S
@@ -181,3 +181,420 @@ hevc_h_loop_filter_chroma 12
hevc_v_loop_filter_chroma 8
hevc_v_loop_filter_chroma 10
hevc_v_loop_filter_chroma 12
+
+.macro hevc_loop_filter_luma_body bitdepth
+function hevc_loop_filter_luma_body_\bitdepth\()_neon, export=0
+.if \bitdepth > 8
+lsl w2, w2, #(\bitdepth - 8) // beta <<= BIT_DEPTH - 8
+.else
+uxtl v0.8h, v0.8b
+uxtl v1.8h, v1.8b
+uxtl v2.8h, v2.8b
+uxtl v3.8h, v3.8b
+uxtl v4.8h, v4.8b
+uxtl v5.8h, v5.8b
+uxtl v6.8h, v6.8b
+uxtl v7.8h, v7.8b
+.endif
+ldr w7, [x3] // tc[0]
+ldr w8, [x3, #4] // tc[1]
+dup v18.4h, w7
+dup v19.4h, w8
+trn1 v18.2d, v18.2d, v19.2d
+.if \bitdepth > 8
+shl v18.8h, v18.8h, #(\bitdepth - 8)
+.endif
+dup v27.8h, w2 // beta
+// tc25
+shl v19.8h, v18.8h, #2 // * 4
+add v19.8h, v19.8h, v18.8h // (tc * 5)
+srshr   v19.8h, v19.8h, #1 // (tc * 5 + 1) >> 1
+sshr v17.8h, v27.8h, #2 // beta2
+
+// beta_2 check
+// dp0  = abs(P2  - 2 * P1  + P0)
+add v22.8h, v3.8h, v1.8h
+shl v23.8h, v2.8h, #1
+sabd v30.8h, v22.8h, v23.8h
+// dq0  = abs(Q2  - 2 * Q1  + Q0)
+add v21.8h, v6.8h, v4.8h
+shl v26.8h, v5.8h, #1
+sabd v31.8h, v21.8h, v26.8h
+// d0   = dp0 + dq0
+add v20.8h, v30.8h, v31.8h
+shl v25.8h, v20.8h, #1
+// (d0 << 1) < beta_2
+cmgt v23.8h, v17.8h, v25.8h
+
+// beta check
+// d0 + d3 < beta
+mov x9, #0x
+dup v24.2d, x9
+and v25.16b, v24.16b, v20.16b
+addp v25.8h, v25.8h, v25.8h // 1+0 0+1 1+0 0+1
+addp v25.4h, v25.4h, v25.4h // 1+0+0+1 1+0+0+1
+cmgt v25.4h, v27.4h, v25.4h // lower/upper mask in h[0/1]
+mov w9, v25.s[0]


I don't quite understand what this sequence does and/or how our data is laid
out in our registers - we have d0 on input in v20, where's d3? And doesn't the
"and" throw away half of the input elements here?

I see some similar patterns with the masking and handling below as well - I get
a feeling that I don't quite understand the algorithm here, and/or the data
layout.


We have d0, d1, d2, d3 for both 4 line blocks in v20, mask out d1/d2 and
use pair-wise adds to move our data around and calculate d0+d3
together. The first addp just moves elements around, the second addp
adds d0 + 0 + 0 + d3.


Right, I guess this is the bit that was surprising. I would have expected to
have e.g. all the d0 values for e.g. the 8 individual pixels in one SIMD
register, and all the d3 values for all pixels in another SIMD register.

So as we're operating on 8 pixels in parallel, each of those 8 pixels have
their own d0/d3 values, right? Or is this a case where we have just one d0/d3
value for a range of pixels?


Yes, d0/d1/d2/d3 are per 4 lines of 8 pixels, it's because d0 and d3 are
calculated within their own line, d0 from line 0, d3 from line 3. Maybe
it's more confusing since we are doing both halves of the filter at the
same time? v20 contains d0 d1 d2 d3 d0 d1 d2 d3, where the second d0 is
distinct from the first.

But essentially we're doing the same operation across the entire 8
lines, the filter just makes an overall skip decision for each block of
4 lines based on the sum of the result from line 0 and 3.


Ah, right, I see. I guess this makes sense then. Thanks!

Thus, no further objections to it; the optimizing of loading/storing can 
be done separately.


// Martin
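
The data layout explained in this thread can be modelled in scalar C. This is a rough sketch, assuming v20 holds d0 d1 d2 d3 for the first block of 4 lines followed by the same for the second block; the mask constant is cut off in the archived patch, so the lanes kept here are inferred from the "mask out d1/d2" explanation, and the function names are made up for illustration:

```c
#include <stdint.h>

/* Scalar model of ADDP on 16-bit lanes with both source operands equal:
 * adjacent lanes are summed, and because the same register is used twice
 * the sums repeat in the upper half of the result. */
static void addp_self(const int16_t *in, int16_t *out, int lanes)
{
    int16_t tmp[8];

    for (int i = 0; i < lanes / 2; i++)
        tmp[i] = tmp[i + lanes / 2] = (int16_t)(in[2 * i] + in[2 * i + 1]);
    for (int i = 0; i < lanes; i++)
        out[i] = tmp[i];
}

/* d[] holds d0 d1 d2 d3 for the first block, then for the second block.
 * pass[n] is set to 1 if d0 + d3 < beta for block n, else 0. */
static void beta_check(const int16_t d[8], int16_t beta, int pass[2])
{
    int16_t masked[8], step1[8], step2[4];

    /* "and" with the mask: keep d0 and d3, zero d1 and d2 */
    for (int i = 0; i < 8; i++)
        masked[i] = (i % 4 == 0 || i % 4 == 3) ? d[i] : 0;

    addp_self(masked, step1, 8); /* d0+0, 0+d3 per block */
    addp_self(step1, step2, 4);  /* d0+d3 per block */

    pass[0] = beta > step2[0];
    pass[1] = beta > step2[1];
}
```

The first pairwise add only moves d0 and d3 next to each other (each is added to a zeroed lane), and the second one produces one d0 + d3 sum per block, which matches the "1+0 0+1" and "1+0+0+1" comments in the assembly.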

Re: [FFmpeg-devel] [PATCH v4] avcodec/aarch64/hevc: add luma deblock NEON

2024-02-28 Thread Martin Storsjö

On Wed, 28 Feb 2024, J. Dekker wrote:



Martin Storsjö  writes:


On Tue, 27 Feb 2024, J. Dekker wrote:


Benched using single-threaded full decode on an Ampere Altra.

Bpp Before  After  Speedup
8   73,3s   65,2s  1.124x
10  114,2s  104,0s 1.098x
12  125,8s  115,7s 1.087x

Signed-off-by: J. Dekker 
---

Slightly improved 12bit version.

libavcodec/aarch64/hevcdsp_deblock_neon.S | 417 ++
libavcodec/aarch64/hevcdsp_init_aarch64.c |  18 +
2 files changed, 435 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_deblock_neon.S 
b/libavcodec/aarch64/hevcdsp_deblock_neon.S
index 8227f65649..581056a91e 100644
--- a/libavcodec/aarch64/hevcdsp_deblock_neon.S
+++ b/libavcodec/aarch64/hevcdsp_deblock_neon.S
@@ -181,3 +181,420 @@ hevc_h_loop_filter_chroma 12
hevc_v_loop_filter_chroma 8
hevc_v_loop_filter_chroma 10
hevc_v_loop_filter_chroma 12
+
+.macro hevc_loop_filter_luma_body bitdepth
+function hevc_loop_filter_luma_body_\bitdepth\()_neon, export=0
+.if \bitdepth > 8
+lsl w2, w2, #(\bitdepth - 8) // beta <<= BIT_DEPTH - 8
+.else
+uxtl v0.8h, v0.8b
+uxtl v1.8h, v1.8b
+uxtl v2.8h, v2.8b
+uxtl v3.8h, v3.8b
+uxtl v4.8h, v4.8b
+uxtl v5.8h, v5.8b
+uxtl v6.8h, v6.8b
+uxtl v7.8h, v7.8b
+.endif
+ldr w7, [x3] // tc[0]
+ldr w8, [x3, #4] // tc[1]
+dup v18.4h, w7
+dup v19.4h, w8
+trn1 v18.2d, v18.2d, v19.2d
+.if \bitdepth > 8
+shl v18.8h, v18.8h, #(\bitdepth - 8)
+.endif
+dup v27.8h, w2 // beta
+// tc25
+shl v19.8h, v18.8h, #2 // * 4
+add v19.8h, v19.8h, v18.8h // (tc * 5)
+srshr   v19.8h, v19.8h, #1 // (tc * 5 + 1) >> 1
+sshr v17.8h, v27.8h, #2 // beta2
+
+// beta_2 check
+// dp0  = abs(P2  - 2 * P1  + P0)
+add v22.8h, v3.8h, v1.8h
+shl v23.8h, v2.8h, #1
+sabd v30.8h, v22.8h, v23.8h
+// dq0  = abs(Q2  - 2 * Q1  + Q0)
+add v21.8h, v6.8h, v4.8h
+shl v26.8h, v5.8h, #1
+sabd v31.8h, v21.8h, v26.8h
+// d0   = dp0 + dq0
+add v20.8h, v30.8h, v31.8h
+shl v25.8h, v20.8h, #1
+// (d0 << 1) < beta_2
+cmgt v23.8h, v17.8h, v25.8h
+
+// beta check
+// d0 + d3 < beta
+mov x9, #0x
+dup v24.2d, x9
+and v25.16b, v24.16b, v20.16b
+addp v25.8h, v25.8h, v25.8h // 1+0 0+1 1+0 0+1
+addp v25.4h, v25.4h, v25.4h // 1+0+0+1 1+0+0+1
+cmgt v25.4h, v27.4h, v25.4h // lower/upper mask in h[0/1]
+mov w9, v25.s[0]


I don't quite understand what this sequence does and/or how our data is laid
out in our registers - we have d0 on input in v20, where's d3? And doesn't the
"and" throw away half of the input elements here?

I see some similar patterns with the masking and handling below as well - I get
a feeling that I don't quite understand the algorithm here, and/or the data
layout.


We have d0, d1, d2, d3 for both 4 line blocks in v20, mask out d1/d2 and
use pair-wise adds to move our data around and calculate d0+d3
together. The first addp just moves elements around, the second addp
adds d0 + 0 + 0 + d3.


Right, I guess this is the bit that was surprising. I would have expected 
to have e.g. all the d0 values for e.g. the 8 individual pixels in one 
SIMD register, and all the d3 values for all pixels in another SIMD 
register.


So as we're operating on 8 pixels in parallel, each of those 8 pixels have 
their own d0/d3 values, right? Or is this a case where we have just one 
d0/d3 value for a range of pixels?


// Martin


[FFmpeg-devel] [PATCH] aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux

2024-02-27 Thread Martin Storsjö
The CPU feature detection was added in
493fcde50a84cb23854335bcb0e55c6f383d55db, using HWCAP_CPUID.

The argument for using that, was that HWCAP_CPUID was added much
earlier in the kernel (in Linux v4.11), while the HWCAP flags for
individual features were added much later. And if compiling with
older userland headers that lack the bits for e.g. HWCAP_I8MM, we
wouldn't be able to detect that feature.

(In practice, e.g. Ubuntu 20.04 lacks HWCAP_I8MM in userland
headers, but the toolchain does support assembling such
instructions).

However, while the flag HWCAP_I8MM was added only in Linux v5.10,
any CPU with that feature is most likely running a kernel that is
newer than that as well. So by using HWCAP_CPUID, we could detect
that feature on kernels between v4.11 and v5.10, but that is a
quite unlikely case in practice.

By using regular hwcaps flags, the code is much simplified, and
doesn't rely on inline assembly to read the cpu id registers.

And instead of requiring the userland headers to provide the
definitions of the hwcap flags, provide our own definitions of the
constants (they are fixed constants anyway), with names not conflicting
with the ones from system headers. This avoids a number of ifdefs, and
allows detecting these features even if building with userland headers
that don't contain these definitions yet.

Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
do expose these features via HWCAP flags, but the emulated cpuid
registers are missing the bits for exposing e.g. I8MM.
---
 libavutil/aarch64/cpu.c | 30 --
 1 file changed, 8 insertions(+), 22 deletions(-)

diff --git a/libavutil/aarch64/cpu.c b/libavutil/aarch64/cpu.c
index f27fef3992..7a05391343 100644
--- a/libavutil/aarch64/cpu.c
+++ b/libavutil/aarch64/cpu.c
@@ -24,34 +24,20 @@
 #include 
 #include 
 
-#define get_cpu_feature_reg(reg, val) \
-__asm__("mrs %0, " #reg : "=r" (val))
+#define HWCAP_AARCH64_ASIMDDP (1 << 20)
+#define HWCAP2_AARCH64_I8MM   (1 << 13)
 
 static int detect_flags(void)
 {
 int flags = 0;
 
-#if defined(HWCAP_CPUID) && HAVE_INLINE_ASM
 unsigned long hwcap = getauxval(AT_HWCAP);
-// We can check for DOTPROD and I8MM using HWCAP_ASIMDDP and
-// HWCAP2_I8MM too, avoiding to read the CPUID registers (which triggers
-// a trap, handled by the kernel). However the HWCAP_* defines for these
-// extensions are added much later than HWCAP_CPUID, so the userland
-// headers might lack support for them even if the binary later is run
-// on hardware that does support it (and where the kernel might support
-// HWCAP_CPUID).
-// See 
https://www.kernel.org/doc/html/latest/arm64/cpu-feature-registers.html
-if (hwcap & HWCAP_CPUID) {
-uint64_t tmp;
-
-get_cpu_feature_reg(ID_AA64ISAR0_EL1, tmp);
-if (((tmp >> 44) & 0xf) == 0x1)
-flags |= AV_CPU_FLAG_DOTPROD;
-get_cpu_feature_reg(ID_AA64ISAR1_EL1, tmp);
-if (((tmp >> 52) & 0xf) == 0x1)
-flags |= AV_CPU_FLAG_I8MM;
-}
-#endif
+unsigned long hwcap2 = getauxval(AT_HWCAP2);
+
+if (hwcap & HWCAP_AARCH64_ASIMDDP)
+flags |= AV_CPU_FLAG_DOTPROD;
+if (hwcap2 & HWCAP2_AARCH64_I8MM)
+flags |= AV_CPU_FLAG_I8MM;
 
 return flags;
 }
-- 
2.39.3 (Apple Git-145)
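
The two detection approaches described in the commit message can be compared side by side. This is a small sketch that keeps the decoding logic in pure functions so it can be exercised on any machine; the bit positions are the ones from the patch, while FLAG_DOTPROD and FLAG_I8MM are hypothetical stand-ins for AV_CPU_FLAG_DOTPROD and AV_CPU_FLAG_I8MM:

```c
#include <stdint.h>

/* Bit values as defined in the patch. */
#define HWCAP_AARCH64_ASIMDDP (1UL << 20)
#define HWCAP2_AARCH64_I8MM   (1UL << 13)

/* Hypothetical stand-ins for the AV_CPU_FLAG_* values. */
#define FLAG_DOTPROD 1
#define FLAG_I8MM    2

/* New approach: test the feature bits reported by the kernel. */
static int flags_from_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    int flags = 0;

    if (hwcap & HWCAP_AARCH64_ASIMDDP)
        flags |= FLAG_DOTPROD;
    if (hwcap2 & HWCAP2_AARCH64_I8MM)
        flags |= FLAG_I8MM;
    return flags;
}

/* Old approach: decode the 4-bit fields of the ID registers, using the
 * same shifts and comparisons as the removed code. */
static int flags_from_id_regs(uint64_t isar0, uint64_t isar1)
{
    int flags = 0;

    if (((isar0 >> 44) & 0xf) == 0x1)
        flags |= FLAG_DOTPROD;
    if (((isar1 >> 52) & 0xf) == 0x1)
        flags |= FLAG_I8MM;
    return flags;
}
```

On Linux/aarch64 the inputs to flags_from_hwcaps() would come from getauxval(AT_HWCAP) and getauxval(AT_HWCAP2) in <sys/auxv.h>, as in the patch; the old variant additionally needed inline assembly to read the ID registers.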



Re: [FFmpeg-devel] [PATCH v4] avcodec/aarch64/hevc: add luma deblock NEON

2024-02-27 Thread Martin Storsjö

On Tue, 27 Feb 2024, J. Dekker wrote:


Benched using single-threaded full decode on an Ampere Altra.

Bpp Before  After  Speedup
8   73,3s   65,2s  1.124x
10  114,2s  104,0s 1.098x
12  125,8s  115,7s 1.087x

Signed-off-by: J. Dekker 
---

Slightly improved 12bit version.

libavcodec/aarch64/hevcdsp_deblock_neon.S | 417 ++
libavcodec/aarch64/hevcdsp_init_aarch64.c |  18 +
2 files changed, 435 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_deblock_neon.S 
b/libavcodec/aarch64/hevcdsp_deblock_neon.S
index 8227f65649..581056a91e 100644
--- a/libavcodec/aarch64/hevcdsp_deblock_neon.S
+++ b/libavcodec/aarch64/hevcdsp_deblock_neon.S
@@ -181,3 +181,420 @@ hevc_h_loop_filter_chroma 12
hevc_v_loop_filter_chroma 8
hevc_v_loop_filter_chroma 10
hevc_v_loop_filter_chroma 12
+
+.macro hevc_loop_filter_luma_body bitdepth
+function hevc_loop_filter_luma_body_\bitdepth\()_neon, export=0
+.if \bitdepth > 8
+lsl w2, w2, #(\bitdepth - 8) // beta <<= BIT_DEPTH - 8
+.else
+uxtl v0.8h, v0.8b
+uxtl v1.8h, v1.8b
+uxtl v2.8h, v2.8b
+uxtl v3.8h, v3.8b
+uxtl v4.8h, v4.8b
+uxtl v5.8h, v5.8b
+uxtl v6.8h, v6.8b
+uxtl v7.8h, v7.8b
+.endif
+ldr w7, [x3] // tc[0]
+ldr w8, [x3, #4] // tc[1]
+dup v18.4h, w7
+dup v19.4h, w8
+trn1 v18.2d, v18.2d, v19.2d
+.if \bitdepth > 8
+shl v18.8h, v18.8h, #(\bitdepth - 8)
+.endif
+dup v27.8h, w2 // beta
+// tc25
+shl v19.8h, v18.8h, #2 // * 4
+add v19.8h, v19.8h, v18.8h // (tc * 5)
+srshr   v19.8h, v19.8h, #1 // (tc * 5 + 1) >> 1
+sshr v17.8h, v27.8h, #2 // beta2
+
+// beta_2 check
+// dp0  = abs(P2  - 2 * P1  + P0)
+add v22.8h, v3.8h, v1.8h
+shl v23.8h, v2.8h, #1
+sabd v30.8h, v22.8h, v23.8h
+// dq0  = abs(Q2  - 2 * Q1  + Q0)
+add v21.8h, v6.8h, v4.8h
+shl v26.8h, v5.8h, #1
+sabd v31.8h, v21.8h, v26.8h
+// d0   = dp0 + dq0
+add v20.8h, v30.8h, v31.8h
+shl v25.8h, v20.8h, #1
+// (d0 << 1) < beta_2
+cmgt v23.8h, v17.8h, v25.8h
+
+// beta check
+// d0 + d3 < beta
+mov x9, #0x
+dup v24.2d, x9
+and v25.16b, v24.16b, v20.16b
+addp v25.8h, v25.8h, v25.8h // 1+0 0+1 1+0 0+1
+addp v25.4h, v25.4h, v25.4h // 1+0+0+1 1+0+0+1
+cmgt v25.4h, v27.4h, v25.4h // lower/upper mask in h[0/1]
+mov w9, v25.s[0]


I don't quite understand what this sequence does and/or how our data is 
laid out in our registers - we have d0 on input in v20, where's d3? And 
doesn't the "and" throw away half of the input elements here?


I see some similar patterns with the masking and handling below as well - 
I get a feeling that I don't quite understand the algorithm here, and/or 
the data layout.



+.if \bitdepth > 8
+ld1 {v0.8h}, [x0], x1
+ld1 {v1.8h}, [x0], x1
+ld1 {v2.8h}, [x0], x1
+ld1 {v3.8h}, [x0], x1
+ld1 {v4.8h}, [x0], x1
+ld1 {v5.8h}, [x0], x1
+ld1 {v6.8h}, [x0], x1
+ld1 {v7.8h}, [x0]
+mov w14, #((1 << \bitdepth) - 1)


For loads like these, we can generally save a bit by using two alternating 
registers for loading, with a double stride - see e.g. the vp9 loop 
filter implementations. But that's a micro optimization.


Other than that, this mostly looks reasonable.

// Martin



Re: [FFmpeg-devel] [PATCH 2/3] avcodec/x86: disable hevc 12b luma deblock

2024-02-24 Thread Martin Storsjö

On Sat, 24 Feb 2024, J. Dekker wrote:



Nuo Mi  writes:


On Wed, Feb 21, 2024 at 7:10 PM J. Dekker  wrote:


Over/underflow in some cases.

Signed-off-by: J. Dekker 
---
 libavcodec/x86/hevcdsp_init.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/libavcodec/x86/hevcdsp_init.c b/libavcodec/x86/hevcdsp_init.c
index 31e81eb11f..11cb1b3bfd 100644
--- a/libavcodec/x86/hevcdsp_init.c
+++ b/libavcodec/x86/hevcdsp_init.c
@@ -1205,10 +1205,11 @@ void ff_hevc_dsp_init_x86(HEVCDSPContext *c, const
int bit_depth)
 if (EXTERNAL_SSE2(cpu_flags)) {
 c->hevc_v_loop_filter_chroma =
ff_hevc_v_loop_filter_chroma_12_sse2;
 c->hevc_h_loop_filter_chroma =
ff_hevc_h_loop_filter_chroma_12_sse2;
-if (ARCH_X86_64) {
-c->hevc_v_loop_filter_luma =
ff_hevc_v_loop_filter_luma_12_sse2;
-c->hevc_h_loop_filter_luma =
ff_hevc_h_loop_filter_luma_12_sse2;
-}
+// FIXME: 12-bit luma deblock over/underflows in some cases
+// if (ARCH_X86_64) {
+// c->hevc_v_loop_filter_luma =
ff_hevc_v_loop_filter_luma_12_sse2;
+// c->hevc_h_loop_filter_luma =
ff_hevc_h_loop_filter_luma_12_sse2;
+// }
 SAO_BAND_INIT(12, sse2);
 SAO_EDGE_INIT(12, sse2);


Hi Dekker,
VVC will utilize this function as well.
Could you please share the HEVC clip or data that caused the overflow?
We'll make efforts to address it during the VVC porting



You can just run ./tests/checkasm/checkasm --test=hevc_deblock to
find a failing case.


To clarify, this is with the new checkasm test added in this patchset, not 
currently in git master - otherwise fate would be failing for everybody on 
x86.



My guess is that delta0 overflows before the right
shift; see the ARM64 asm, which specifically widens this calculation in the
12 bit variant. But I'm not 100% sure, I don't know x86 asm.


Are you sure the input is within valid range? It's always possible that 
checkasm produces inputs that the real decoder wouldn't - but it's also 
possible that this is a real decoder bug that just hasn't been triggered 
by any other test yet.
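
As a rough illustration of the suspected overflow (a sketch only, not taken 
from the patch or from the x86 asm): the weak-filter delta in the HEVC spec 
is (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4. The helper names below are made 
up for this example, and wrap_to_int16() merely models a 16-bit SIMD lane 
that wraps around. With 12-bit samples the intermediate value can leave the 
signed 16-bit range, while 10-bit samples cannot reach it.

```c
#include <assert.h>

/* Model what a wrapping 16-bit signed SIMD lane would hold for 'value'.
 * Hypothetical helper, for this sketch only. */
static int wrap_to_int16(int value)
{
    int low = value & 0xFFFF;                   /* keep the low 16 bits */
    return low >= 0x8000 ? low - 0x10000 : low; /* reinterpret as signed */
}

/* Weak-filter delta from the HEVC spec, with every intermediate result
 * forced back into a 16-bit lane. Assumes that >> on a negative int is an
 * arithmetic shift, as on the compilers FFmpeg is normally built with. */
static int delta0_16bit(int p1, int p0, int q0, int q1)
{
    int acc = wrap_to_int16(9 * (q0 - p0));
    acc = wrap_to_int16(acc - wrap_to_int16(3 * (q1 - p1)));
    acc = wrap_to_int16(acc + 8);
    return acc >> 4;
}

/* The same formula in plain int, i.e. with widened intermediates. */
static int delta0_wide(int p1, int p0, int q0, int q1)
{
    return (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4;
}

/* Returns 1 if the 16-bit evaluation matches the widened one. */
static int delta0_fits_16bit(int p1, int p0, int q0, int q1)
{
    assert(p1 >= 0 && p0 >= 0 && q0 >= 0 && q1 >= 0);
    return delta0_16bit(p1, p0, q0, q1) == delta0_wide(p1, p0, q0, q1);
}
```

For p0 = 0 and q0 = 4095 (12-bit extremes), 9 * 4095 = 36855 already exceeds 
32767, so the two evaluations disagree; with q0 = 1023 (10-bit) they agree. 
Whether a real decoder can ever feed such samples into the filter is the 
open question above.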


// Martin


[FFmpeg-devel] [GASPP PATCH] Don't mangle .L local labels for ELF targets

2024-02-22 Thread Martin Storsjö
This fixes building FFmpeg's libavcodec/aarch64/h264idct_neon.S
for a Linux target. (It hasn't been necessary to use gas-preprocessor
for such a target for a very long time, but it can still be useful to
be able to test gas-preprocessor there.)
---
 gas-preprocessor.pl | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/gas-preprocessor.pl b/gas-preprocessor.pl
index ba75611..2880858 100755
--- a/gas-preprocessor.pl
+++ b/gas-preprocessor.pl
@@ -738,7 +738,10 @@ sub handle_serialized_line {
 }
 
 # mach-o local symbol names start with L (no dot)
-$line =~ s/(?
[...]


Re: [FFmpeg-devel] [PATCH 3/3] avcodec/aarch64: add hevc deblock NEON

2024-02-21 Thread Martin Storsjö

On Wed, 21 Feb 2024, J. Dekker wrote:


Benched using single-threaded full decode on an Ampere Altra.

Bpp Before  After  Speedup
8   73,3s   65,2s  1.124x
10  114,2s  104,0s 1.098x
12  125,8s  115,7s 1.087x

Signed-off-by: J. Dekker 
---
libavcodec/aarch64/hevcdsp_deblock_neon.S | 421 ++
libavcodec/aarch64/hevcdsp_init_aarch64.c |  18 +
2 files changed, 439 insertions(+)



+0:  // STRONG FILTER
+
+// P0 = p0 + av_clip(((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3) - p0, -tc3, tc3);
+add v21.8h, v2.8h, v3.8h   // (p1 + p0
+add v21.8h, v4.8h, v21.8h  // + q0)
+shl v21.8h, v21.8h, #1 //   * 2
+add v22.8h, v1.8h, v5.8h   //   (p2 + q1)
+add v21.8h, v22.8h, v21.8h // +
+srshr v21.8h, v21.8h, #3 //   >> 3
+sub v21.8h, v21.8h, v3.8h  //- p0
+


The srshr line is incorrectly indented here (and elsewhere)


+sqxtun  v4.8b, v4.8h
+sqxtun  v5.8b, v5.8h
+sqxtun  v6.8b, v6.8h
+sqxtun  v7.8b, v7.8h
+.endif
+ret
+3:  ret x6


Please indent the "x6" here like other operands


+.macro hevc_loop_filter_luma dir bitdepth
+function ff_hevc_\dir\()_loop_filter_luma_\bitdepth\()_neon, export=1
+mov x6, x30
+.if \dir == v


In GAS assembler, .if does a numerical comparison - it can't do string 
comparisons.


The right way to do this is to do ".ifc \dir, v", which does a string 
comparison.


(If you really do need to do this like a numerical comparison, it's 
possible to define e.g. "v" as a numeric symbol as well, see e.g. 
https://code.videolan.org/videolan/dav1d/-/merge_requests/1603/diffs?commit_id=d4746c908c56cb2e8545efd348b8cdc13f2f2253 
but that's not really the nicest way to do it.)


This issue breaks compilation with Clang. With gas-preprocessor (for 
MSVC), it builds without errors, but the generated code does the wrong thing.



To avoid having to test all these corner-case build configurations 
manually, and having to remember to check indentation and so on, I've set 
up a PoC for testing such things on GitHub Actions.


If you have a repo on github, grab my commits from 
https://github.com/mstorsjo/FFmpeg/commits/gha-aarch64 (there are a couple 
of them), add your changes on top of these, and push it as a branch to 
your own github repo, then check the output from the actions.


Here's the output of a run with the patches you just posted: 
https://github.com/mstorsjo/FFmpeg/actions/runs/7988312683


// Martin



Re: [FFmpeg-devel] [PATCH] checkasm: Add a "run-checkasm" make target

2024-02-21 Thread Martin Storsjö

On Wed, 14 Feb 2024, Martin Storsjö wrote:


Contrary to the existing "fate-checkasm", this always prints the
tool output, and runs all tests at once instead of splitting it up
per target group. This is more useful when the user expects to
look directly at the tool output, instead of being part of a full
fate run.

(On failure with the regular "make fate-checkasm" targets, none of
the tool output is printed, but stored in files. If run with reporting
set up to the FATE website, the individual failures are uploaded there,
but if it is run in some sort of other CI setup, the intermediate files
might not be available afterwards for inspection.)
---
tests/checkasm/Makefile | 4 
1 file changed, 4 insertions(+)

diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 3562acb2b2..3af42a679b 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -91,6 +91,10 @@ CHECKASM := tests/checkasm/checkasm$(EXESUF)
$(CHECKASM): $(CHECKASMOBJS) $(FF_STATIC_DEP_LIBS)
$(LD) $(LDFLAGS) $(LDEXEFLAGS) $(LD_O) $(CHECKASMOBJS) 
$(FF_STATIC_DEP_LIBS) $(EXTRALIBS-avcodec) $(EXTRALIBS-avfilter) 
$(EXTRALIBS-avformat) $(EXTRALIBS-avutil) $(EXTRALIBS-swresample) $(EXTRALIBS)

+run-checkasm: $(CHECKASM)
+run-checkasm:
+   $(TARGET_EXEC) $(TARGET_PATH)/$(CHECKASM)


I've amended this locally with a $(Q) at the start, to silence the 
executed command, unless executed with V=1.


I'll push this patch later today if there aren't any objections.

// Martin


Re: [FFmpeg-devel] [PATCH] avutil/intreadwrite: Remove obsolete warning

2024-02-19 Thread Martin Storsjö

On Mon, 19 Feb 2024, Andreas Rheinhardt wrote:


Andreas Rheinhardt:

Obsolete since 7ec2354c38978b918dc079b611393becb6c80bf7.

Signed-off-by: Andreas Rheinhardt 
---
 libavutil/intreadwrite.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/libavutil/intreadwrite.h b/libavutil/intreadwrite.h
index 21df7887f3..d0a5773b54 100644
--- a/libavutil/intreadwrite.h
+++ b/libavutil/intreadwrite.h
@@ -583,9 +583,7 @@ union unaligned_16 { uint16_t l; } __attribute__((packed)) 
av_alias;
 #endif

 /* Parameters for AV_COPY*, AV_SWAP*, AV_ZERO* must be
- * naturally aligned. They may be implemented using MMX,
- * so emms_c() must be called before using any float code
- * afterwards.
+ * naturally aligned.
  */

 #define AV_COPY(n, d, s) \


Will apply this patch tomorrow unless there are objections.


LGTM, thanks!

// Martin



Re: [FFmpeg-devel] [PATCH] flvdec: Honor the "flv_metadata" option for the "datastream" metadata field

2024-02-19 Thread Martin Storsjö

On Fri, 9 Feb 2024, Martin Storsjö wrote:


By default the option "flv_metadata" (internally using the field
name "trust_metadata") is set to 0, meaning that we don't allocate
streams based on information in the metadata, only based on
actual streams we encounter. However the "datastream" metadata field
still would allocate a subtitle stream.

When muxing, the "datastream" field is added if either a data stream
or subtitle stream is present - but the same metadata field is used
to preemptively create a subtitle stream only. Thus, if the field
was added due to a data stream, not a subtitle stream, the demuxer
would create a stream which won't get any actual packets.

If there was such an extra, empty subtitle stream, running
avformat_find_stream_info still used to terminate within reasonable
time before 3749eede66c3774799766b1f246afae8a6ffc9bb. After that
commit, it no longer would terminate until it reaches the max
analyze duration, which is 90 seconds for flv streams (see
e6a084641aada7a2e4672172f2ee26642800a361,
24fdf7334d2bb9aab0abdbc878b8ae51eb57c86b and
f58e011a1f30332ba824c155078ca701e29aef63).

Before that commit (which removed the deprecated AVStream.codec), the
"st->codecpar->codec_id = AV_CODEC_ID_TEXT", set within the demuxer,
would get propagated into st->codec->codec_id by numerous
avcodec_parameters_to_context(st->codec, st->codecpar), then further
into st->internal->avctx->codec_id by update_stream_avctx within
read_frame_internal in libavformat/utils.c (demux.c these days).
---
libavformat/flvdec.c | 12 ++--
1 file changed, 6 insertions(+), 6 deletions(-)


Will push soon if there are no objections.
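
For readers skimming the thread, the intended behaviour can be sketched 
roughly as follows. This is a simplified model, not the code in 
libavformat/flvdec.c: DemuxSketch and handle_metadata_key() are invented 
names, and the real parser handles many more keys and allocates an actual 
AVStream.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the demuxer state; not FFmpeg's FLVContext. */
typedef struct DemuxSketch {
    int trust_metadata;      /* mirrors the "flv_metadata" option, default 0 */
    int nb_subtitle_streams; /* subtitle streams created from metadata alone */
} DemuxSketch;

/* Handle one metadata key. Returns 1 if a stream was allocated because of
 * it, 0 otherwise. Streams are only created from metadata when the caller
 * has opted in via trust_metadata. */
static int handle_metadata_key(DemuxSketch *d, const char *key)
{
    assert(d && key);
    if (!d->trust_metadata)
        return 0; /* only allocate streams for packets actually seen */
    if (!strcmp(key, "datastream")) {
        d->nb_subtitle_streams++;
        return 1;
    }
    return 0;
}
```

With the default trust_metadata = 0, a "datastream" key no longer produces 
an empty subtitle stream, so there is nothing for avformat_find_stream_info 
to wait on.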

// Martin


[FFmpeg-devel] [PATCH] checkasm: Add a "run-checkasm" make target

2024-02-14 Thread Martin Storsjö
Contrary to the existing "fate-checkasm", this always prints the
tool output, and runs all tests at once instead of splitting it up
per target group. This is more useful when the user expects to
look directly at the tool output, instead of being part of a full
fate run.

(On failure with the regular "make fate-checkasm" targets, none of
the tool output is printed, but stored in files. If run with reporting
set up to the FATE website, the individual failures are uploaded there,
but if it is run in some sort of other CI setup, the intermediate files
might not be available afterwards for inspection.)
---
 tests/checkasm/Makefile | 4 
 1 file changed, 4 insertions(+)

diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 3562acb2b2..3af42a679b 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -91,6 +91,10 @@ CHECKASM := tests/checkasm/checkasm$(EXESUF)
 $(CHECKASM): $(CHECKASMOBJS) $(FF_STATIC_DEP_LIBS)
$(LD) $(LDFLAGS) $(LDEXEFLAGS) $(LD_O) $(CHECKASMOBJS) 
$(FF_STATIC_DEP_LIBS) $(EXTRALIBS-avcodec) $(EXTRALIBS-avfilter) 
$(EXTRALIBS-avformat) $(EXTRALIBS-avutil) $(EXTRALIBS-swresample) $(EXTRALIBS)
 
+run-checkasm: $(CHECKASM)
+run-checkasm:
+   $(TARGET_EXEC) $(TARGET_PATH)/$(CHECKASM)
+
 checkasm: $(CHECKASM)
 
 testclean:: checkasmclean
-- 
2.39.3 (Apple Git-145)



Re: [FFmpeg-devel] [PATCH] lavc/aarch64/fdct: add neon-optimized fdct for aarch64

2024-02-14 Thread Martin Storsjö

Hi,

On Sun, 4 Feb 2024, Ramiro Polla wrote:


The code is imported from libjpeg-turbo-3.0.1. The neon registers used
have been changed to avoid modifying v8-v15.
---


I don't remember if there are any extra procedures we need to follow when 
importing foreign code with a differing license. The license here seems 
fine in any case though.


This seems to work fine in all my test environments. And thanks for making 
sure it doesn't use v8-v15!


I'm not so familiar with these DSP functions, or whether it is the norm to 
add a new constant like FF_DCT_NEON, but it seems to match the pattern of 
the existing code.



I presume the main case that tests this is "make fate-dct8x8", which 
builds and executes libavcodec/tests/dct? How much work would it be to 
integrate testing of these routines into checkasm? That way we could rest 
assured that the assembly passes all such ABI checks that we do there, 
including what registers must not be clobbered.



The assembly uses a different indentation width than the rest of our 
assembly. I recently spent some effort on cleaning that up so that our 
code is mostly consistent, so I'd prefer not to add new code that deviates 
from it. It primarily looks like you'd need to add 4 spaces at the start 
of each line.


I've used a script for mostly automatically reindenting our arm assembly; 
you can grab it at https://martin.st/temp/ffmpeg-asm-indent.pl and run it as 
"cat file.S | ./ffmpeg-asm-indent.pl > tmp; mv tmp file.S". It's not 100% 
accurate, but mostly gets you there, so it's good to manually check the 
result afterwards as well.


// Martin



Re: [FFmpeg-devel] [FFmpeg-cvslog] lavf/assenc: normalize line endings to \n

2024-02-13 Thread Martin Storsjö

On Tue, 13 Feb 2024, Ridley Combs wrote:


It looks like checkout has different behavior from reset, and fate uses a
hard reset.
To test, I committed the change adding tests/ref/** -text,
unix2dos'd tests/ref/fate/sub-scc, then ran git -c core.autocrlf=true reset
--quiet --hard; this dos2unix'd the file as expected when run with a working
tree containing the .gitattributes change (but not otherwise).


Git doesn't have any "memory" of the CRLFiness of a file beyond the content
of the file itself (whether in the working tree or in committed blobs). It
just doesn't necessarily replace every file in checkout invocations when
they differ only in line endings. Windows was a mistake.


To rephrase; reset vs checkout doesn't make any difference here.

It seems to simply be the case, that as long as there are no changes to 
the file contents themselves between the relevant git commits, and the 
file isn't flagged as dirty in the stat cache of the local workdir, git 
never revisits the .gitattributes for this particular file.


// Martin


Re: [FFmpeg-devel] [PATCH] fate/subtitles: Ignore line endings for sub-scc test

2024-02-13 Thread Martin Storsjö via ffmpeg-devel

On Tue, 13 Feb 2024, Andreas Rheinhardt wrote:


Since 7bf1b9b35769b37684dd2f18a54f01d852a540c8,
the test produces ordinary \n, yet this is not what the reference
file has used most of the time, leading to test failures.

Signed-off-by: Andreas Rheinhardt 
---
tests/fate/subtitles.mak | 1 +
1 file changed, 1 insertion(+)

diff --git a/tests/fate/subtitles.mak b/tests/fate/subtitles.mak
index cea4c810dd..90412e9ac1 100644
--- a/tests/fate/subtitles.mak
+++ b/tests/fate/subtitles.mak
@@ -114,6 +114,7 @@ fate-sub-charenc: CMD = fmtstdout ass -sub_charenc cp1251 
-i $(TARGET_SAMPLES)/s

FATE_SUBTITLES-$(call DEMDEC, SCC, CCAPTION) += fate-sub-scc
fate-sub-scc: CMD = fmtstdout ass -ss 57 -i $(TARGET_SAMPLES)/sub/witch.scc
+fate-sub-scc: CMP = diff

FATE_SUBTITLES-$(call DEMMUX, SCC, SCC) += fate-sub-scc-remux
fate-sub-scc-remux: CMD = fmtstdout scc -i $(TARGET_SAMPLES)/sub/witch.scc -ss 
4:00 -map 0 -c copy
--
2.34.1


Looks ok to me, as a temporary measure until we figure out the best way to 
upgrade everybody's workdirs without needing interaction. (As an added 
note to the other thread; even if we could easily patch fate.sh, every 
current user's workdir is also prone to this issue, and the way of fixing 
it is kinda non-obvious.)


// Martin



Re: [FFmpeg-devel] [FFmpeg-cvslog] lavf/assenc: normalize line endings to \n

2024-02-13 Thread Martin Storsjö

On Tue, 13 Feb 2024, Ridley Combs wrote:


It looks like checkout has different behavior from reset, and fate uses a
hard reset.
To test, I committed the change adding tests/ref/** -text,
unix2dos'd tests/ref/fate/sub-scc, then ran git -c core.autocrlf=true reset
--quiet --hard; this dos2unix'd the file as expected when run with a working
tree containing the .gitattributes change (but not otherwise).


The difference here seems to be that you actively modify 
tests/ref/fate/sub-scc, which causes git to consider the file as needing 
to be restored when you run git reset.


When fate updates from one version to another, the files won't be locally 
modified, i.e. git's stat cache or similar has this file flagged as "not 
dirty".


So I suggest you retry your procedure by not manually modifying the file, 
but just letting git handle it, simulating exactly what happens on fate 
instances when updating from one version to another.


I.e., first check out 7bf1b9b3576~, nuke the file and check it out again, 
make sure that it contains CRLF. Then check out current master, which 
lacks attributes, but the local file in your workdir still contains CRLF. 
Then do any series of "git reset --hard", with/without "-c core.autocrlf", 
to commits on your experimental branch, and it won't change the line 
endings of the ref file, unless there actually are content changes to that 
particular file, between the git commits that you do check out.


// Martin


Re: [FFmpeg-devel] [FFmpeg-cvslog] lavf/assenc: normalize line endings to \n

2024-02-13 Thread Martin Storsjö

On Tue, 13 Feb 2024, Ridley Combs via ffmpeg-devel wrote:


On Feb 13, 2024, at 01:28, Anton Khirnov  wrote:

Quoting Martin Storsjö (2024-02-12 12:31:29)

On Mon, 12 Feb 2024, Hendrik Leppkes wrote:


On Mon, Feb 12, 2024 at 11:22 AM Martin Storsjö  wrote:


diff --git a/.gitattributes b/.gitattributes
index 5a19b963b6..a900528e47 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -1,2 +1 @@
*.pnm -diff -text
-tests/ref/fate/sub-scc eol=crlf


This change seems to have had a tricky effect on the
tests/ref/fate/sub-scc file. Previously, when checked out, users got the
file with CRLF newlines. When updating to this git commit, or past it,
that file remains untouched, with CRLF still present, and the
fate-sub-scc test fails. If one does "rm tests/ref/fate/sub-scc; git
checkout tests/ref/fate/sub-scc", then the file does get restored with LF
newlines, and the test passes.

It's easy to do this change manually in the source checkout of a fate
runner, but I'm not sure how easily we get all fate instances fixed that
way - currently this test is failing in most of them.



Can this be fixed by restoring the .gitattribute entry but with eol=lf?
Not sure if Git would reset the file then.


No, that doesn't seem to make any difference. Not sure if there are any 
other straightforward/elegant fixes, short of renaming the file, which I 
guess would require renaming the test itself.


I'm fine with renaming the test, unless anyone has a better fix.


We could probably tweak the fate runner script to make sure this gets 
fixed up; can anyone try this patch on one of the affected machines? 
https://gist.github.com/rcombs/c2ad470bf36c5cbd3fc33e699330eb15


That doesn't seem to make any difference.

Also, updating fate.sh doesn't necessarily propagate automatically to 
runners - in order to run fate, one needs to run fate.sh before it even 
clones/checks out the directory where it fetches the latest source. So 
unless one later has changed one's setup, to invoke a fate.sh from the 
checkout, most fate runners just use whatever copy of fate.sh they had 
when it was set up.


Alternately, we could set -text on all fate ref files, or explicitly set 
eol=lf for them, to ensure their line endings never get rewritten like 
this regardless of git config. I think either of these solutions would 
fix this in fate, but only after the fix commit gets checked out 
*followed by* at least one other commit.


Neither of those seem to make any difference either.

It's quite easy to test for oneself:

$ git checkout -b experiment
$ 
$ 

$ git checkout 7bf1b9b3576~ # Reset original state, for testing
$ rm tests/ref/fate/sub-scc; git checkout tests/ref/fate/sub-scc
$ vi tests/ref/fate/sub-scc # inspect that the file originally has CRLF
$ git checkout experiment~ # check out the commit setting attributes
$ git checkout experiment # check out the next commit, with the new attributes set
$ vi tests/ref/fate/sub-scc # observe that the file still has CRLF

$ git checkout --detach
$ git -c core.autocrlf=false reset --hard 7bf1b9b3576
$ vi tests/ref/fate/sub-scc # observe that the file still has CRLF


It seems to me (I haven't tried to dig into the manuals) that the attribute 
gets stuck in whatever form it was when the file was first created in the 
workdir. E.g. doing a "git checkout d1df72a702~" (the commit before the 
file was originally added) followed by "git checkout 7bf1b9b3576" does fix 
it. This is at least observed with git 2.25.1. Not sure if this is 
intended behaviour or a bug from git's side.


// Martin


Re: [FFmpeg-devel] [FFmpeg-cvslog] lavf/assenc: normalize line endings to \n

2024-02-12 Thread Martin Storsjö

On Mon, 12 Feb 2024, Hendrik Leppkes wrote:


On Mon, Feb 12, 2024 at 11:22 AM Martin Storsjö  wrote:

>
> diff --git a/.gitattributes b/.gitattributes
> index 5a19b963b6..a900528e47 100644
> --- a/.gitattributes
> +++ b/.gitattributes
> @@ -1,2 +1 @@
> *.pnm -diff -text
> -tests/ref/fate/sub-scc eol=crlf

This change seems to have had a tricky effect on the
tests/ref/fate/sub-scc file. Previously, when checked out, users got the
file with CRLF newlines. When updating to this git commit, or past it,
that file remains untouched, with CRLF still present, and the
fate-sub-scc test fails. If one does "rm tests/ref/fate/sub-scc; git
checkout tests/ref/fate/sub-scc", then the file does get restored with LF
newlines, and the test passes.

It's easy to do this change manually in the source checkout of a fate
runner, but I'm not sure how easily we get all fate instances fixed that
way - currently this test is failing in most of them.



Can this be fixed by restoring the .gitattribute entry but with eol=lf?
Not sure if Git would reset the file then.


No, that doesn't seem to make any difference. Not sure if there are any 
other straightforward/elegant fixes, short of renaming the file, which I 
guess would require renaming the test itself.


// Martin


Re: [FFmpeg-devel] [FFmpeg-cvslog] lavf/assenc: normalize line endings to \n

2024-02-12 Thread Martin Storsjö

On Mon, 12 Feb 2024, rcombs wrote:


ffmpeg | branch: master | rcombs  | Sun Jan 28 14:27:17 2024 
-0800| [7bf1b9b35769b37684dd2f18a54f01d852a540c8] | committer: rcombs

lavf/assenc: normalize line endings to \n

Previously, we produced output with either \r\n or mixed line endings.
This was undesirable unto itself, but also made working with patches affecting
FATE output particularly challenging, especially via the mailing list.

Everything that consumes the SSA/ASS format is line-ending-agnostic,
so \n is selected to simplify git/ML usage in FATE.

Extra \r characters at the end of a packet are dropped. These are always
ignored by the renderer anyway.


http://git.videolan.org/gitweb.cgi/ffmpeg.git/?a=commit;h=7bf1b9b35769b37684dd2f18a54f01d852a540c8

---

.gitattributes  |   1 -
libavformat/assenc.c|  22 ++--
tests/ref/fate/sub-aqtitle  |  94 
tests/ref/fate/sub-ass-to-ass-transcode | 124 ++---
tests/ref/fate/sub-cc   |  32 +++---
tests/ref/fate/sub-cc-realtime  |  44 
tests/ref/fate/sub-cc-scte20|  34 +++---
tests/ref/fate/sub-charenc  | 128 +++---
tests/ref/fate/sub-jacosub  |  50 -
tests/ref/fate/sub-microdvd |  48 -
tests/ref/fate/sub-movtext  |  34 +++---
tests/ref/fate/sub-mpl2 |  36 +++
tests/ref/fate/sub-mpsub|  70 ++--
tests/ref/fate/sub-mpsub-frames |  32 +++---
tests/ref/fate/sub-pjs  |  34 +++---
tests/ref/fate/sub-realtext |  38 +++
tests/ref/fate/sub-sami |  46 
tests/ref/fate/sub-sami2| 186 
tests/ref/fate/sub-srt  | 102 +-
tests/ref/fate/sub-srt-badsyntax|  48 -
tests/ref/fate/sub-ssa-to-ass-remux | 168 ++---
tests/ref/fate/sub-stl  |  62 +--
tests/ref/fate/sub-subviewer|  34 +++---
tests/ref/fate/sub-subviewer1   |  48 -
tests/ref/fate/sub-vplayer  |  34 +++---
tests/ref/fate/sub-webvtt   |  58 +-
tests/ref/fate/sub-webvtt2  |  52 -
27 files changed, 831 insertions(+), 828 deletions(-)

diff --git a/.gitattributes b/.gitattributes
index 5a19b963b6..a900528e47 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -1,2 +1 @@
*.pnm -diff -text
-tests/ref/fate/sub-scc eol=crlf


This change seems to have had a tricky effect on the 
tests/ref/fate/sub-scc file. Previously, when checked out, users got the 
file with CRLF newlines. When updating to this git commit, or past it, 
that file remains untouched, with CRLF still present, and the 
fate-sub-scc test fails. If one does "rm tests/ref/fate/sub-scc; git 
checkout tests/ref/fate/sub-scc", then the file does get restored with LF 
newlines, and the test passes.


It's easy to do this change manually in the source checkout of a fate 
runner, but I'm not sure how easily we get all fate instances fixed that 
way - currently this test is failing in most of them.


// Martin



Re: [FFmpeg-devel] [PATCH] avcodec/dca_core: Remove unused emms.h inclusion

2024-02-09 Thread Martin Storsjö

On Fri, 9 Feb 2024, Andreas Rheinhardt wrote:


Possible since 7ec2354c38978b918dc079b611393becb6c80bf7.

Signed-off-by: Andreas Rheinhardt 
---
libavcodec/dca_core.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/libavcodec/dca_core.c b/libavcodec/dca_core.c
index 5dd727fc72..697fc74295 100644
--- a/libavcodec/dca_core.c
+++ b/libavcodec/dca_core.c
@@ -19,7 +19,6 @@
 */

#include "libavutil/channel_layout.h"
-#include "libavutil/emms.h"
#include "dcaadpcm.h"
#include "dcadec.h"
#include "dcadata.h"
--
2.34.1


LGTM and thanks!

// Martin



Re: [FFmpeg-devel] [PATCH v2] lavc/dxv: align to 4x4 blocks instead of 16x16

2024-02-09 Thread Martin Storsjö

On Fri, 9 Feb 2024, Connor Worley wrote:


The previous assumption that DXV needs to be aligned to 16x16 was
erroneous. 4x4 works just as well, and FATE decoder tests pass for all
texture formats.

On the encoder side, we should reject input that isn't 4x4 aligned,
like the HAP encoder does, and stop aligning to 16x16. This both solves
the uninitialized reads causing current FATE tests to fail and produces
smaller encoded outputs.

With regard to correctness, I've checked the decoding path by encoding a
real-world sample with git master, and decoding it with
 ffmpeg -i dxt1-master.mov -c:v rawvideo -f framecrc -
The results are exactly the same between master and this patch.

On the encoding side, I've encoded a real-world sample with both master
and this patch, and decoded both versions with
 ffmpeg -i dxt1-{master,patch}.mov -c:v rawvideo -f framecrc -
Under this patch, results for both inputs are exactly the same.

In other words, the extra padding gained by 16x16 alignment over 4x4
alignment has no impact on decoded video.

Signed-off-by: Connor Worley 
---
libavcodec/dxv.c|  6 +++---
libavcodec/dxvenc.c | 14 +++---
tests/ref/fate/dxv3enc-dxt1 |  2 +-
3 files changed, 15 insertions(+), 7 deletions(-)


LGTM, will push soon to get FATE back to green again.
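
To make the size difference concrete, here is a small sketch of the two 
alignment choices. The helper names are made up for this example and are 
not the ones used in dxv.c or dxvenc.c; align_up() is the same arithmetic 
as FFALIGN, and the code only shows the padding calculation and the kind of 
check an encoder could use to reject unaligned input.

```c
#include <assert.h>

/* Round 'value' up to the next multiple of 'align'.
 * 'align' must be a power of two. */
static int align_up(int value, int align)
{
    assert(align > 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Returns 1 if both dimensions are multiples of the 4x4 texture block
 * size, 0 otherwise. */
static int is_4x4_aligned(int width, int height)
{
    return width % 4 == 0 && height % 4 == 0;
}
```

For a 1920x1080 frame, align_up(1080, 16) gives 1088 while align_up(1080, 4) 
stays at 1080, so 16x16 alignment adds 8 rows of padding that 4x4 alignment 
does not need.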

// Martin



Re: [FFmpeg-devel] [PATCH v2 2/2] avcodec/hevc_mp4toannexb: check bytes left for nalu_len

2024-02-09 Thread Martin Storsjö

On Fri, 9 Feb 2024, Nuo Mi wrote:


similar issue as in the previous commit
---
libavcodec/bsf/hevc_mp4toannexb.c | 6 --
1 file changed, 4 insertions(+), 2 deletions(-)


Keep in mind that while the patches are posted together, they can end up 
in different places later in review and in the commit history, so each 
commit message should ideally be understandable on its own.


// Martin



Re: [FFmpeg-devel] [PATCH] x86: Remove inline MMX assembly that clobbers the FPU state

2024-02-09 Thread Martin Storsjö

On Fri, 26 Jan 2024, Martin Storsjö wrote:


On Fri, 26 Jan 2024, Martin Storsjö wrote:


These inline implementations of AV_COPY64, AV_SWAP64 and AV_ZERO64
are known to clobber the FPU state - which has to be restored
with the 'emms' instruction afterwards.

This was known and signaled with the FF_COPY_SWAP_ZERO_USES_MMX
define, which calling code seems to have been supposed to check,
in order to call emms_c() after using them. See
0b1972d4096df5879038f0af776f87f41e90ebd4,
29c4c0886d143790fcbeddbe40a23dfc6f56345c and
df215e575850e41b19aeb1fd99e53372a6b3d537 for history on earlier
fixes in the same area.

However, new code can use these AV_*64() macros without knowing
about the need to call emms_c().

Just get rid of these dangerous inline assembly snippets; this
doesn't make any difference for 64 bit architectures anyway.

Signed-off-by: Martin Storsjö 
---
libavcodec/dca_core.c| 16 
libavutil/x86/intreadwrite.h | 36 
2 files changed, 52 deletions(-)


I forgot to add some more context here; the VVC tests fail on i386 in some 
cases. 
https://patchwork.ffmpeg.org/project/ffmpeg/patch/20240125170518.61211-1-p...@frankplowman.com/ 
fixes this, by using av_log2() instead of the float log2() in the VVC 
decoder. This patch fixes the same issue as well, by eliminating the FPU 
state clobbering (so that float math functions anywhere in decoders work as 
expected).


If there are no better suggestions here, I would like to go ahead and push 
this.
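
As an illustration of why dropping the MMX versions should be harmless (a 
sketch with hypothetical function names, not the actual macros in 
libavutil/intreadwrite.h): 64-bit copy, swap and zero can be written with 
plain integer accesses. Compilers normally lower these to general-purpose 
or SSE moves, which leave the x87/MMX register state alone, so no emms is 
needed afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy 8 bytes from src to dst through an integer temporary. */
static void copy64(void *dst, const void *src)
{
    uint64_t tmp;
    assert(dst && src);
    memcpy(&tmp, src, sizeof(tmp));
    memcpy(dst, &tmp, sizeof(tmp));
}

/* Exchange the 8 bytes at a with the 8 bytes at b. */
static void swap64(void *a, void *b)
{
    uint64_t ta, tb;
    assert(a && b);
    memcpy(&ta, a, sizeof(ta));
    memcpy(&tb, b, sizeof(tb));
    memcpy(a, &tb, sizeof(tb));
    memcpy(b, &ta, sizeof(ta));
}

/* Clear 8 bytes at dst. */
static void zero64(void *dst)
{
    assert(dst);
    memset(dst, 0, sizeof(uint64_t));
}
```

On 64-bit targets each of these typically becomes one or two register-width 
moves, which matches the remark in the commit message that the inline 
assembly makes no difference there.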


// Martin


[FFmpeg-devel] [PATCH] flvdec: Honor the "flv_metadata" option for the "datastream" metadata field

2024-02-09 Thread Martin Storsjö
By default the option "flv_metadata" (internally using the field
name "trust_metadata") is set to 0, meaning that we don't allocate
streams based on information in the metadata, only based on
actual streams we encounter. However the "datastream" metadata field
still would allocate a subtitle stream.

When muxing, the "datastream" field is added if either a data stream
or subtitle stream is present - but the same metadata field is used
to preemptively create a subtitle stream only. Thus, if the field
was added due to a data stream, not a subtitle stream, the demuxer
would create a stream which won't get any actual packets.

If there was such an extra, empty subtitle stream, running
avformat_find_stream_info still used to terminate within reasonable
time before 3749eede66c3774799766b1f246afae8a6ffc9bb. After that
commit, it no longer would terminate until it reaches the max
analyze duration, which is 90 seconds for flv streams (see
e6a084641aada7a2e4672172f2ee26642800a361,
24fdf7334d2bb9aab0abdbc878b8ae51eb57c86b and
f58e011a1f30332ba824c155078ca701e29aef63).

Before that commit (which removed the deprecated AVStream.codec), the
"st->codecpar->codec_id = AV_CODEC_ID_TEXT", set within the demuxer,
would get propagated into st->codec->codec_id by numerous
avcodec_parameters_to_context(st->codec, st->codecpar), then further
into st->internal->avctx->codec_id by update_stream_avctx within
read_frame_internal in libavformat/utils.c (demux.c these days).
---
 libavformat/flvdec.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/libavformat/flvdec.c b/libavformat/flvdec.c
index e25b5bd163..d898341871 100644
--- a/libavformat/flvdec.c
+++ b/libavformat/flvdec.c
@@ -627,12 +627,7 @@ static int amf_parse_object(AVFormatContext *s, AVStream *astream,
 else if (!strcmp(key, "audiodatarate") &&
  0 <= (int)(num_val * 1024.0))
 flv->audio_bit_rate = num_val * 1024.0;
-else if (!strcmp(key, "datastream")) {
-AVStream *st = create_stream(s, AVMEDIA_TYPE_SUBTITLE);
-if (!st)
-return AVERROR(ENOMEM);
-st->codecpar->codec_id = AV_CODEC_ID_TEXT;
-} else if (!strcmp(key, "framerate")) {
+else if (!strcmp(key, "framerate")) {
 flv->framerate = av_d2q(num_val, 1000);
 if (vstream)
 vstream->avg_frame_rate = flv->framerate;
@@ -654,6 +649,11 @@ static int amf_parse_object(AVFormatContext *s, AVStream *astream,
 vpar->width = num_val;
 } else if (!strcmp(key, "height") && vpar) {
 vpar->height = num_val;
+} else if (!strcmp(key, "datastream")) {
+AVStream *st = create_stream(s, AVMEDIA_TYPE_SUBTITLE);
+if (!st)
+return AVERROR(ENOMEM);
+st->codecpar->codec_id = AV_CODEC_ID_TEXT;
 }
 }
 }
-- 
2.39.3 (Apple Git-145)



Re: [FFmpeg-devel] [PATCH 24/24] libs: bump major version for all libraries

2024-01-26 Thread Martin Storsjö

On Fri, 26 Jan 2024, James Almer wrote:


On 1/26/2024 1:52 PM, Martin Storsjö wrote:

On Fri, 26 Jan 2024, James Almer wrote:


On 1/26/2024 1:44 PM, Vittorio Giovara wrote:

On Thu, Jan 25, 2024 at 2:48 PM James Almer  wrote:


Signed-off-by: James Almer 
---
  doc/APIchanges    | 2 +-
  libavcodec/version.h  | 2 +-
  libavcodec/version_major.h    | 2 +-
  libavdevice/version.h | 2 +-
  libavdevice/version_major.h   | 2 +-
  libavfilter/version.h | 2 +-
  libavfilter/version_major.h   | 2 +-
  libavformat/version.h | 2 +-
  libavformat/version_major.h   | 2 +-
  libavutil/version.h   | 6 +++---
  libpostproc/version.h | 2 +-
  libpostproc/version_major.h   | 2 +-
  libswresample/version.h   | 2 +-
  libswresample/version_major.h | 2 +-
  libswscale/version.h  | 2 +-
  libswscale/version_major.h    | 2 +-
  16 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/doc/APIchanges b/doc/APIchanges
index e477ed78e0..60711379a1 100644
--- a/doc/APIchanges
+++ b/doc/APIchanges
@@ -1,4 +1,4 @@
-The last version increases of all libraries were on 2023-02-09
+The last version increases of all libraries were on 2024-01-xx

  API changes, most recent first:

diff --git a/libavcodec/version.h b/libavcodec/version.h
index 0fae3d06d3..8c3d476003 100644
--- a/libavcodec/version.h
+++ b/libavcodec/version.h
@@ -29,7 +29,7 @@

  #include "version_major.h"

-#define LIBAVCODEC_VERSION_MINOR  38
+#define LIBAVCODEC_VERSION_MINOR   0
  #define LIBAVCODEC_VERSION_MICRO 100



should we use this bump opportunity to reset MICRO to 0 too?


It's an option. I don't recall if we decided anything about it last 
bump or during a meeting. And i don't know how much code out there 
still bothers to check for it to distinguish projects. But i guess 
that after so many bumps, any existing library user has long since 
stopped looking at it.


VLC 3 (which still is the latest stable version) still has got such 
checks around. VLC git master also still does have some checks, but only 
for deciding which "AVPROVIDER" to print in log messages, no function 
differences.

VLC 3 surely won't compile and link with current ffmpeg, right? Or did 
they port it to the decoupled input/output decoder and encoder API, and 
even the new channel layout API?


They do backport updates to ffmpeg to VLC 3 in general, although it seems 
that they're still pretty far behind (at ffmpeg 4.4.4) indeed.
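
For reference, the kind of check discussed above usually looks something like the sketch below. It assumes the long-standing convention that FFmpeg's libraries use micro versions starting at 100 while Libav used values below 100. The helper names are hypothetical, and real code would test LIBAVCODEC_VERSION_MICRO from libavcodec/version.h at compile time instead of taking a parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers, for illustration only. */

/* Assumed convention: micro >= 100 means the library comes from FFmpeg. */
static bool micro_indicates_ffmpeg(int micro)
{
    assert(micro >= 0);
    return micro >= 100;
}

/* Name to print in log messages, similar in spirit to an "AVPROVIDER" tag. */
static const char *provider_name(int micro)
{
    return micro_indicates_ffmpeg(micro) ? "FFmpeg" : "Libav";
}
```

Resetting MICRO to 0 would make any such check misidentify the library.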


// Martin

