Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-16 Thread flow gg
Okay, I've updated it in this reply.

Rémi Denis-Courmont wrote on Wednesday, 17 January 2024, 02:04:
> +vsetvli t0, a2, e8, m2, tu, ma
> +vle8.v v0, (a0)
> +sub a2, a2, t0
> +vsetvli zero, t0, e16, m4, tu, ma
> +vle16.v v8, (a1)
> +vsetvli

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-16 Thread Rémi Denis-Courmont
On Sunday, 7 January 2024, 10:36:23 EET, flow gg wrote:
> Alright, I learned a bit more, so should we not consider the internal
> implementation?

You asked what the reason was for your counter-intuitive observations, and I provided a plausible hypothesis. Nothing more, nothing less.

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-16 Thread Rémi Denis-Courmont
+vsetvli t0, a2, e8, m2, tu, ma
+vle8.v v0, (a0)
+sub a2, a2, t0
+vsetvli zero, t0, e16, m4, tu, ma
+vle16.v v8, (a1)
+vsetvli zero, t0, e8, m2, tu, ma
+vwsub.wv v16, v8, v0
+vsetvli zero,
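The loop above is strip-mined: `vsetvli t0, a2, ...` chooses how many elements (`t0`) to handle from the remaining count in `a2`, and `sub a2, a2, t0` consumes them. A minimal scalar C sketch of that structure follows, assuming `a0`/`a1`/`a2` hold `pix1`/`pix2`/`size` and using a hypothetical `vlmax` parameter for the hardware limit. The squaring and accumulation are assumptions, since the quoted snippet is cut off after `vwsub.wv`, and the sketch sums into one scalar rather than per-lane accumulators followed by a reduction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of vsetvli: vl = min(avl, VLMAX). vlmax is a hypothetical limit. */
static size_t model_vsetvli(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}

/* Scalar model of the strip-mined loop (not the actual patch). */
static int ssd_strip_mined(const int8_t *pix1, const int16_t *pix2,
                           size_t size, size_t vlmax)
{
    int32_t acc = 0;
    assert(vlmax > 0);
    while (size > 0) {
        size_t vl = model_vsetvli(size, vlmax); /* vsetvli t0, a2, ... */
        for (size_t i = 0; i < vl; i++) {
            /* like vwsub.wv: 16-bit pix2 minus sign-extended 8-bit pix1 */
            int16_t d = (int16_t)(pix2[i] - pix1[i]);
            acc += (int32_t)d * d; /* assumed widening multiply-accumulate */
        }
        pix1 += vl; /* a0 would advance by vl bytes */
        pix2 += vl; /* a1 would advance by 2 * vl bytes */
        size -= vl; /* sub a2, a2, t0 */
    }
    return acc;
}
```

Because `vl` shrinks on the last iteration, any `size` is handled, including ones that are not multiples of `vlmax`.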

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-07 Thread flow gg
Alright, I learned a bit more, so should we not consider the internal implementation? I've added a version in this reply that uses one fewer vset.

Rémi Denis-Courmont wrote on Sunday, 7 January 2024, 16:03:
> On Sunday, 7 January 2024, 3:33:39 EET, flow gg wrote:
> > I tested it, and indeed using vwsub

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-07 Thread Rémi Denis-Courmont
On Sunday, 7 January 2024, 3:33:39 EET, flow gg wrote:
> I tested it, and indeed using vwsub is faster. Updated it in the reply.
>
> ---
>
> I have a question: if I tweak the load order a bit, using one less vset, it
> leads to being slower (the patch I submitted is 13.2, if I make the

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-06 Thread flow gg
I tested it, and indeed using vwsub is faster. Updated it in the reply.

---

I have a question: if I tweak the load order a bit, using one less vset, it leads to being slower (the patch I submitted is 13.2, if I make the following change, the time would be 15.2). But I thought it would be faster.

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-06 Thread Rémi Denis-Courmont
On Friday, 5 January 2024, 2:56:18 EET, flow gg wrote:
> One vset can be reduced, but vwsub should not be used in this case. I
> modified it in this reply.

Fair enough, but are you sure that that's faster than keeping the vsetvli and removing the sign extension?

> Rémi Denis-Courmont

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-04 Thread flow gg
One vset can be reduced, but vwsub should not be used in this case. I modified it in this reply.

Rémi Denis-Courmont wrote on Friday, 5 January 2024, 00:00:
> On Saturday, 30 December 2023, 18:20:15 EET, flow gg wrote:
> > I mistook it, seeing the vector length as the length of the vector
> > register
> > ..

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2024-01-04 Thread Rémi Denis-Courmont
On Saturday, 30 December 2023, 18:20:15 EET, flow gg wrote:
> I mistook it, seeing the vector length as the length of the vector register
> ..
> I have modified it in this reply.

Setting element size to 8-bit is unnecessary, and a widening subtraction can presumably avoid the sign
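The suggestion can be checked element by element. An explicit sign extension (for example `vsext.vf2`) followed by a 16-bit `vsub.vv`, and a single `vwsub.wv` executed at SEW=8, should yield the same 16-bit differences: both sign-extend the 8-bit operand and wrap modulo 2^16. The C sketch below models the two sequences on plain arrays; the function names and the 64-element scratch buffer are hypothetical stand-ins for vector register groups, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wrap to 16 bits, as RVV integer arithmetic does for 16-bit results. */
static int16_t wrap16(int32_t x)
{
    int32_t m = ((x % 65536) + 65536) % 65536; /* 0 .. 65535 */
    return (int16_t)(m >= 32768 ? m - 65536 : m);
}

/* Two steps: sign-extend pix1 (like vsext.vf2), then subtract at 16 bits
 * (like vsub.vv with SEW=16). */
static void diff_sext_then_sub(const int8_t *pix1, const int16_t *pix2,
                               int16_t *out, size_t n)
{
    int16_t ext[64]; /* hypothetical stand-in for a register group */
    assert(n <= 64);
    for (size_t i = 0; i < n; i++)
        ext[i] = pix1[i]; /* value-preserving sign extension 8 -> 16 */
    for (size_t i = 0; i < n; i++)
        out[i] = wrap16((int32_t)pix2[i] - ext[i]);
}

/* One step: widening subtraction (like vwsub.wv with SEW=8); the narrow
 * operand is sign-extended as part of the same operation. */
static void diff_widening_sub(const int8_t *pix1, const int16_t *pix2,
                              int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = wrap16((int32_t)pix2[i] - pix1[i]);
}
```

Since the results match, the widening form saves the separate extension instruction; in the quoted loop this is why `vwsub.wv v16, v8, v0` runs under an `e8, m2` vtype while producing a 16-bit `m4` result.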

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2023-12-30 Thread flow gg
I mistook it, seeing the vector length as the length of the vector register .. I have modified it in this reply.

Rémi Denis-Courmont wrote on Saturday, 30 December 2023, 20:15:
> On 29 December 2023 12:57:20 GMT+01:00, flow gg wrote:
> >C908
> >ssd_int8_vs_int16_c: 207.7
> >ssd_int8_vs_int16_rvv_i32:

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2023-12-30 Thread Rémi Denis-Courmont
On 30 December 2023 15:00:53 GMT+01:00, flow gg wrote:
>> At a quick glance, it won't work if the input length is not a multiple of
>> the vector length.
>
>Why?

You're not handling tails as far as I can see.

> I tried 1024, 32*3, 32*7 and all passed the test.

They're all multiples of the
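The sizes tried (1024, 32*3, 32*7) cannot expose a tail bug if the loop always steps by 32. One plausible reading of the mistake flow gg later describes (treating the vector length as the register length) is a loop that assumes every iteration handles a full strip. The sketch below compares how many elements such a fixed-step loop would touch with a `vsetvli`-style loop that takes `vl = min(remaining, VLMAX)`. `VLMAX = 32` and both function names are hypothetical, chosen only to match the tested sizes.

```c
#include <assert.h>
#include <stddef.h>

#define VLMAX 32 /* hypothetical number of elements per full strip */
static_assert(VLMAX > 0, "VLMAX must be positive");

/* Fixed step: always advances by a full strip, so the last iteration
 * may run past the end of the buffer (out-of-bounds access). */
static size_t touched_fixed_step(size_t size)
{
    size_t done = 0;
    while (done < size)
        done += VLMAX;
    return done;
}

/* Strip-mined: the last iteration takes only the remaining elements. */
static size_t processed_strip_mined(size_t size)
{
    size_t done = 0;
    while (done < size) {
        size_t rem = size - done;
        done += rem < VLMAX ? rem : VLMAX;
    }
    return done;
}
```

For 1024, 96 and 224 both loops agree, which is why those tests pass; for a size such as 100, the fixed-step loop touches 128 elements.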

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2023-12-30 Thread flow gg
flow gg wrote on Saturday, 30 December 2023, 22:00:
> > At a quick glance, it won't work if the input length is not a multiple
> > of the vector length.
>
> Why? I tried 1024, 32*3, 32*7 and all passed the test.
>
> > Also do you really need to extend accumulators to 32 bits?
>
> It won't overflow after the test is

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2023-12-30 Thread flow gg
> At a quick glance, it won't work if the input length is not a multiple of the vector length.

Why? I tried 1024, 32*3, 32*7 and all passed the test.

> Also do you really need to extend accumulators to 32 bits?

It won't overflow after the test is changed, so it's not needed anymore. I have
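Whether a narrower accumulator is safe comes down to a worst-case bound: with `n` elements and a maximum absolute difference `D`, the sum of squares is at most n·D². The C helpers below evaluate that bound for a given signed width. The ranges in the example (both inputs confined to the int8 range, so D = 255, with n = 1024) are illustrative assumptions only; the thread does not say what the changed test actually constrains.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on a sum of n squared differences, each |diff| <= max_diff. */
static uint64_t worst_case_ssd(uint64_t n, uint64_t max_diff)
{
    return n * max_diff * max_diff;
}

/* Nonzero if value fits in a signed two's-complement integer of `bits` bits. */
static int fits_signed(uint64_t value, unsigned bits)
{
    assert(bits >= 2 && bits <= 63);
    return value <= (UINT64_C(1) << (bits - 1)) - 1;
}
```

Under the assumed int8-range inputs, the 1024-element bound is 66,585,600, which fits in 32 bits but not in 16; with unrestricted int16 input the same bound no longer fits in 32 bits.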

Re: [FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2023-12-30 Thread Rémi Denis-Courmont
On 29 December 2023 12:57:20 GMT+01:00, flow gg wrote:
>C908
>ssd_int8_vs_int16_c: 207.7
>ssd_int8_vs_int16_rvv_i32: 28.0

At a quick glance, it won't work if the input length is not a multiple of the vector length.

Also do you really need to extend accumulators to 32 bits?

[FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

2023-12-29 Thread flow gg
C908
ssd_int8_vs_int16_c: 207.7
ssd_int8_vs_int16_rvv_i32: 28.0

From 0fd1b7a34ab8794868d80233c35f70c8ad42b9fa Mon Sep 17 00:00:00 2001
From: sunyuechi
Date: Fri, 29 Dec 2023 13:27:31 +0800
Subject: [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16

C908
ssd_int8_vs_int16_c: 207.7
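The benchmark compares the RVV routine against a scalar C function computing the sum of squared differences between an int8 array and an int16 array; by the figures above, the vector version is roughly 207.7 / 28.0 ≈ 7.4 times faster on the C908. A minimal sketch of such a scalar reference follows. The name `ssd_int8_vs_int16_ref` is hypothetical, and the `(const int8_t *, const int16_t *, intptr_t)` parameter list is an assumption modeled on FFmpeg's DSP prototype, not a copy of it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference: returns sum of (pix1[i] - pix2[i])^2. */
static int ssd_int8_vs_int16_ref(const int8_t *pix1, const int16_t *pix2,
                                 intptr_t size)
{
    int score = 0;
    assert(size >= 0);
    for (intptr_t i = 0; i < size; i++) {
        int d = pix1[i] - pix2[i]; /* computed in int, so no 8/16-bit wrap */
        score += d * d;            /* can overflow int for long, extreme inputs */
    }
    return score;
}
```

A vector implementation is expected to return the same value as this loop for every valid input, which is what checkasm compares before timing.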