Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

2019-12-04 Thread 徐鋆
Hi, chen

- Original Message -
> From: "chen" 
> To: "FFmpeg development discussions and patches" 
> Sent: Tuesday, December 3, 2019 4:59:06 PM
> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

> comments inline in code
> 
> 
> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote:
>>From: Xu Jun 
>>
>>+; void filter_column(uint8_t *dst, int height,
>>+; float rdiv, float bias, const int *const matrix,
>>+; const uint8_t *c[], int length, int radius,
>>+; int dstride, int stride);
>>+
>>+%if ARCH_X86_64
>>+INIT_XMM sse4
>>+%if UNIX64
>>+cglobal filter_column, 8, 15, 7, dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
>>+%else
>>+cglobal filter_column, 8, 15, 7, dst, height, rdiv, bias, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
> 
>>+%endif
> No idea; these are difficult to read and understand.

I will rename some variables to make this more readable. Should I also add some
comments here?
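
For readers following the thread, a plain C sketch of what the quoted assembly appears to compute may serve as that kind of note. It is reconstructed from the patch text in this thread, not copied from FFmpeg's C code; the names filter_column_ref and clip_u8 are hypothetical, `length` is assumed to be the row width in bytes, and height and length are assumed to be positive.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: saturate to 0..255, mirroring packssdw + packuswb. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar model of the quoted SSE4 routine (illustrative name, not FFmpeg's). */
static void filter_column_ref(uint8_t *dst, int height,
                              float rdiv, float bias, const int *matrix,
                              const uint8_t *c[], int length, int radius,
                              int dstride, int stride)
{
    const int taps = 2 * radius + 1;   /* the "2*radius+1" from the patch */

    assert(height > 0 && length > 0 && radius >= 0);

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < length; x++) {
            int sum = 0;
            for (int i = 0; i < taps; i++)
                sum += c[i][y * stride + x] * matrix[i];

            /* Same operation order as the asm: *rdiv, +bias, +0.5, truncate. */
            float f = (float)sum * rdiv;
            f += bias;
            f += 0.5f;
            dst[x] = clip_u8((int)f);
        }
        dst += dstride;                /* next output row */
    }
}
```

With a 3-tap matrix {1, 2, 1}, rdiv = 0.25 and input rows holding 10, 20 and 30 in one column, the weighted sum is 80 and the output byte is 20.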

> 
> 
> 
> 
>>+
>>+%if WIN64
>>+SWAP m0, m2
>>+SWAP m1, m3
>>+mov r2q, matrixmp
>>+mov r3q, ptrmp
>>+mov r4q, widthmp
>>+mov r5q, radmp
>>+mov r6q, dstridemp
>>+mov r7q, stridemp
>>+DEFINE_ARGS dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
>>+%endif
>>+
>>+movsxdifnidn widthq, widthd
>>+movsxdifnidn radq, radd
>>+movsxdifnidn dstrideq, dstrided
>>+movsxdifnidn strideq, strided
>>+sal radq, 1
> 
>>+add radq, 1 ;2*radius+1
> How does this compare to "LEA x,[y*2+1]"?
> And I do not want to get into a SAL versus SHL discussion.
> 

I think lea is better, and I will change it in the next version.

> 
>>+movsxdifnidn heightq, heightd
>>+VBROADCASTSS m0, m0
>>+VBROADCASTSS m1, m1
>>+pxor m6, m6
>>+movss m5, [half]
>>+VBROADCASTSS m5, m5
>>+
>>+xor dst_offq, dst_offq
>>+xor c_offq, c_offq
>>+
>>+.loopy:
>>+xor off16q, off16q
>>+cmp widthq, mmsize/4
>>+jl .loopr
>>+
>>+mov rq, widthq
>>+and rq, mmsize/4-1
>>+sub widthq, rq
>>+
> 
>>+.loop16: ;parallel process 16 elements in a row
> This processes 4 columns per loop. Do you mean to leave so many registers
> unused?
> We claim X64, so we have 16 XMM registers.

I will use more XMM registers and process 16 columns at a time.
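
A rough C model of the loop shape this implies, with a 16-entry accumulator array standing in for the XMM registers. This is only a sketch under stated assumptions: LANES, clip_u8 and filter_row_blocks are hypothetical names, it handles a single output row, and the real code would keep the accumulators in registers rather than in an array.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16   /* columns per iteration; models four XMM accumulators */

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* One output row: full 16-column blocks first, then a scalar remainder. */
static void filter_row_blocks(uint8_t *dst, int width,
                              float rdiv, float bias, const int *matrix,
                              const uint8_t *c[], int taps)
{
    int x = 0;

    assert(width > 0 && taps > 0);

    for (; x + LANES <= width; x += LANES) {
        int acc[LANES] = {0};

        for (int i = 0; i < taps; i++) {
            const int coef = matrix[i];       /* broadcast once per tap */
            for (int l = 0; l < LANES; l++)   /* models one 16-byte load */
                acc[l] += coef * c[i][x + l];
        }
        for (int l = 0; l < LANES; l++)
            dst[x + l] = clip_u8((int)((float)acc[l] * rdiv + bias + 0.5f));
    }

    for (; x < width; x++) {                  /* remainder columns */
        int sum = 0;
        for (int i = 0; i < taps; i++)
            sum += matrix[i] * c[i][x];
        dst[x] = clip_u8((int)((float)sum * rdiv + bias + 0.5f));
    }
}
```

For width = 19 this runs one 16-column block and then three scalar columns, which is the same split the patch makes with its .loop16 and .loopr labels.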

> 
> 
>>+pxor m4, m4
>>+xor iq, iq
>>+.loopi:
> 
>>+movss m2, [matrixq + 4*iq]
> I did not expect a float-domain instruction (movss) on integer data. We are
> lucky: Intel CPUs do not seem to have a penalty here.

I will change to the integer data path using movd.
Also, movd seems to have a lower CPI than movss.

> 
> 
>>+VBROADCASTSS m2, m2
>>+mov ciq, [ptrq + iq * gprsize]
>>+movss m3, [ciq + c_offq] ;c[i][y*stride + off16]
>>+punpcklbw m3, m6
> 
>>+punpcklwd m3, m6
> Since you claim SSE4, the PMOVZXBD instruction is available. Moreover, an SSE4
> register can hold 16 uint8 values, but you load only 4 of them.

I thought that since I multiply 4 ints at a time, loading 4 uint8s per loop was OK.
Now I know that reading 16 uint8s and shuffling them is faster.
I will change this in the next version.

> 
>>+pmulld m2, m3
>>+paddd m4, m2
>>+
>>+add iq, 1
> 
>>+cmp iq, radq
> If you initialize iq to radq and decrement it each iteration, you can save one
> instruction.
> I know iq works as the index in the loop, but we can find some trick there.

I will change this in the next version.
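
To make the suggestion concrete, here is the same tap loop written both ways in C. This is only a sketch of the indexing: tap_sum_up and tap_sum_down are hypothetical names, and whether a C compiler actually drops the compare is up to the compiler. In hand-written assembly the point is that the decrement itself sets the flags the branch tests, so the separate cmp goes away. The integer sum does not depend on the order of the taps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Up-counting form, as in the quoted .loopi: add, cmp, jl per iteration. */
static int tap_sum_up(const int *matrix, const uint8_t *const *c,
                      size_t off, int taps)
{
    int sum = 0;

    for (int i = 0; i < taps; i++)
        sum += matrix[i] * c[i][off];
    return sum;
}

/* Down-counting form: start at taps and stop when the index reaches zero. */
static int tap_sum_down(const int *matrix, const uint8_t *const *c,
                        size_t off, int taps)
{
    int sum = 0;
    int i = taps;

    assert(taps > 0);
    do {
        i--;                            /* index runs taps-1 .. 0 */
        sum += matrix[i] * c[i][off];
    } while (i > 0);
    return sum;
}
```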

>>+jl .loopi
>>+
>>+cvtdq2ps m4, m4
>>+mulps m4, m0 ; sum *= rdiv
>>+addps m4, m1 ; sum += bias
> 
>>+addps m4, m5 ; sum += 0.5
> I am not sure whether pre-computing (bias+0.5) would cause a precision mismatch.

After the discussion, this part may stay unmodified.

> 
> 
>>+cvttps2dq m4, m4
>>+packssdw m4, m4
>>+packuswb m4, m4
>>+movss [dstq + dst_offq], m4
>>+add c_offq, mmsize/4
>>+add dst_offq, mmsize/4
>>+
>>+add off16q, mmsize/4
>>+cmp off16q, widthq
>>+jl .loop16
>>+
>>+add widthq, rq
>>+cmp off16q, widthq
>>+jge .paraend
>>+
> 
>>+.loopr:
> I am not sure about this loop; if we can read beyond the end, we can reuse the SIMD code above.

Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

2019-12-04 Thread chen


At 2019-12-04 16:51:52, "Paul B Mahol"  wrote:
>On 12/4/19, Song, Ruiling  wrote:
>>> -Original Message-
>>> From: ffmpeg-devel  On Behalf Of
>>> chen

>>> >> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote:
>>> >> >From: Xu Jun 
>>> >[...]
>>> >> >+
>>> >> >+cvtdq2ps m4, m4
>>> >> >+mulps m4, m0 ; sum *= rdiv
>>> >> >+addps m4, m1 ; sum += bias
>>> >>
>>> >> >+addps m4, m5 ; sum += 0.5
>>> >> I don't know how about precision mismatch if we pre-compute (bias+0.5)
>>>
>>> >I think it is hard to prove it is safe to do pre-compute.
>>> Agree, I also worried precision issue since float operator is execute
>>> order
>>> dependent.
>>> How about ROUNDPS?

>> Seems no exactly match.
Funny. I guess it is another issue, such as a mistake in the instruction's imm field.
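
One possible reason for a mismatch, offered as an assumption rather than something confirmed in this thread: ROUNDPS with an immediate of 0 rounds to nearest with ties going to the even integer, while "add 0.5, then cvttps2dq" truncates toward zero after the add. The two disagree on some exact halves and on some negative inputs. The scalar C model below illustrates this; round_add_half_trunc and round_half_even are hypothetical names, and the second function is only meant for small magnitudes.

```c
#include <assert.h>

/* Model of the patch: add 0.5, then truncate toward zero (cvttps2dq). */
static int round_add_half_trunc(float x)
{
    return (int)(x + 0.5f);
}

/* Model of ROUNDPS with imm8 = 0: round to nearest, ties to even. */
static int round_half_even(float x)
{
    assert(x > -1.0e6f && x < 1.0e6f);  /* sketch is for small |x| only */

    int lo = (int)x;                    /* truncates toward zero */
    if ((float)lo > x)
        lo--;                           /* turn it into floor(x) */

    const float frac = x - (float)lo;   /* in [0, 1) */
    if (frac > 0.5f)
        return lo + 1;
    if (frac < 0.5f)
        return lo;
    return (lo & 1) ? lo + 1 : lo;      /* exact tie: pick the even one */
}
```

For x = 2.5 the first function gives 3 and the second gives 2; for x = -1.5 they give -1 and -2. For x = 3.5 both give 4, so the difference only shows up on part of the input range.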


>>> >> >+cvttps2dq m4, m4
>>> >> >+packssdw m4, m4
>>> >> >+packuswb m4, m4
>>> >> >+movss [dstq + dst_offq], m4
>>> >> >+add c_offq, mmsize/4
>>> >> >+add dst_offq, mmsize/4
>>> >> >+
>>> >> >+add off16q, mmsize/4
>>> >> >+cmp off16q, widthq
>>> >> >+jl .loop16
>>> >> >+
>>> >> >+add widthq, rq
>>> >> >+cmp off16q, widthq
>>> >> >+jge .paraend
>>> >> >+
>>> >>
>>> >> >+.loopr:
>>> >> no idea about this loop, if we can read beyond, we can reuse above
>>> >> SIMD
>>> >> code
>>> >Reuse above SIMD code may write to the memory that does not belong to
>>> this slice-thread.
>>>
>>> >IMO, the code to handle remainder columns is still necessary.
>>>
>>>
>>> Depends on algorithm & size,
>>> For example width=23
>>> Process #0 [0:15]
>>> Process #1 [7:22]
>>> Both of them is multiple of 16
>> Sounds interesting. But FFmpeg does not do like this now.
>> One question is will this get a penalty for writing to same address of
>> memory (both are writing to 7-15) from different threads?
>
>Yes, and even bad results may happen.

>
That is my fault, I was not clear: each "Process #x" is one step of the loop, not a thread.
I assume the function call is atomic, so threading can be arranged so that no two
threads work on the same address range.

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

2019-12-04 Thread Paul B Mahol
On 12/4/19, Song, Ruiling  wrote:
>> -Original Message-
>> From: ffmpeg-devel  On Behalf Of
>> chen
>> Sent: Wednesday, December 4, 2019 9:36 AM
>> To: FFmpeg development discussions and patches > de...@ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
>> SIMD for filter_column()
>>
>>
>>
>> At 2019-12-04 08:59:08, "Song, Ruiling"  wrote:
>> >> -Original Message-
>> >> From: ffmpeg-devel  On Behalf Of
>> >> chen
>> >> Sent: Tuesday, December 3, 2019 4:59 PM
>> >> To: FFmpeg development discussions and patches > >> de...@ffmpeg.org>
>> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add
>> >> X86
>> >> SIMD for filter_column()
>> >>
>> >> comments inline in code
>> >>
>> >>
>> >> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote:
>> >> >From: Xu Jun 
>> >[...]
>> >> >+
>> >> >+cvtdq2ps m4, m4
>> >> >+mulps m4, m0 ; sum *= rdiv
>> >> >+addps m4, m1 ; sum += bias
>> >>
>> >> >+addps m4, m5 ; sum += 0.5
>> >> I don't know how about precision mismatch if we pre-compute (bias+0.5)
>>
>> >I think it is hard to prove it is safe to do pre-compute.
>> Agree, I also worried precision issue since float operator is execute
>> order
>> dependent.
>> How about ROUNDPS?
> Seems no exactly match.
>>
>>
>> >
>> >>
>> >>
>> >> >+cvttps2dq m4, m4
>> >> >+packssdw m4, m4
>> >> >+packuswb m4, m4
>> >> >+movss [dstq + dst_offq], m4
>> >> >+add c_offq, mmsize/4
>> >> >+add dst_offq, mmsize/4
>> >> >+
>> >> >+add off16q, mmsize/4
>> >> >+cmp off16q, widthq
>> >> >+jl .loop16
>> >> >+
>> >> >+add widthq, rq
>> >> >+cmp off16q, widthq
>> >> >+jge .paraend
>> >> >+
>> >>
>> >> >+.loopr:
>> >> no idea about this loop, if we can read beyond, we can reuse above
>> >> SIMD
>> >> code
>> >Reuse above SIMD code may write to the memory that does not belong to
>> this slice-thread.
>>
>> >IMO, the code to handle remainder columns is still necessary.
>>
>>
>> Depends on algorithm & size,
>> For example width=23
>> Process #0 [0:15]
>> Process #1 [7:22]
>> Both of them is multiple of 16
> Sounds interesting. But FFmpeg does not do like this now.
> One question is will this get a penalty for writing to same address of
> memory (both are writing to 7-15) from different threads?

Yes, and it may even produce wrong results.

>
>>

Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

2019-12-03 Thread Song, Ruiling
> -Original Message-
> From: ffmpeg-devel  On Behalf Of
> chen
> Sent: Wednesday, December 4, 2019 9:36 AM
> To: FFmpeg development discussions and patches  de...@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
> SIMD for filter_column()
> 
> 
> 
> At 2019-12-04 08:59:08, "Song, Ruiling"  wrote:
> >> -Original Message-
> >> From: ffmpeg-devel  On Behalf Of
> >> chen
> >> Sent: Tuesday, December 3, 2019 4:59 PM
> >> To: FFmpeg development discussions and patches  >> de...@ffmpeg.org>
> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
> >> SIMD for filter_column()
> >>
> >> comments inline in code
> >>
> >>
> >> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote:
> >> >From: Xu Jun 
> >[...]
> >> >+
> >> >+cvtdq2ps m4, m4
> >> >+mulps m4, m0 ; sum *= rdiv
> >> >+addps m4, m1 ; sum += bias
> >>
> >> >+addps m4, m5 ; sum += 0.5
> >> I don't know how about precision mismatch if we pre-compute (bias+0.5)
> 
> >I think it is hard to prove it is safe to do pre-compute.
> Agree, I also worried precision issue since float operator is execute order
> dependent.
> How about ROUNDPS?
It does not seem to match exactly.
> 
> 
> >
> >>
> >>
> >> >+cvttps2dq m4, m4
> >> >+packssdw m4, m4
> >> >+packuswb m4, m4
> >> >+movss [dstq + dst_offq], m4
> >> >+add c_offq, mmsize/4
> >> >+add dst_offq, mmsize/4
> >> >+
> >> >+add off16q, mmsize/4
> >> >+cmp off16q, widthq
> >> >+jl .loop16
> >> >+
> >> >+add widthq, rq
> >> >+cmp off16q, widthq
> >> >+jge .paraend
> >> >+
> >>
> >> >+.loopr:
> >> no idea about this loop, if we can read beyond, we can reuse above SIMD
> >> code
> >Reuse above SIMD code may write to the memory that does not belong to
> this slice-thread.
> 
> >IMO, the code to handle remainder columns is still necessary.
> 
> 
> Depends on algorithm & size,
> For example width=23
> Process #0 [0:15]
> Process #1 [7:22]
> Both of them is multiple of 16
Sounds interesting, but FFmpeg does not do it this way now.
One question: will there be a penalty for writing to the same memory addresses
(both write to 7-15) from different threads?

> 

Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

2019-12-03 Thread chen


At 2019-12-04 08:59:08, "Song, Ruiling"  wrote:
>> -Original Message-
>> From: ffmpeg-devel  On Behalf Of
>> chen
>> Sent: Tuesday, December 3, 2019 4:59 PM
>> To: FFmpeg development discussions and patches > de...@ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
>> SIMD for filter_column()
>> 
>> comments inline in code
>> 
>> 
>> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote:
>> >From: Xu Jun 
>[...]
>> >+
>> >+cvtdq2ps m4, m4
>> >+mulps m4, m0 ; sum *= rdiv
>> >+addps m4, m1 ; sum += bias
>> 
>> >+addps m4, m5 ; sum += 0.5
>> I don't know how about precision mismatch if we pre-compute (bias+0.5)

>I think it is hard to prove it is safe to do pre-compute.
Agreed. I am also worried about precision, since the result of float operations
depends on the evaluation order.
How about ROUNDPS?


>
>> 
>> 
>> >+cvttps2dq m4, m4
>> >+packssdw m4, m4
>> >+packuswb m4, m4
>> >+movss [dstq + dst_offq], m4
>> >+add c_offq, mmsize/4
>> >+add dst_offq, mmsize/4
>> >+
>> >+add off16q, mmsize/4
>> >+cmp off16q, widthq
>> >+jl .loop16
>> >+
>> >+add widthq, rq
>> >+cmp off16q, widthq
>> >+jge .paraend
>> >+
>> 
>> >+.loopr:
>> no idea about this loop, if we can read beyond, we can reuse above SIMD
>> code
>Reuse above SIMD code may write to the memory that does not belong to this 
>slice-thread.

>IMO, the code to handle remainder columns is still necessary.


It depends on the algorithm and the size.
For example, with width=23:
Process #0 [0:15]
Process #1 [7:22]
Both of them cover a multiple of 16 columns.
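
A minimal C sketch of this overlapped-tail idea, under assumptions that are not stated in the thread: process_block is a hypothetical stand-in for the 16-wide SIMD body, the row is at least 16 columns wide, and dst does not alias src (otherwise the overlapping columns would be transformed twice).

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 16

/* Stand-in for the 16-wide SIMD body: any per-element op that reads only src. */
static void process_block(uint8_t *dst, const uint8_t *src)
{
    for (int l = 0; l < BLOCK; l++)
        dst[l] = (uint8_t)(255 - src[l]);
}

/* Processes one row with full blocks only; returns the number of blocks run. */
static int process_row_overlapped(uint8_t *dst, const uint8_t *src, int width)
{
    int x = 0, blocks = 0;

    assert(width >= BLOCK);             /* narrower rows need a scalar path */

    for (; x + BLOCK <= width; x += BLOCK, blocks++)
        process_block(dst + x, src + x);

    if (x < width) {                    /* tail: back up so the block ends at width */
        process_block(dst + width - BLOCK, src + width - BLOCK);
        blocks++;
    }
    return blocks;
}
```

With width = 23 this runs two blocks, [0:15] and [7:22]; columns 7 to 15 are written twice with the same values, and nothing past column 22 is touched.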


Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

2019-12-03 Thread Song, Ruiling
> -Original Message-
> From: ffmpeg-devel  On Behalf Of
> chen
> Sent: Tuesday, December 3, 2019 4:59 PM
> To: FFmpeg development discussions and patches  de...@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86
> SIMD for filter_column()
> 
> comments inline in code
> 
> 
> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote:
> >From: Xu Jun 
[...]
> >+
> >+cvtdq2ps m4, m4
> >+mulps m4, m0 ; sum *= rdiv
> >+addps m4, m1 ; sum += bias
> 
> >+addps m4, m5 ; sum += 0.5
> I don't know how about precision mismatch if we pre-compute (bias+0.5)
I think it is hard to prove that the pre-computation is safe.
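
As a small illustration of why this is hard to prove: float addition is not associative, so (x + bias) + 0.5 and x + (bias + 0.5) can round differently. The C sketch below uses hypothetical function names and an input magnitude chosen only because the effect is easy to see there; it does not show whether a mismatch occurs for values the filter actually produces.

```c
#include <assert.h>

/* Order used by the patch: (x + bias) + 0.5 */
static float add_in_order(float x, float bias)
{
    volatile float t = x + bias;    /* volatile forces rounding to float */
    t = t + 0.5f;
    return t;
}

/* Proposed shortcut: x + (bias + 0.5), with bias + 0.5 computed once. */
static float add_precomputed(float x, float bias)
{
    volatile float k = bias + 0.5f;
    volatile float t = x + k;
    return t;
}

/* Self-check with one pair of inputs where the two orders disagree. */
static void demo_order_dependence(void)
{
    const float x = 16777216.0f;    /* 2^24: neighbouring floats are 2 apart */

    assert(add_in_order(x, 1.0f)    == 16777216.0f);
    assert(add_precomputed(x, 1.0f) == 16777218.0f);
}
```

In the first order, x + 1 is an exact tie and rounds back to 2^24, and the following + 0.5 is lost as well; in the second order, x + 1.5 rounds up to 2^24 + 2.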

> 
> 
> >+cvttps2dq m4, m4
> >+packssdw m4, m4
> >+packuswb m4, m4
> >+movss [dstq + dst_offq], m4
> >+add c_offq, mmsize/4
> >+add dst_offq, mmsize/4
> >+
> >+add off16q, mmsize/4
> >+cmp off16q, widthq
> >+jl .loop16
> >+
> >+add widthq, rq
> >+cmp off16q, widthq
> >+jge .paraend
> >+
> 
> >+.loopr:
> no idea about this loop, if we can read beyond, we can reuse above SIMD
> code
Reusing the SIMD code above may write to memory that does not belong to this
slice-thread.
IMO, the code to handle the remainder columns is still necessary.

Ruiling
> 
> 
> >+xor sumd, sumd
> >+xor iq, iq
> >+.loopr_i:
> >+mov ciq, [ptrq + iq * gprsize]
> >+movzx rd, byte [ciq + c_offq]
> >+imul rd, [matrixq + 4*iq]
> >+add sumd, rd
> >+
> >+add iq, 1
> >+cmp iq, radq
> >+jl .loopr_i
> >+
> >+pxor m4, m4
> >+cvtsi2ss m4, sumd
> >+mulss m4, m0 ; sum *= rdiv
> >+addss m4, m1 ; sum += bias
> >+addss m4, m5 ; sum += 0.5
> >+cvttps2dq m4, m4
> >+packssdw m4, m4
> >+packuswb m4, m4
> >+movd sumd, m4
> >+mov [dstq + dst_offq], sumb
> >+add c_offq, 1
> >+add dst_offq, 1
> >+add off16q, 1
> >+cmp off16q, widthq
> >+jl .loopr
> >+
> >+.paraend:
> >+sub c_offq, widthq
> >+sub dst_offq, widthq
> >+add c_offq, strideq
> >+add dst_offq, dstrideq
> >+
> >+sub heightq, 1
> >+cmp heightq, 0
> >+jg .loopy
> >+
> >+.end:
> >+RET
> 

Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()

2019-12-03 Thread chen
comments inline in code


At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote:
>From: Xu Jun 
>
>+; void filter_column(uint8_t *dst, int height,
>+; float rdiv, float bias, const int *const matrix,
>+; const uint8_t *c[], int length, int radius,
>+; int dstride, int stride);
>+
>+%if ARCH_X86_64
>+INIT_XMM sse4
>+%if UNIX64
>+cglobal filter_column, 8, 15, 7, dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
>+%else
>+cglobal filter_column, 8, 15, 7, dst, height, rdiv, bias, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r

>+%endif
No idea; these are difficult to read and understand.




>+
>+%if WIN64
>+SWAP m0, m2
>+SWAP m1, m3
>+mov r2q, matrixmp
>+mov r3q, ptrmp
>+mov r4q, widthmp
>+mov r5q, radmp
>+mov r6q, dstridemp
>+mov r7q, stridemp
>+DEFINE_ARGS dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r
>+%endif
>+
>+movsxdifnidn widthq, widthd
>+movsxdifnidn radq, radd
>+movsxdifnidn dstrideq, dstrided
>+movsxdifnidn strideq, strided
>+sal radq, 1

>+add radq, 1 ;2*radius+1
How does this compare to "LEA x,[y*2+1]"?
And I do not want to get into a SAL versus SHL discussion.


>+movsxdifnidn heightq, heightd
>+VBROADCASTSS m0, m0
>+VBROADCASTSS m1, m1
>+pxor m6, m6
>+movss m5, [half]
>+VBROADCASTSS m5, m5
>+
>+xor dst_offq, dst_offq
>+xor c_offq, c_offq
>+
>+.loopy:
>+xor off16q, off16q
>+cmp widthq, mmsize/4
>+jl .loopr
>+
>+mov rq, widthq
>+and rq, mmsize/4-1
>+sub widthq, rq
>+

>+.loop16: ;parallel process 16 elements in a row
This processes 4 columns per loop. Do you mean to leave so many registers
unused?
We claim X64, so we have 16 XMM registers.


>+pxor m4, m4
>+xor iq, iq
>+.loopi:

>+movss m2, [matrixq + 4*iq]
I did not expect a float-domain instruction (movss) on integer data. We are
lucky: Intel CPUs do not seem to have a penalty here.


>+VBROADCASTSS m2, m2
>+mov ciq, [ptrq + iq * gprsize]
>+movss m3, [ciq + c_offq] ;c[i][y*stride + off16]
>+punpcklbw m3, m6

>+punpcklwd m3, m6
Since you claim SSE4, the PMOVZXBD instruction is available. Moreover, an SSE4
register can hold 16 uint8 values, but you load only 4 of them.


>+pmulld m2, m3
>+paddd m4, m2
>+
>+add iq, 1

>+cmp iq, radq
If you initialize iq to radq and decrement it each iteration, you can save one
instruction.
I know iq works as the index in the loop, but we can find some trick there.
>+jl .loopi
>+
>+cvtdq2ps m4, m4
>+mulps m4, m0 ; sum *= rdiv
>+addps m4, m1 ; sum += bias

>+addps m4, m5 ; sum += 0.5
I am not sure whether pre-computing (bias+0.5) would cause a precision mismatch.


>+cvttps2dq m4, m4
>+packssdw m4, m4
>+packuswb m4, m4
>+movss [dstq + dst_offq], m4
>+add c_offq, mmsize/4
>+add dst_offq, mmsize/4
>+
>+add off16q, mmsize/4
>+cmp off16q, widthq
>+jl .loop16
>+
>+add widthq, rq
>+cmp off16q, widthq
>+jge .paraend
>+

>+.loopr:
I am not sure about this loop; if we can read beyond the end, we can reuse the SIMD code above.


>+xor sumd, sumd
>+xor iq, iq
>+.loopr_i:
>+mov ciq, [ptrq + iq * gprsize]
>+movzx rd, byte [ciq + c_offq]
>+imul rd, [matrixq + 4*iq]
>+add sumd, rd
>+
>+add iq, 1
>+cmp iq, radq
>+jl .loopr_i
>+
>+pxor m4, m4
>+cvtsi2ss m4, sumd
>+mulss m4, m0 ; sum *= rdiv
>+addss m4, m1 ; sum += bias
>+addss m4, m5 ; sum += 0.5
>+cvttps2dq m4, m4
>+packssdw m4, m4
>+packuswb m4, m4
>+movd sumd, m4
>+mov [dstq + dst_offq], sumb
>+add c_offq, 1
>+add dst_offq, 1
>+add off16q, 1
>+cmp off16q, widthq
>+jl .loopr
>+
>+.paraend:
>+sub c_offq, widthq
>+sub dst_offq, widthq
>+add c_offq, strideq
>+add dst_offq, dstrideq
>+
>+sub heightq, 1
>+cmp heightq, 0
>+jg .loopy
>+
>+.end:
>+RET
