Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
Hi, chen - 原始邮件 - > 发件人: "chen" > 收件人: "FFmpeg development discussions and patches" > 发送时间: 星期二, 2019年 12 月 03日 下午 4:59:06 > 主题: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for > filter_column() > comments inline in code > > > At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: >>From: Xu Jun >> >>+; void filter_column(uint8_t *dst, int height, >>+; float rdiv, float bias, const int *const matrix, >>+; const uint8_t *c[], int length, int radius, >>+; int dstride, int stride); >>+ >>+%if ARCH_X86_64 >>+INIT_XMM sse4 >>+%if UNIX64 >>+cglobal filter_column, 8, 15, 7, dst, height, matrix, ptr, width, rad, >>dstride, >>stride, i, ci, dst_off, off16, c_off, sum, r >>+%else >>+cglobal filter_column, 8, 15, 7, dst, height, rdiv, bias, matrix, ptr, width, >>rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r > >>+%endif > no idea, these are difficult to read and understand I will rename some variables to make it more readable. Do I need to add some notes here? > > > > >>+ >>+%if WIN64 >>+SWAP m0, m2 >>+SWAP m1, m3 >>+mov r2q, matrixmp >>+mov r3q, ptrmp >>+mov r4q, widthmp >>+mov r5q, radmp >>+mov r6q, dstridemp >>+mov r7q, stridemp >>+DEFINE_ARGS dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, >>dst_off, off16, c_off, sum, r >>+%endif >>+ >>+movsxdifnidn widthq, widthd >>+movsxdifnidn radq, radd >>+movsxdifnidn dstrideq, dstrided >>+movsxdifnidn strideq, strided >>+sal radq, 1 > >>+add radq, 1 ;2*radius+1 > I don't know how about compare to "LEA x,[y*2+1]" > AndI want not discuss in between SAL and SHL > I think lea is better and I will change in the next version. > >>+movsxdifnidn heightq, heightd >>+VBROADCASTSS m0, m0 >>+VBROADCASTSS m1, m1 >>+pxor m6, m6 >>+movss m5, [half] >>+VBROADCASTSS m5, m5 >>+ >>+xor dst_offq, dst_offq >>+xor c_offq, c_offq >>+ >>+.loopy: >>+xor off16q, off16q >>+cmp widthq, mmsize/4 >>+jl .loopr >>+ >>+mov rq, widthq >>+and rq, mmsize/4-1 >>+sub widthq, rq >>+ > >>+.loop16: ;parallel process 16 elements in a row > Processing 4 column per loop, are you means, we want to save lots of unused > register? > We claim X64, so we have 16 of XMMs Will use more XMMs and process 16 column at a time. > > >>+pxor m4, m4 >>+xor iq, iq >>+.loopi: > >>+movss m2, [matrixq + 4*iq] > no idea that you working on Float data path, we are lucky, Intel CPU sounds > not > penalty in here. Will change to Interger data path using movd. And movd seems to have less CPI than movss. > > >>+VBROADCASTSS m2, m2 >>+mov ciq, [ptrq + iq * gprsize] >>+movss m3, [ciq + c_offq] ;c[i][y*stride + off16] >>+punpcklbw m3, m6 > >>+punpcklwd m3, m6 > Since you claim SSE4, the instruction PMOVZXBD available, moreover, SSE4 > register can be full fill 16 of uint8, but load 4 of them only. I thought that since I would multiply 4 ints, loading 4 uint8s per loop is OK. Now I know that read 16 uint8s and shuffle them is faster. Will change in next version. > >>+pmulld m2, m3 >>+paddd m4, m2 >>+ >>+add iq, 1 > >>+cmp iq, radq > When you initial iq to radq and decrement per loop, you can reduce one > instruction > I know iq is work as index in the loop, but we can found some trick over > there. Will change in next V. >>+jl .loopi >>+ >>+cvtdq2ps m4, m4 >>+mulps m4, m0 ; sum *= rdiv >>+addps m4, m1 ; sum += bias > >>+addps m4, m5 ; sum += 0.5 > I don't know how about precision mismatch if we pre-compute (bias+0.5) Here may not be modified after discussions. > > >>+cvttps2dq m4, m4 >>+packssdw m4, m4 >>+packuswb m4, m4 >>+movss [dstq + dst_offq], m4 >>+add c_offq, mmsize/4 >>+add dst_offq, mmsize/4 >>+ >>+add off16q, mmsize/4 >>+cmp off16q, widthq >>+jl .loop16 >>+ >>+add widthq, rq >>+cmp off16q, widthq >>+jge .paraend >>+ > >>+.loopr: > no idea about this loop, if we can read beyond, we can reu
Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
At 2019-12-04 16:51:52, "Paul B Mahol" wrote: >On 12/4/19, Song, Ruiling wrote: >>> -Original Message- >>> From: ffmpeg-devel On Behalf Of >>> chen >>> >> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: >>> >> >From: Xu Jun >>> >[...] >>> >> >+ >>> >> >+cvtdq2ps m4, m4 >>> >> >+mulps m4, m0 ; sum *= rdiv >>> >> >+addps m4, m1 ; sum += bias >>> >> >>> >> >+addps m4, m5 ; sum += 0.5 >>> >> I don't know how about precision mismatch if we pre-compute (bias+0.5) >>> >>> >I think it is hard to prove it is safe to do pre-compute. >>> Agree, I also worried precision issue since float operator is execute >>> order >>> dependent. >>> How about ROUNDPS? >> Seems no exactly match. Funny, I guess it is other issue, such as mistake on instruction's imm field. >>> >> >+cvttps2dq m4, m4 >>> >> >+packssdw m4, m4 >>> >> >+packuswb m4, m4 >>> >> >+movss [dstq + dst_offq], m4 >>> >> >+add c_offq, mmsize/4 >>> >> >+add dst_offq, mmsize/4 >>> >> >+ >>> >> >+add off16q, mmsize/4 >>> >> >+cmp off16q, widthq >>> >> >+jl .loop16 >>> >> >+ >>> >> >+add widthq, rq >>> >> >+cmp off16q, widthq >>> >> >+jge .paraend >>> >> >+ >>> >> >>> >> >+.loopr: >>> >> no idea about this loop, if we can read beyond, we can reuse above >>> >> SIMD >>> >> code >>> >Reuse above SIMD code may write to the memory that does not belong to >>> this slice-thread. >>> >>> >IMO, the code to handle remainder columns is still necessary. >>> >>> >>> Depends on algorithm & size, >>> For example width=23 >>> Process #0 [0:15] >>> Process #1 [7:22] >>> Both of them is multiple of 16 >> Sounds interesting. But FFmpeg does not do like this now. >> One question is will this get a penalty for writing to same address of >> memory (both are writing to 7-15) from different threads? > >Yes, and even bad results may happen. > This is my problem, I don't speak clean, the "Process #x" is one step of loops, I guess the function must be atomic, we can place any threading that work on same address area. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
On 12/4/19, Song, Ruiling wrote: >> -Original Message- >> From: ffmpeg-devel On Behalf Of >> chen >> Sent: Wednesday, December 4, 2019 9:36 AM >> To: FFmpeg development discussions and patches > de...@ffmpeg.org> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 >> SIMD for filter_column() >> >> >> >> At 2019-12-04 08:59:08, "Song, Ruiling" wrote: >> >> -Original Message- >> >> From: ffmpeg-devel On Behalf Of >> >> chen >> >> Sent: Tuesday, December 3, 2019 4:59 PM >> >> To: FFmpeg development discussions and patches > >> de...@ffmpeg.org> >> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add >> >> X86 >> >> SIMD for filter_column() >> >> >> >> comments inline in code >> >> >> >> >> >> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: >> >> >From: Xu Jun >> >[...] >> >> >+ >> >> >+cvtdq2ps m4, m4 >> >> >+mulps m4, m0 ; sum *= rdiv >> >> >+addps m4, m1 ; sum += bias >> >> >> >> >+addps m4, m5 ; sum += 0.5 >> >> I don't know how about precision mismatch if we pre-compute (bias+0.5) >> >> >I think it is hard to prove it is safe to do pre-compute. >> Agree, I also worried precision issue since float operator is execute >> order >> dependent. >> How about ROUNDPS? > Seems no exactly match. >> >> >> > >> >> >> >> >> >> >+cvttps2dq m4, m4 >> >> >+packssdw m4, m4 >> >> >+packuswb m4, m4 >> >> >+movss [dstq + dst_offq], m4 >> >> >+add c_offq, mmsize/4 >> >> >+add dst_offq, mmsize/4 >> >> >+ >> >> >+add off16q, mmsize/4 >> >> >+cmp off16q, widthq >> >> >+jl .loop16 >> >> >+ >> >> >+add widthq, rq >> >> >+cmp off16q, widthq >> >> >+jge .paraend >> >> >+ >> >> >> >> >+.loopr: >> >> no idea about this loop, if we can read beyond, we can reuse above >> >> SIMD >> >> code >> >Reuse above SIMD code may write to the memory that does not belong to >> this slice-thread. >> >> >IMO, the code to handle remainder columns is still necessary. >> >> >> Depends on algorithm & size, >> For example width=23 >> Process #0 [0:15] >> Process #1 [7:22] >> Both of them is multiple of 16 > Sounds interesting. But FFmpeg does not do like this now. > One question is will this get a penalty for writing to same address of > memory (both are writing to 7-15) from different threads? Yes, and even bad results may happen. > >> >> ___ >> ffmpeg-devel mailing list >> ffmpeg-devel@ffmpeg.org >> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel >> >> To unsubscribe, visit link above, or email >> ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe". > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe". ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
> -Original Message- > From: ffmpeg-devel On Behalf Of > chen > Sent: Wednesday, December 4, 2019 9:36 AM > To: FFmpeg development discussions and patches de...@ffmpeg.org> > Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 > SIMD for filter_column() > > > > At 2019-12-04 08:59:08, "Song, Ruiling" wrote: > >> -Original Message- > >> From: ffmpeg-devel On Behalf Of > >> chen > >> Sent: Tuesday, December 3, 2019 4:59 PM > >> To: FFmpeg development discussions and patches >> de...@ffmpeg.org> > >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 > >> SIMD for filter_column() > >> > >> comments inline in code > >> > >> > >> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: > >> >From: Xu Jun > >[...] > >> >+ > >> >+cvtdq2ps m4, m4 > >> >+mulps m4, m0 ; sum *= rdiv > >> >+addps m4, m1 ; sum += bias > >> > >> >+addps m4, m5 ; sum += 0.5 > >> I don't know how about precision mismatch if we pre-compute (bias+0.5) > > >I think it is hard to prove it is safe to do pre-compute. > Agree, I also worried precision issue since float operator is execute order > dependent. > How about ROUNDPS? Seems no exactly match. > > > > > >> > >> > >> >+cvttps2dq m4, m4 > >> >+packssdw m4, m4 > >> >+packuswb m4, m4 > >> >+movss [dstq + dst_offq], m4 > >> >+add c_offq, mmsize/4 > >> >+add dst_offq, mmsize/4 > >> >+ > >> >+add off16q, mmsize/4 > >> >+cmp off16q, widthq > >> >+jl .loop16 > >> >+ > >> >+add widthq, rq > >> >+cmp off16q, widthq > >> >+jge .paraend > >> >+ > >> > >> >+.loopr: > >> no idea about this loop, if we can read beyond, we can reuse above SIMD > >> code > >Reuse above SIMD code may write to the memory that does not belong to > this slice-thread. > > >IMO, the code to handle remainder columns is still necessary. > > > Depends on algorithm & size, > For example width=23 > Process #0 [0:15] > Process #1 [7:22] > Both of them is multiple of 16 Sounds interesting. But FFmpeg does not do like this now. One question is will this get a penalty for writing to same address of memory (both are writing to 7-15) from different threads? > > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe". ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
At 2019-12-04 08:59:08, "Song, Ruiling" wrote: >> -Original Message- >> From: ffmpeg-devel On Behalf Of >> chen >> Sent: Tuesday, December 3, 2019 4:59 PM >> To: FFmpeg development discussions and patches > de...@ffmpeg.org> >> Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 >> SIMD for filter_column() >> >> comments inline in code >> >> >> At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: >> >From: Xu Jun >[...] >> >+ >> >+cvtdq2ps m4, m4 >> >+mulps m4, m0 ; sum *= rdiv >> >+addps m4, m1 ; sum += bias >> >> >+addps m4, m5 ; sum += 0.5 >> I don't know how about precision mismatch if we pre-compute (bias+0.5) >I think it is hard to prove it is safe to do pre-compute. Agree, I also worried precision issue since float operator is execute order dependent. How about ROUNDPS? > >> >> >> >+cvttps2dq m4, m4 >> >+packssdw m4, m4 >> >+packuswb m4, m4 >> >+movss [dstq + dst_offq], m4 >> >+add c_offq, mmsize/4 >> >+add dst_offq, mmsize/4 >> >+ >> >+add off16q, mmsize/4 >> >+cmp off16q, widthq >> >+jl .loop16 >> >+ >> >+add widthq, rq >> >+cmp off16q, widthq >> >+jge .paraend >> >+ >> >> >+.loopr: >> no idea about this loop, if we can read beyond, we can reuse above SIMD >> code >Reuse above SIMD code may write to the memory that does not belong to this >slice-thread. >IMO, the code to handle remainder columns is still necessary. Depends on algorithm & size, For example width=23 Process #0 [0:15] Process #1 [7:22] Both of them is multiple of 16 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
> -Original Message- > From: ffmpeg-devel On Behalf Of > chen > Sent: Tuesday, December 3, 2019 4:59 PM > To: FFmpeg development discussions and patches de...@ffmpeg.org> > Subject: Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 > SIMD for filter_column() > > comments inline in code > > > At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: > >From: Xu Jun [...] > >+ > >+cvtdq2ps m4, m4 > >+mulps m4, m0 ; sum *= rdiv > >+addps m4, m1 ; sum += bias > > >+addps m4, m5 ; sum += 0.5 > I don't know how about precision mismatch if we pre-compute (bias+0.5) I think it is hard to prove it is safe to do pre-compute. > > > >+cvttps2dq m4, m4 > >+packssdw m4, m4 > >+packuswb m4, m4 > >+movss [dstq + dst_offq], m4 > >+add c_offq, mmsize/4 > >+add dst_offq, mmsize/4 > >+ > >+add off16q, mmsize/4 > >+cmp off16q, widthq > >+jl .loop16 > >+ > >+add widthq, rq > >+cmp off16q, widthq > >+jge .paraend > >+ > > >+.loopr: > no idea about this loop, if we can read beyond, we can reuse above SIMD > code Reuse above SIMD code may write to the memory that does not belong to this slice-thread. IMO, the code to handle remainder columns is still necessary. Ruiling > > > >+xor sumd, sumd > >+xor iq, iq > >+.loopr_i: > >+mov ciq, [ptrq + iq * gprsize] > >+movzx rd, byte [ciq + c_offq] > >+imul rd, [matrixq + 4*iq] > >+add sumd, rd > >+ > >+add iq, 1 > >+cmp iq, radq > >+jl .loopr_i > >+ > >+pxor m4, m4 > >+cvtsi2ss m4, sumd > >+mulss m4, m0 ; sum *= rdiv > >+addss m4, m1 ; sum += bias > >+addss m4, m5 ; sum += 0.5 > >+cvttps2dq m4, m4 > >+packssdw m4, m4 > >+packuswb m4, m4 > >+movd sumd, m4 > >+mov [dstq + dst_offq], sumb > >+add c_offq, 1 > >+add dst_offq, 1 > >+add off16q, 1 > >+cmp off16q, widthq > >+jl .loopr > >+ > >+.paraend: > >+sub c_offq, widthq > >+sub dst_offq, widthq > >+add c_offq, strideq > >+add dst_offq, dstrideq > >+ > >+sub heightq, 1 > >+cmp heightq, 0 > >+jg .loopy > >+ > >+.end: > >+RET > > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe". ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH 3/3] avfilter/vf_convolution: add X86 SIMD for filter_column()
comments inline in code At 2019-12-03 15:52:07, xuju...@sjtu.edu.cn wrote: >From: Xu Jun > >+; void filter_column(uint8_t *dst, int height, >+; float rdiv, float bias, const int *const matrix, >+; const uint8_t *c[], int length, int radius, >+; int dstride, int stride); >+ >+%if ARCH_X86_64 >+INIT_XMM sse4 >+%if UNIX64 >+cglobal filter_column, 8, 15, 7, dst, height, matrix, ptr, width, rad, >dstride, stride, i, ci, dst_off, off16, c_off, sum, r >+%else >+cglobal filter_column, 8, 15, 7, dst, height, rdiv, bias, matrix, ptr, width, >rad, dstride, stride, i, ci, dst_off, off16, c_off, sum, r >+%endif no idea, these are difficult to read and understand >+ >+%if WIN64 >+SWAP m0, m2 >+SWAP m1, m3 >+mov r2q, matrixmp >+mov r3q, ptrmp >+mov r4q, widthmp >+mov r5q, radmp >+mov r6q, dstridemp >+mov r7q, stridemp >+DEFINE_ARGS dst, height, matrix, ptr, width, rad, dstride, stride, i, ci, >dst_off, off16, c_off, sum, r >+%endif >+ >+movsxdifnidn widthq, widthd >+movsxdifnidn radq, radd >+movsxdifnidn dstrideq, dstrided >+movsxdifnidn strideq, strided >+sal radq, 1 >+add radq, 1 ;2*radius+1 I don't know how about compare to "LEA x,[y*2+1]" AndI want not discuss in between SAL and SHL >+movsxdifnidn heightq, heightd >+VBROADCASTSS m0, m0 >+VBROADCASTSS m1, m1 >+pxor m6, m6 >+movss m5, [half] >+VBROADCASTSS m5, m5 >+ >+xor dst_offq, dst_offq >+xor c_offq, c_offq >+ >+.loopy: >+xor off16q, off16q >+cmp widthq, mmsize/4 >+jl .loopr >+ >+mov rq, widthq >+and rq, mmsize/4-1 >+sub widthq, rq >+ >+.loop16: ;parallel process 16 elements in a row Processing 4 column per loop, are you means, we want to save lots of unused register? We claim X64, so we have 16 of XMMs >+pxor m4, m4 >+xor iq, iq >+.loopi: >+movss m2, [matrixq + 4*iq] no idea that you working on Float data path, we are lucky, Intel CPU sounds not penalty in here. >+VBROADCASTSS m2, m2 >+mov ciq, [ptrq + iq * gprsize] >+movss m3, [ciq + c_offq] ;c[i][y*stride + off16] >+punpcklbw m3, m6 >+punpcklwd m3, m6 Since you claim SSE4, the instruction PMOVZXBD available, moreover, SSE4 register can be full fill 16 of uint8, but load 4 of them only. >+pmulld m2, m3 >+paddd m4, m2 >+ >+add iq, 1 >+cmp iq, radq When you initial iq to radq and decrement per loop, you can reduce one instruction I know iq is work as index in the loop, but we can found some trick over there. >+jl .loopi >+ >+cvtdq2ps m4, m4 >+mulps m4, m0 ; sum *= rdiv >+addps m4, m1 ; sum += bias >+addps m4, m5 ; sum += 0.5 I don't know how about precision mismatch if we pre-compute (bias+0.5) >+cvttps2dq m4, m4 >+packssdw m4, m4 >+packuswb m4, m4 >+movss [dstq + dst_offq], m4 >+add c_offq, mmsize/4 >+add dst_offq, mmsize/4 >+ >+add off16q, mmsize/4 >+cmp off16q, widthq >+jl .loop16 >+ >+add widthq, rq >+cmp off16q, widthq >+jge .paraend >+ >+.loopr: no idea about this loop, if we can read beyond, we can reuse above SIMD code >+xor sumd, sumd >+xor iq, iq >+.loopr_i: >+mov ciq, [ptrq + iq * gprsize] >+movzx rd, byte [ciq + c_offq] >+imul rd, [matrixq + 4*iq] >+add sumd, rd >+ >+add iq, 1 >+cmp iq, radq >+jl .loopr_i >+ >+pxor m4, m4 >+cvtsi2ss m4, sumd >+mulss m4, m0 ; sum *= rdiv >+addss m4, m1 ; sum += bias >+addss m4, m5 ; sum += 0.5 >+cvttps2dq m4, m4 >+packssdw m4, m4 >+packuswb m4, m4 >+movd sumd, m4 >+mov [dstq + dst_offq], sumb >+add c_offq, 1 >+add dst_offq, 1 >+add off16q, 1 >+cmp off16q, widthq >+jl .loopr >+ >+.paraend: >+sub c_offq, widthq >+sub dst_offq, widthq >+add c_offq, strideq >+add dst_offq, dstrideq >+ >+sub heightq, 1 >+cmp heightq, 0 >+jg .loopy >+ >+.end: >+RET ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".