Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
15.02.12 20:33, Dan Dennedy wrote:
> 2012/2/15 Maksym Veremeyenko ve...@m1stereo.tv:
>> 15.02.12 05:33, Dan Dennedy wrote:
>>> [...]
>>> OK, very close! But there is still one problem I noticed. On some geometry widths, the right edge of the B frame image is chopped off. This is reproduced in demo/mlt_my_name_is. On the first title that reads "My name is Inigo Montoya", notice how the right side of 'a' is cropped.
>>
>> I can't reproduce it...
>
> Look real closely - it occurs more at the beginning when the geometry is smaller. I can switch between the branch with this patch and master and see it is different.
>
>> Did you apply the patch completely? The newer version dropped these lines:
>>
>> + dest += j * 2;
>> + src += j * 2;
>> + alpha_a += j;
>> + alpha_b += j;
>>
>> because those values are updated in the asm code...
>
> Yes, just double-checked. I will see if I can figure it out this weekend because it is always nice to refresh myself on some SIMD asm.

It seems the problem is in the last 1..7 pixels of each processed row. Maybe it is a gcc-specific issue, and the dest, src, alpha_a and alpha_b values should be passed to the asm code through a stack copy.

-- 
Maksym Veremeyenko

___
Mlt-devel mailing list
Mlt-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mlt-devel
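Maksym's guess above concerns pointers that the asm block advances by itself. One common way to tell GCC that an asm statement modifies pointer or counter variables is to declare them as read-write ("+r") operands instead of plain inputs; GCC is allowed to assume that input operands are left unchanged. The sketch below is a hypothetical byte-copy helper (`copy_bytes_asm` is not part of MLT) for GCC/Clang on x86, with a C fallback elsewhere. It also uses a numeric local label (`1:` / `1b`) instead of a named label such as `loop_start:`, because a named label causes a duplicate-symbol error if the compiler emits the asm more than once, for example after inlining.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not MLT code): copies `count` bytes with an
 * inline-asm loop and returns the advanced destination pointer.
 * dst, src and count are changed inside the template, so they are
 * declared as read-write ("+r") operands rather than inputs. */
static uint8_t *copy_bytes_asm(uint8_t *dst, const uint8_t *src, size_t count)
{
    if (count == 0)   /* the asm loop below always runs at least once */
        return dst;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile(
        "1:                \n\t"   /* numeric local label, safe to duplicate */
        "movb (%1), %%al   \n\t"
        "movb %%al, (%0)   \n\t"
        "inc  %0           \n\t"   /* advance dst */
        "inc  %1           \n\t"   /* advance src */
        "dec  %2           \n\t"   /* one byte done */
        "jnz  1b           \n\t"
        : "+r"(dst), "+r"(src), "+r"(count)
        :
        : "eax", "memory", "cc");
#else
    while (count--)   /* portable fallback with the same effect */
        *dst++ = *src++;
#endif
    return dst;
}
```

Because `dst` is an in/out operand, the value returned after the asm is the advanced pointer, so the caller never has to re-derive it.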
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
15.02.12 20:33, Dan Dennedy wrote:
> [...]
> Look real closely - it occurs more at the beginning when the geometry is smaller. I can switch between the branch with this patch and master and see it is different.

Another attempt. The only thing I have doubts about is the xmm register clobber list, which is currently commented out...

-- 
Maksym Veremeyenko

From 45c8b653808e8bee5c832095f37cac1f193404f0 Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Thu, 16 Feb 2012 19:10:00 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c | 167 +
 src/modules/core/transition_composite.c | 19 ++-
 2 files changed, 184 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..2ed4801
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,167 @@
+void composite_line_yuv_sse2_simple(uint8_t *dest, uint8_t *src, int width, uint8_t *alpha_b, uint8_t *alpha_a, int weight)
+{
+    const static unsigned char const1[] =
+    {
+        0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+    };
+
+    __asm__ volatile
+    (
+        "pxor %%xmm0, %%xmm0 \n\t" /* clear zero register */
+        "movdqu (%4), %%xmm9 \n\t" /* load const1 */
+        "movd %0, %%xmm1 \n\t" /* load weight and decompose */
+        "movlhps %%xmm1, %%xmm1 \n\t"
+        "pshuflw $0, %%xmm1, %%xmm1 \n\t"
+        "pshufhw $0, %%xmm1, %%xmm1 \n\t"
+
+        /*
+            xmm1 (weight)
+
+            00 W 00 W 00 W 00 W 00 W 00 W 00 W 00 W
+        */
+        "loop_start: \n\t"
+        "movq (%1), %%xmm2 \n\t" /* load source alpha */
+        "punpcklbw %%xmm0, %%xmm2 \n\t" /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+        /*
+            xmm2 (src alpha)
+            xmm3 (dst alpha)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+        "pmullw %%xmm1, %%xmm2 \n\t" /* premultiply source alpha */
+        "psrlw $8, %%xmm2 \n\t"
+
+        /*
+            xmm2 (premultiplied)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+
+
+        /*
+            DSTa = DSTa + (SRCa * (0xFF - DSTa)) >> 8
+        */
+        "movq (%5), %%xmm3 \n\t" /* load dst alpha */
+        "punpcklbw %%xmm0, %%xmm3 \n\t" /* unpack dst 8 8-bits alphas to 8 16-bits values */
+        "movdqa %%xmm9, %%xmm4 \n\t"
+        "psubw %%xmm3, %%xmm4 \n\t"
+        "pmullw %%xmm2, %%xmm4 \n\t"
+        "psrlw $8, %%xmm4 \n\t"
+        "paddw %%xmm4, %%xmm3 \n\t"
+        "packuswb %%xmm0, %%xmm3 \n\t"
+        "movq %%xmm3, (%5) \n\t" /* save dst alpha */
+
+        "movdqu (%2), %%xmm3 \n\t" /* load src */
+        "movdqu (%3), %%xmm4 \n\t" /* load dst */
+        "movdqa %%xmm3, %%xmm5 \n\t" /* dub src */
+        "movdqa %%xmm4, %%xmm6 \n\t" /* dub dst */
+
+        /*
+            xmm3 (src)
+            xmm4 (dst)
+            xmm5 (src)
+            xmm6 (dst)
+
+            U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+        */
+
+        "punpcklbw %%xmm0, %%xmm5 \n\t" /* unpack src low */
+        "punpcklbw %%xmm0, %%xmm6 \n\t" /* unpack dst low */
+        "punpckhbw %%xmm0, %%xmm3 \n\t" /* unpack src high */
+        "punpckhbw %%xmm0, %%xmm4 \n\t" /* unpack dst high */
+
+        /*
+            xmm5 (src_l)
+            xmm6 (dst_l)
+
+            00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+            xmm3 (src_u)
+            xmm4 (dst_u)
+
+            00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+        */
+
+        "movdqa %%xmm2, %%xmm7 \n\t" /* dub alpha */
+        "movdqa %%xmm2, %%xmm8 \n\t" /* dub alpha */
+        "movlhps %%xmm7, %%xmm7 \n\t" /* dub low */
+        "movhlps %%xmm8, %%xmm8 \n\t" /* dub high */
+
+        /*
+            xmm7 (src alpha)
+
+            00 A4 00 A3 00 A2 00 A1 00 A4 00 A3 00 A2 00 A1
+            xmm8 (src alpha)
+
+            00 A8 00 A7 00 A6 00 A5 00 A8 00 A7 00 A6 00 A5
+        */
+
+        "pshuflw $0x50, %%xmm7, %%xmm7 \n\t"
+        "pshuflw $0x50, %%xmm8, %%xmm8 \n\t"
+        "pshufhw $0xFA,
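For reference, the per-pixel arithmetic described by the comments in the patch can be modelled in plain C. This is a sketch under stated assumptions: `composite_pixel_ref` is a hypothetical name, `weight` is taken to be already scaled to 0..255, and because the patch text above is cut off before the colour-blending step, the blend of the two bytes that share each alpha value is assumed to be a simple linear mix.

```c
#include <stdint.h>

/* Scalar model of one pixel as described by the patch comments.
 * Assumptions: weight is already in 0..255, and the (truncated)
 * colour step is a plain linear blend of the two bytes per pixel. */
static void composite_pixel_ref(uint8_t dest[2], const uint8_t src[2],
                                uint8_t alpha_b, uint8_t *alpha_a, int weight)
{
    /* premultiply source alpha by the weight (pmullw, then psrlw $8) */
    unsigned a = ((unsigned)alpha_b * (unsigned)weight) >> 8;

    /* DSTa = DSTa + ((SRCa * (0xFF - DSTa)) >> 8), with the shift applied
     * to the product only, as the psrlw/paddw order in the asm does */
    *alpha_a = (uint8_t)(*alpha_a + ((a * (0xFFu - *alpha_a)) >> 8));

    /* assumed blend of the two bytes of packed 4:2:2 data using alpha a */
    for (int k = 0; k < 2; k++)
        dest[k] = (uint8_t)((src[k] * a + dest[k] * (0xFFu - a)) >> 8);
}
```

Because every `>> 8` divides by 256 rather than 255, a fully opaque source at full weight gives a = 254 and, over a transparent destination, a destination alpha of 253 rather than 255.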
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
2012/2/16 Maksym Veremeyenko ve...@m1stereo.tv:
> 15.02.12 20:33, Dan Dennedy wrote:
>> [...]
>> Look real closely - it occurs more at the beginning when the geometry is smaller. I can switch between the branch with this patch and master and see it is different.
>
> Another attempt.

This one works!

> The only thing I have doubts about is the xmm register clobber list, which is currently commented out...

Do you think it is OK to merge now, or does this mean I should wait?

-- 
+-DRD-+
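On the clobber-list question: with GCC-style inline asm, every register that the template writes but that is not an output operand belongs in the clobber list, and that includes SSE registers. If they are left out, the compiler may keep live values in those xmm registers across the asm statement. Below is a minimal, hypothetical example (`add_sat16` is not MLT code) for GCC/Clang with SSE2 enabled, falling back to C otherwise. Clobber names such as "xmm8" and "xmm9" are only valid when compiling for x86-64, which matches the 64-bit-only restriction discussed at the end of this thread.

```c
#include <stdint.h>

/* Hypothetical example: saturating add of 16 bytes, dst[i] += src[i].
 * Every xmm register the template writes is listed as clobbered, and
 * "memory" is listed because the template stores through a pointer. */
static void add_sat16(uint8_t *dst, const uint8_t *src)
{
#if defined(__GNUC__) && defined(__SSE2__)
    __asm__ volatile(
        "movdqu  (%0), %%xmm0   \n\t"
        "movdqu  (%1), %%xmm1   \n\t"
        "paddusb %%xmm1, %%xmm0 \n\t"
        "movdqu  %%xmm0, (%0)   \n\t"
        :
        : "r"(dst), "r"(src)
        : "xmm0", "xmm1", "memory");
#else
    for (int i = 0; i < 16; i++) {          /* portable fallback */
        unsigned s = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
#endif
}
```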
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
15.02.12 05:33, Dan Dennedy wrote:
> [...]
> OK, very close! But there is still one problem I noticed. On some geometry widths, the right edge of the B frame image is chopped off. This is reproduced in demo/mlt_my_name_is. On the first title that reads "My name is Inigo Montoya", notice how the right side of 'a' is cropped.

I can't reproduce it...

Did you apply the patch completely? The newer version dropped these lines:

+ dest += j * 2;
+ src += j * 2;
+ alpha_a += j;
+ alpha_b += j;

because those values are updated in the asm code...

-- 
Maksym Veremeyenko
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
2012/2/15 Maksym Veremeyenko ve...@m1stereo.tv:
> 15.02.12 05:33, Dan Dennedy wrote:
>> [...]
>> OK, very close! But there is still one problem I noticed. On some geometry widths, the right edge of the B frame image is chopped off. This is reproduced in demo/mlt_my_name_is. On the first title that reads "My name is Inigo Montoya", notice how the right side of 'a' is cropped.
>
> I can't reproduce it...

Look real closely - it occurs more at the beginning when the geometry is smaller. I can switch between the branch with this patch and master and see it is different.

> Did you apply the patch completely? The newer version dropped these lines:
>
> + dest += j * 2;
> + src += j * 2;
> + alpha_a += j;
> + alpha_b += j;
>
> because those values are updated in the asm code...

Yes, just double-checked. I will see if I can figure it out this weekend because it is always nice to refresh myself on some SIMD asm.

-- 
+-DRD-+
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
10.02.12 07:41, Dan Dennedy wrote:
> 2012/2/2 Maksym Veremeyenko ve...@m1stereo.tv:
>> Hi,
>>
>> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...
>
> Hi Maksym, did some more testing and ran into a couple of image quality problems.
>
> First, alpha blending seems poor, mostly noticeable with text in a curvy typeface over video:
>
> melt clip1.dv -filter dynamictext:Hello size=200 outline=2 olcolour=white family=elegante bgcolor=0x0020
>
> The first time you run that, you will see that the alpha of bgcolour (black with 12.5% opacity) is not honored and the background is black. Set bgcolour=0 to make it completely transparent and look along curved edges to see the poor blending.
>
> The second problem is that key-framing opacity causes a repeating cycle of 100% A frame, A+B blended, and 100% B frame. The command below reproduces it:
>
> melt color:red -track color:blue -transition composite out=99 geometry="0=0/0:100%x100%:0; 99=0/0:100%x100%:100"

I wrongly assumed the weight range was 0..255 - updated patch attached...
-- 
Maksym Veremeyenko

From e8c8a1dde7883f203f609f364a27ea6c1a77104f Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Tue, 14 Feb 2012 13:34:12 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c | 164 +
 src/modules/core/transition_composite.c | 12 ++-
 2 files changed, 174 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..f202828
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,164 @@
+    const static unsigned char const1[] =
+    {
+        0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+    };
+
+    __asm__ volatile
+    (
+        "pxor %%xmm0, %%xmm0 \n\t" /* clear zero register */
+        "movdqu (%4), %%xmm9 \n\t" /* load const1 */
+        "movd %0, %%xmm1 \n\t" /* load weight and decompose */
+        "movlhps %%xmm1, %%xmm1 \n\t"
+        "pshuflw $0, %%xmm1, %%xmm1 \n\t"
+        "pshufhw $0, %%xmm1, %%xmm1 \n\t"
+
+        /*
+            xmm1 (weight)
+
+            00 W 00 W 00 W 00 W 00 W 00 W 00 W 00 W
+        */
+        "loop_start: \n\t"
+        "movq (%1), %%xmm2 \n\t" /* load source alpha */
+        "punpcklbw %%xmm0, %%xmm2 \n\t" /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+        /*
+            xmm2 (src alpha)
+            xmm3 (dst alpha)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+        "pmullw %%xmm1, %%xmm2 \n\t" /* premultiply source alpha */
+        "psrlw $8, %%xmm2 \n\t"
+
+        /*
+            xmm2 (premultiplied)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+
+
+        /*
+            DSTa = DSTa + (SRCa * (0xFF - DSTa)) >> 8
+        */
+        "movq (%5), %%xmm3 \n\t" /* load dst alpha */
+        "punpcklbw %%xmm0, %%xmm3 \n\t" /* unpack dst 8 8-bits alphas to 8 16-bits values */
+        "movdqa %%xmm9, %%xmm4 \n\t"
+        "psubw %%xmm3, %%xmm4 \n\t"
+        "pmullw %%xmm2, %%xmm4 \n\t"
+        "psrlw $8, %%xmm4 \n\t"
+        "paddw %%xmm4, %%xmm3 \n\t"
+        "packuswb %%xmm0, %%xmm3 \n\t"
+        "movq %%xmm3, (%5) \n\t" /* save dst alpha */
+
+        "movdqu (%2), %%xmm3 \n\t" /* load src */
+        "movdqu (%3), %%xmm4 \n\t" /* load dst */
+        "movdqa %%xmm3, %%xmm5 \n\t" /* dub src */
+        "movdqa %%xmm4, %%xmm6 \n\t" /* dub dst */
+
+        /*
+            xmm3 (src)
+            xmm4 (dst)
+            xmm5 (src)
+            xmm6 (dst)
+
+            U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+        */
+
+        "punpcklbw %%xmm0, %%xmm5 \n\t" /* unpack src low */
+        "punpcklbw %%xmm0, %%xmm6 \n\t" /* unpack dst low */
+        "punpckhbw %%xmm0, %%xmm3 \n\t" /* unpack src high */
+        "punpckhbw %%xmm0, %%xmm4 \n\t" /* unpack dst high */
+
+        /*
+            xmm5 (src_l)
+            xmm6 (dst_l)
+
+            00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+            xmm3 (src_u)
+            xmm4 (dst_u)
+
+            00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+        */
+
+        "movdqa
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
2012/2/14 Maksym Veremeyenko ve...@m1stereo.tv:
> 10.02.12 07:41, Dan Dennedy wrote:
>> 2012/2/2 Maksym Veremeyenko ve...@m1stereo.tv:
>>> Hi,
>>>
>>> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...
>>
>> Hi Maksym, did some more testing and ran into a couple of image quality problems.
>>
>> First, alpha blending seems poor, mostly noticeable with text in a curvy typeface over video:
>>
>> melt clip1.dv -filter dynamictext:Hello size=200 outline=2 olcolour=white family=elegante bgcolor=0x0020
>>
>> The first time you run that, you will see that the alpha of bgcolour (black with 12.5% opacity) is not honored and the background is black. Set bgcolour=0 to make it completely transparent and look along curved edges to see the poor blending.
>>
>> The second problem is that key-framing opacity causes a repeating cycle of 100% A frame, A+B blended, and 100% B frame. The command below reproduces it:
>>
>> melt color:red -track color:blue -transition composite out=99 geometry="0=0/0:100%x100%:0; 99=0/0:100%x100%:100"
>
> I wrongly assumed the weight range was 0..255 - updated patch attached...

OK, very close! But there is still one problem I noticed. On some geometry widths, the right edge of the B frame image is chopped off. This is reproduced in demo/mlt_my_name_is. On the first title that reads "My name is Inigo Montoya", notice how the right side of 'a' is cropped.

-- 
+-DRD-+
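The cycling Dan reported (100% A, A+B, 100% B, repeating) is what one would expect if a 16-bit weight is fed into a 16-bit multiply that keeps only the low 16 bits of the product, as `pmullw` does. The backtrace elsewhere in this thread shows `weight=65535`, so the range is 0..65535 rather than 0..255. The sketch below emulates one lane in C; `mix_wrong` and `mix_fixed` are hypothetical names, and reducing the weight with `>> 8` is one possible fix, not necessarily the change made in the updated patch.

```c
#include <stdint.h>

/* Emulates one 16-bit lane of "pmullw" followed by "psrlw $8":
 * the product is truncated to its low 16 bits before the shift. */
static uint16_t mullw_srl8(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint16_t)((uint32_t)a * b) >> 8);
}

/* Raw 16-bit weight fed straight into the 16-bit multiply (bug model). */
static uint8_t mix_wrong(uint8_t alpha, uint16_t weight16)
{
    return (uint8_t)mullw_srl8(alpha, weight16);
}

/* Weight first reduced to 0..255, so the product fits in 16 bits. */
static uint8_t mix_fixed(uint8_t alpha, uint16_t weight16)
{
    return (uint8_t)mullw_srl8(alpha, (uint16_t)(weight16 >> 8));
}
```

With alpha 255, the truncated product in `mix_wrong` grows by 255 per weight step modulo 65536, so it wraps roughly every 257 weight units. A keyframed opacity ramp therefore jumps between transparent, blended and opaque results instead of fading smoothly.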
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
2012/2/2 Maksym Veremeyenko ve...@m1stereo.tv:
> Hi,
>
> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...

Hi Maksym, did some more testing and ran into a couple of image quality problems.

First, alpha blending seems poor, mostly noticeable with text in a curvy typeface over video:

melt clip1.dv -filter dynamictext:Hello size=200 outline=2 olcolour=white family=elegante bgcolor=0x0020

The first time you run that, you will see that the alpha of bgcolour (black with 12.5% opacity) is not honored and the background is black. Set bgcolour=0 to make it completely transparent and look along curved edges to see the poor blending.

The second problem is that key-framing opacity causes a repeating cycle of 100% A frame, A+B blended, and 100% B frame. The command below reproduces it:

melt color:red -track color:blue -transition composite out=99 geometry="0=0/0:100%x100%:0; 99=0/0:100%x100%:100"

-- 
+-DRD-+
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
08.02.12 06:45, Dan Dennedy wrote:
> On Tue, Feb 7, 2012 at 8:40 PM, Dan Dennedy d...@dennedy.org wrote:
>> 2012/2/6 Maksym Veremeyenko ve...@m1stereo.tv:
>>> 04.02.12 22:25, Dan Dennedy wrote:
>>>> 2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:
>>>>> 02.02.12 18:57, Maksym Veremeyenko wrote:
>>>>>> Hi,
>>>>>>
>>>>>> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...
>>>>>
>>>>> Updated patch attached.
>>>
>>> I am still testing it... maybe you can take a look at whether everything is processed correctly.
>>
>> I started testing with the demos in the demo/ directory. It mostly works; however, mlt_my_name_is crashes on me (and one time mlt_squeeze_box):
>>
>> (gdb) bt
>> #0 0x7fffdecced14 in composite_line_yuv (
>> dest=0x25bae60 ~\037~ ~\037~!~-\177\064~2\177\065}5\200\064}3\200\062}3\201\064}6\201\063|0\202-|+\202+|+\202,|-\202/{0\203\060{1\203\061{1\203\061{1\203\061|2\202\062|1\202/{/\202\060{1\202-{,\202-{1\202\062{2\202\065{:\202\062{3\203\063{3\203\063z6\203;z@\203=z@\203BzD\203JzR\204WzW\204^ya\202fyj\202myn\202nyn\202wyv\202yy~\202\177y|\202|y\177\202{y{\202|y~\202\177y}\201|y}\201{y{\201{yy\201xyw\201wyx\201w{w\201t{p\201..., src=0x2c621f0 width=0, alpha_b=0x2c16e00 ,
>
> Expanding the test to include width fixed that for me.
>
> if ( !luma && width )
>
> Can you think of some other checks we should add?

It seems I missed those checks; I think it should be:

if ( !luma && width > 7 )

-- 
Maksym Veremeyenko
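The proposed guard sends a row to the SSE2 kernel only when there is no luma map and the row holds at least one full 8-pixel group. The `width=0` crash in the backtrace fits a loop that, like a typical `label: ... dec; jnz label` asm loop, tests its counter only at the bottom: with a count of 0, the first decrement wraps around and the loop keeps running. That loop shape is an assumption, since the end of the asm loop is not shown in the patch excerpts in this thread. A small C model with hypothetical function names:

```c
#include <stddef.h>

/* Hypothetical dispatch helper modelled on the proposed guard
 * "if ( !luma && width > 7 )": returns how many pixels the 8-pixel
 * SSE2 kernel may take; everything else goes to the C loop. */
static int simd_pixel_count(const void *luma, int width)
{
    if (luma == NULL && width > 7)
        return width & ~7;    /* largest multiple of 8 not above width */
    return 0;                 /* luma wipe or short row: C path only */
}

/* Models a bottom-tested counting loop. The body runs before the first
 * test, so a count of 0 wraps around instead of stopping.
 * The cap of 1000 exists only so the demonstration terminates. */
static int count_iterations_bottom_tested(unsigned blocks)
{
    int n = 0;
    do {
        n++;
    } while (--blocks != 0 && n < 1000);
    return n;
}
```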
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
2012/2/6 Maksym Veremeyenko ve...@m1stereo.tv:
> 04.02.12 22:25, Dan Dennedy wrote:
>> 2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:
>>> 02.02.12 18:57, Maksym Veremeyenko wrote:
>>>> Hi,
>>>>
>>>> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...
>>>
>>> Updated patch attached.
>>
>> If I am not mistaken, this change reduces precision to 8 pixels. The existing transition is already limited to a 2 pixel precision, which I am not happy about. I do not want to further reduce the precision, give different results depending on CPU, and effectively introduce a regression, as far as the user is concerned. Maybe we should limit it to only apply when width is a multiple of 8. Then, it would still be used for fullscreen composite at the resolution of most profiles.
>
> Not exactly... the SSE2 code processes 8-pixel groups, and the tail of 1...7 pixels is processed by native code - that is why I did not create a standalone function but put the code into the existing composite_yuv code...
>
> I am still testing it... maybe you can take a look at whether everything is processed correctly.

I started testing with the demos in the demo/ directory.
It mostly works; however, mlt_my_name_is crashes on me (and one time mlt_squeeze_box): (gdb) bt #0 0x7fffdecced14 in composite_line_yuv ( dest=0x25bae60 ~\037~ ~\037~!~-\177\064~2\177\065}5\200\064}3\200\062}3\201\064}6\201\063|0\202-|+\202+|+\202,|-\202/{0\203\060{1\203\061{1\203\061{1\203\061|2\202\062|1\202/{/\202\060{1\202-{,\202-{1\202\062{2\202\065{:\202\062{3\203\063{3\203\063z6\203;z@\203=z@\203BzD\203JzR\204WzW\204^ya\202fyj\202myn\202nyn\202wyv\202yy~\202\177y|\202|y\177\202{y{\202|y~\202\177y}\201|y}\201{y{\201{yy\201xyw\201wyx\201w{w\201t{p\201..., src=0x2c621f0 \020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200..., width=0, alpha_b=0x2c16e00 , alpha_a=0x1ee9a30 
\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377..., weight=65535, luma=0x0, soft=0, step=65535) at composite_line_yuv_sse2_simple.c:6 -- +-DRD-+ -- Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d ___ Mlt-devel mailing list Mlt-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/mlt-devel
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
2012/2/6 Maksym Veremeyenko ve...@m1stereo.tv:
> 04.02.12 22:25, Dan Dennedy wrote:
>> 2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:
>>> 02.02.12 18:57, Maksym Veremeyenko wrote:
>>>> Hi,
>>>>
>>>> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...
>>>
>>> Updated patch attached.
>>
>> If I am not mistaken, this change reduces precision to 8 pixels. The existing transition is already limited to a 2 pixel precision, which I am not happy about. I do not want to further reduce the precision, give different results depending on CPU, and effectively introduce a regression, as far as the user is concerned. Maybe we should limit it to only apply when width is a multiple of 8. Then, it would still be used for fullscreen composite at the resolution of most profiles.
>
> Not exactly... the SSE2 code processes 8-pixel groups, and the tail of 1...7 pixels is processed by native code - that is why I did not create a standalone function but put the code into the existing composite_yuv code...

Ah, by native code I think you mean C, and if so, sorry, I overlooked that.

> I am still testing it... maybe you can take a look at whether everything is processed correctly.

I will give it another review and testing. Your other items are still on my todo list, but I have been busy lately dealing with libav- and ffmpeg-integration bugs.

-- 
+-DRD-+
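The split Maksym describes can be sketched as follows: the largest multiple of 8 pixels goes to the SSE2 kernel, the pointers are advanced past those pixels, and the remaining 0..7 pixels go through the existing C loop. This is a simplified, hypothetical model: instead of real YUV and alpha buffers it tags each pixel with the path that handled it. In the real code `dest`/`src` would advance by 2 bytes per pixel and the alpha pointers by 1, as in the `dest += j * 2;` lines quoted earlier in the thread.

```c
#include <stdint.h>

#define GROUP 8  /* pixels per SSE2 iteration in the patch */

/* Stand-in for the SSE2 kernel: handles n pixels (n % 8 == 0).
 * It only tags each pixel so coverage can be checked. */
static void simd_kernel(char *tag, int n) { for (int i = 0; i < n; i++) tag[i] = 'S'; }

/* Stand-in for the existing C loop that handles the leftover pixels. */
static void c_kernel(char *tag, int n) { for (int i = 0; i < n; i++) tag[i] = 'C'; }

/* Splits one row between the two paths and returns the SIMD pixel count. */
static int composite_row(char *tag, int width)
{
    int j = width & ~(GROUP - 1);   /* largest multiple of 8 <= width */
    if (j > 0)
        simd_kernel(tag, j);
    tag += j;                       /* advance past what SIMD consumed */
    c_kernel(tag, width - j);       /* 0..7 remaining pixels */
    return j;
}
```

If the advance step is skipped (for example because it was expected to happen inside the asm and did not), the C loop re-processes the first pixels of the row and the last 1..7 pixels are never composited, which is one way a chopped right edge could appear.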
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
06.02.12 18:47, Dan Dennedy wrote:
> [...]
> I will give it another review and testing. Your other items are still on my todo list, but I have been busy lately dealing with libav- and ffmpeg-integration bugs.

Off-topic: why is an alpha plane created if it does not exist? Why not pass NULL into the composite_yuv_??? function? I think that for 1920x1080, creating a non-transparent alpha plane could slow the system down a bit.

-- 
Maksym Veremeyenko
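The suggestion amounts to treating a missing alpha plane as fully opaque instead of allocating one; for a 1920x1080 frame, an opaque plane is 1920 x 1080 = 2,073,600 bytes that must be allocated and filled with 0xFF. A hypothetical sketch of the C side follows (the names are illustrative, not MLT functions). An SSE2 kernel would need either a separate no-alpha variant or a register preloaded with 0xFF in place of the alpha load.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor: a NULL alpha plane is read as fully opaque,
 * so no 0xFF-filled plane has to be allocated. */
static inline uint8_t alpha_at(const uint8_t *alpha, int i)
{
    return alpha != NULL ? alpha[i] : 0xFF;
}

/* Premultiplies one row of alpha by an 8-bit weight (0..255);
 * behaves the same whether or not the frame has an alpha plane. */
static void premultiply_row(uint8_t *out, const uint8_t *alpha, int width, int weight)
{
    for (int i = 0; i < width; i++)
        out[i] = (uint8_t)((alpha_at(alpha, i) * weight) >> 8);
}
```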
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:
> 02.02.12 18:57, Maksym Veremeyenko wrote:
>> Hi,
>>
>> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...
>
> Updated patch attached.

If I am not mistaken, this change reduces precision to 8 pixels. The existing transition is already limited to a 2 pixel precision, which I am not happy about. I do not want to further reduce the precision, give different results depending on CPU, and effectively introduce a regression, as far as the user is concerned. Maybe we should limit it to only apply when width is a multiple of 8. Then, it would still be used for fullscreen composite at the resolution of most profiles.

-- 
+-DRD-+
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
02.02.12 18:57, Maksym Veremeyenko wrote:
> Hi,
>
> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...

Updated patch attached.

-- 
Maksym Veremeyenko

From d0a46a3308b390228e6d4337b24010ae3cecef7f Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Fri, 3 Feb 2012 13:19:12 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c | 164 +
 src/modules/core/transition_composite.c | 16 ++-
 2 files changed, 178 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..bd977e1
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,164 @@
+    const static unsigned char const1[] =
+    {
+        0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+    };
+
+    __asm__ volatile
+    (
+        "pxor %%xmm0, %%xmm0 \n\t" /* clear zero register */
+        "movdqu (%4), %%xmm9 \n\t" /* load const1 */
+        "movd %0, %%xmm1 \n\t" /* load weight and decompose */
+        "movlhps %%xmm1, %%xmm1 \n\t"
+        "pshuflw $0, %%xmm1, %%xmm1 \n\t"
+        "pshufhw $0, %%xmm1, %%xmm1 \n\t"
+
+        /*
+            xmm1 (weight)
+
+            00 W 00 W 00 W 00 W 00 W 00 W 00 W 00 W
+        */
+        "loop_start: \n\t"
+        "movq (%1), %%xmm2 \n\t" /* load source alpha */
+        "punpcklbw %%xmm0, %%xmm2 \n\t" /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+        /*
+            xmm2 (src alpha)
+            xmm3 (dst alpha)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+        "pmullw %%xmm1, %%xmm2 \n\t" /* premultiply source alpha */
+        "psrlw $8, %%xmm2 \n\t"
+
+        /*
+            xmm2 (premultiplied)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+
+
+        /*
+            DSTa = DSTa + (SRCa * (0xFF - DSTa)) >> 8
+        */
+        "movq (%5), %%xmm3 \n\t" /* load dst alpha */
+        "punpcklbw %%xmm0, %%xmm3 \n\t" /* unpack dst 8 8-bits alphas to 8 16-bits values */
+        "movdqa %%xmm9, %%xmm4 \n\t"
+        "psubw %%xmm3, %%xmm4 \n\t"
+        "pmullw %%xmm2, %%xmm4 \n\t"
+        "psrlw $8, %%xmm4 \n\t"
+        "paddw %%xmm4, %%xmm3 \n\t"
+        "packuswb %%xmm0, %%xmm3 \n\t"
+        "movq %%xmm3, (%5) \n\t" /* load dst alpha */
+
+        "movdqu (%2), %%xmm3 \n\t" /* load src */
+        "movdqu (%3), %%xmm4 \n\t" /* load dst */
+        "movdqa %%xmm3, %%xmm5 \n\t" /* dub src */
+        "movdqa %%xmm4, %%xmm6 \n\t" /* dub dst */
+
+        /*
+            xmm3 (src)
+            xmm4 (dst)
+            xmm5 (src)
+            xmm6 (dst)
+
+            U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+        */
+
+        "punpcklbw %%xmm0, %%xmm5 \n\t" /* unpack src low */
+        "punpcklbw %%xmm0, %%xmm6 \n\t" /* unpack dst low */
+        "punpckhbw %%xmm0, %%xmm3 \n\t" /* unpack src high */
+        "punpckhbw %%xmm0, %%xmm4 \n\t" /* unpack dst high */
+
+        /*
+            xmm5 (src_l)
+            xmm6 (dst_l)
+
+            00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+            xmm3 (src_u)
+            xmm4 (dst_u)
+
+            00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+        */
+
+        "movdqa %%xmm2, %%xmm7 \n\t" /* dub alpha */
+        "movdqa %%xmm2, %%xmm8 \n\t" /* dub alpha */
+        "movlhps %%xmm7, %%xmm7 \n\t" /* dub low */
+        "movhlps %%xmm8, %%xmm8 \n\t" /* dub high */
+
+        /*
+            xmm7 (src alpha)
+
+            00 A4 00 A3 00 A2 00 A1 00 A4 00 A3 00 A2 00 A1
+            xmm8 (src alpha)
+
+            00 A8 00 A7 00 A6 00 A5 00 A8 00 A7 00 A6 00 A5
+        */
+
+        "pshuflw $0x50, %%xmm7, %%xmm7 \n\t"
+        "pshuflw $0x50, %%xmm8, %%xmm8 \n\t"
+        "pshufhw $0xFA, %%xmm7, %%xmm7 \n\t"
+        "pshufhw $0xFA, %%xmm8, %%xmm8 \n\t"
+
+        /*
+            xmm7 (src alpha lower)
+
+            00 A4 00 A4 00 A3 00 A3 00 A2 00 A2 00 A1 00 A1
+
+            xmm8 (src alpha upper)
+
[Mlt-devel] [PATCH] use sse2 instruction for line compositing
Hi,

The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...

-- 
Maksym Veremeyenko

From 73dca48f8e4a470140ab4d70d2002c6ff39017ef Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Thu, 2 Feb 2012 18:03:07 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c | 164 +
 src/modules/core/transition_composite.c | 12 ++-
 2 files changed, 174 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..bd977e1
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,164 @@
+    const static unsigned char const1[] =
+    {
+        0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+    };
+
+    __asm__ volatile
+    (
+        "pxor %%xmm0, %%xmm0 \n\t" /* clear zero register */
+        "movdqu (%4), %%xmm9 \n\t" /* load const1 */
+        "movd %0, %%xmm1 \n\t" /* load weight and decompose */
+        "movlhps %%xmm1, %%xmm1 \n\t"
+        "pshuflw $0, %%xmm1, %%xmm1 \n\t"
+        "pshufhw $0, %%xmm1, %%xmm1 \n\t"
+
+        /*
+            xmm1 (weight)
+
+            00 W 00 W 00 W 00 W 00 W 00 W 00 W 00 W
+        */
+        "loop_start: \n\t"
+        "movq (%1), %%xmm2 \n\t" /* load source alpha */
+        "punpcklbw %%xmm0, %%xmm2 \n\t" /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+        /*
+            xmm2 (src alpha)
+            xmm3 (dst alpha)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+        "pmullw %%xmm1, %%xmm2 \n\t" /* premultiply source alpha */
+        "psrlw $8, %%xmm2 \n\t"
+
+        /*
+            xmm2 (premultiplied)
+
+            00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+        */
+
+
+        /*
+            DSTa = DSTa + (SRCa * (0xFF - DSTa)) >> 8
+        */
+        "movq (%5), %%xmm3 \n\t" /* load dst alpha */
+        "punpcklbw %%xmm0, %%xmm3 \n\t" /* unpack dst 8 8-bits alphas to 8 16-bits values */
+        "movdqa %%xmm9, %%xmm4 \n\t"
+        "psubw %%xmm3, %%xmm4 \n\t"
+        "pmullw %%xmm2, %%xmm4 \n\t"
+        "psrlw $8, %%xmm4 \n\t"
+        "paddw %%xmm4, %%xmm3 \n\t"
+        "packuswb %%xmm0, %%xmm3 \n\t"
+        "movq %%xmm3, (%5) \n\t" /* load dst alpha */
+
+        "movdqu (%2), %%xmm3 \n\t" /* load src */
+        "movdqu (%3), %%xmm4 \n\t" /* load dst */
+        "movdqa %%xmm3, %%xmm5 \n\t" /* dub src */
+        "movdqa %%xmm4, %%xmm6 \n\t" /* dub dst */
+
+        /*
+            xmm3 (src)
+            xmm4 (dst)
+            xmm5 (src)
+            xmm6 (dst)
+
+            U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+        */
+
+        "punpcklbw %%xmm0, %%xmm5 \n\t" /* unpack src low */
+        "punpcklbw %%xmm0, %%xmm6 \n\t" /* unpack dst low */
+        "punpckhbw %%xmm0, %%xmm3 \n\t" /* unpack src high */
+        "punpckhbw %%xmm0, %%xmm4 \n\t" /* unpack dst high */
+
+        /*
+            xmm5 (src_l)
+            xmm6 (dst_l)
+
+            00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+            xmm3 (src_u)
+            xmm4 (dst_u)
+
+            00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+        */
+
+        "movdqa %%xmm2, %%xmm7 \n\t" /* dub alpha */
+        "movdqa %%xmm2, %%xmm8 \n\t" /* dub alpha */
+        "movlhps %%xmm7, %%xmm7 \n\t" /* dub low */
+        "movhlps %%xmm8, %%xmm8 \n\t" /* dub high */
+
+        /*
+            xmm7 (src alpha)
+
+            00 A4 00 A3 00 A2 00 A1 00 A4 00 A3 00 A2 00 A1
+            xmm8 (src alpha)
+
+            00 A8 00 A7 00 A6 00 A5 00 A8 00 A7 00 A6 00 A5
+        */
+
+        "pshuflw $0x50, %%xmm7, %%xmm7 \n\t"
+        "pshuflw $0x50, %%xmm8, %%xmm8 \n\t"
+        "pshufhw $0xFA, %%xmm7, %%xmm7 \n\t"
+        "pshufhw $0xFA, %%xmm8, %%xmm8 \n\t"
+
+        /*
+            xmm7 (src alpha lower)
+
+            00 A4 00 A4 00 A3 00 A3 00 A2 00 A2 00 A1 00 A1
+
+            xmm8 (src alpha upper)
+            00 A8 00 A8 00 A7 00 A7 00 A6 00 A6 00 A5 00 A5
+        */
+
+
+
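The `pshuflw`/`pshufhw` immediates in the patch are easier to follow once decoded: each 2-bit field of the immediate selects the source word for one of the four destination words in the low (or high) half of the register. The C emulation below is an illustrative sketch (the `_emul` functions are hypothetical, not intrinsics). It reproduces two steps from the patch: `$0` broadcasts word 0 (the weight setup), and `$0x50` followed by `$0xFA` turns A1 A2 A3 A4 A1 A2 A3 A4 into A1 A1 A2 A2 A3 A3 A4 A4, giving one alpha value per pair of bytes, i.e. per pixel of packed YUV 4:2:2.

```c
#include <stdint.h>

/* Emulates PSHUFLW: low four 16-bit words are permuted by the 2-bit
 * fields of imm (field k picks the source word for lane k); the high
 * four words are copied unchanged. dst may alias src. */
static void pshuflw_emul(uint16_t dst[8], const uint16_t src[8], unsigned imm)
{
    uint16_t tmp[8];
    for (int k = 0; k < 4; k++)
        tmp[k] = src[(imm >> (2 * k)) & 3];
    for (int k = 4; k < 8; k++)
        tmp[k] = src[k];
    for (int k = 0; k < 8; k++)
        dst[k] = tmp[k];
}

/* Emulates PSHUFHW: the same selection applied to the high four words;
 * the low four words are copied unchanged. dst may alias src. */
static void pshufhw_emul(uint16_t dst[8], const uint16_t src[8], unsigned imm)
{
    uint16_t tmp[8];
    for (int k = 0; k < 4; k++)
        tmp[k] = src[k];
    for (int k = 0; k < 4; k++)
        tmp[4 + k] = src[4 + ((imm >> (2 * k)) & 3)];
    for (int k = 0; k < 8; k++)
        dst[k] = tmp[k];
}
```

Lane 0 is the least significant word, so the register diagrams in the patch comments, which list the highest lane first, read right to left against these arrays.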
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
On 02.02.2012 17:57, Maksym Veremeyenko wrote:
> Hi,
>
> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...

Is there a reason why it is only available on amd64? Surely the code should be disabled if MLT is configured without SSE support, but a user with a modern CPU but an i386 userland/kernel may still want to benefit from it?

-- 
/*
Mit freundlichem Gruß / With kind regards,

Patrick Matthäi
GNU/Linux Debian Developer

E-Mail: pmatth...@debian.org
patr...@linux-dev.org
*/
Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing
02.02.12 19:01, Patrick Matthäi wrote:
> On 02.02.2012 17:57, Maksym Veremeyenko wrote:
>> Hi,
>>
>> The attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It works for the case where luma is not defined...
>
> Is there a reason why it is only available on amd64?

Because it uses xmm8 and xmm9, which are only available in x64 mode...

> Surely the code should be disabled if MLT is configured without SSE support,

It is only used under:

[...]
#if defined(USE_SSE) && defined(ARCH_X86_64)

> but a user with a modern CPU but an i386 userland/kernel may still want to benefit from it?

The code could be reworked so that it does not keep two constants in registers...

-- 
Maksym Veremeyenko
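The 64-bit restriction comes from the hand-written asm pinning values in xmm8 and xmm9, which do not exist in 32-bit mode. Writing the same arithmetic with SSE2 intrinsics leaves register allocation to the compiler, so it also builds for 32-bit x86 when SSE2 is enabled (for example with `-msse2`). Below is a sketch of the destination-alpha update for one group of 8 pixels. It assumes the weight is already scaled to 0..255; `update_alpha8` is a hypothetical name, and this is not the code from the patch.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Updates 8 destination alpha values following the patch comments:
 *   a    = (alpha_b * weight) >> 8            (weight assumed 0..255)
 *   DSTa = DSTa + ((a * (0xFF - DSTa)) >> 8)
 * Intrinsics let the compiler choose registers (xmm0-7 on i386). */
static void update_alpha8(uint8_t *alpha_a, const uint8_t *alpha_b, int weight)
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i ff   = _mm_set1_epi16(0xFF);
    const __m128i w    = _mm_set1_epi16((short)weight);

    /* widen 8 source alphas to 16 bits and premultiply by the weight */
    __m128i a = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)alpha_b), zero);
    a = _mm_srli_epi16(_mm_mullo_epi16(a, w), 8);

    /* DSTa += ((0xFF - DSTa) * a) >> 8, then narrow back to bytes */
    __m128i d = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)alpha_a), zero);
    __m128i t = _mm_srli_epi16(_mm_mullo_epi16(_mm_sub_epi16(ff, d), a), 8);
    d = _mm_packus_epi16(_mm_add_epi16(d, t), zero);
    _mm_storel_epi64((__m128i *)alpha_a, d);
#else
    for (int i = 0; i < 8; i++) {   /* scalar fallback, same arithmetic */
        unsigned a = ((unsigned)alpha_b[i] * (unsigned)weight) >> 8;
        alpha_a[i] = (uint8_t)(alpha_a[i] + ((a * (0xFFu - alpha_a[i])) >> 8));
    }
#endif
}
```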