Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-16 Thread Maksym Veremeyenko
15.02.12 20:33, Dan Dennedy wrote:
 2012/2/15 Maksym Veremeyenko ve...@m1stereo.tv:
 15.02.12 05:33, Dan Dennedy wrote:
 [...]

 OK, very close! But there is still one problem I noticed. On some
 geometry widths, the right edge of the B frame image is chopped off.
 This is reproduced in demo/mlt_my_name_is. On the first title that
 reads "My name is Inigo Montoya" notice how the right side of 'a' is
 cropped.


 i can't reproduce it...

 Look real closely - it occurs more at the beginning when the geometry
 is smaller. I can switch between the branch with this patch and master
 and see it is different.

 did you apply the patch completely? because the newer
 version has dropped these lines:

 +   dest += j * 2;
 +   src += j * 2;
 +   alpha_a += j;
 +   alpha_b += j;

 because those values are updated in the asm code...

 yes, just double-checked. I will see if I can figure it out this
 weekend because it is always nice to refresh myself on some simd asm.


the problem seems to be in the last 1..7 pixels of the rows processed. it may
be a gcc-specific issue, and the values dest, src, alpha_a and alpha_b should
be passed to the asm function through a stack copy
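
One plausible explanation, offered only as an assumption and not confirmed anywhere in this thread: GCC treats plain input operands of an extended asm statement as read-only, so if the asm body advances dest, src, alpha_a and alpha_b while they are declared as inputs, the compiler is free to keep using the old values afterwards. Declaring them as read-write ("+r") operands is the usual alternative to a manual stack copy. A minimal sketch, using a hypothetical copy_blocks16 helper that is unrelated to the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy `blocks` 16-byte blocks.
 * The asm advances both pointers and decrements the counter, so all three
 * are declared as read-write ("+r") operands; GCC then knows they change.
 */
static void copy_blocks16(uint8_t *dst, const uint8_t *src, size_t blocks)
{
    if (blocks == 0)
        return;
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile
    (
        "1:                     \n\t"   /* numeric local label, safe if inlined twice */
        "movdqu (%1), %%xmm0    \n\t"   /* load 16 source bytes */
        "movdqu %%xmm0, (%0)    \n\t"   /* store them to dst */
        "add    $16, %0         \n\t"   /* advance dst */
        "add    $16, %1         \n\t"   /* advance src */
        "dec    %2              \n\t"   /* one block done */
        "jnz    1b              \n\t"
        : "+r" (dst), "+r" (src), "+r" (blocks)
        :
        : "xmm0", "memory", "cc"
    );
#else
    memcpy(dst, src, blocks * 16);      /* portable fallback for other targets */
#endif
}
```

The numeric label (1: / 1b) is also worth noting: a named label such as loop_start can be emitted twice if the compiler inlines the function at two call sites, which the assembler rejects.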

-- 

Maksym Veremeyenko

--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Mlt-devel mailing list
Mlt-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mlt-devel


Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-16 Thread Maksym Veremeyenko

15.02.12 20:33, Dan Dennedy wrote:
[...]

Look real closely - it occurs more at the beginning when the geometry
is smaller. I can switch between the branch with this patch and master
and see it is different.

another attempt.
the only thing i have doubts about is the xmm register clobber list, which is
currently commented out...


--

Maksym Veremeyenko
From 45c8b653808e8bee5c832095f37cac1f193404f0 Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Thu, 16 Feb 2012 19:10:00 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c |  167 +
 src/modules/core/transition_composite.c   |   19 ++-
 2 files changed, 184 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..2ed4801
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,167 @@
+void composite_line_yuv_sse2_simple(uint8_t *dest, uint8_t *src, int width, uint8_t *alpha_b, uint8_t *alpha_a, int weight)
+{
+const static unsigned char const1[] =
+{
+0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+};
+
+__asm__ volatile
+(
+pxor   %%xmm0, %%xmm0  \n\t   /* clear zero register */
+movdqu (%4), %%xmm9\n\t   /* load const1 */
+movd   %0, %%xmm1  \n\t   /* load weight and decompose */
+movlhps %%xmm1, %%xmm1  \n\t
+pshuflw $0, %%xmm1, %%xmm1  \n\t
+pshufhw $0, %%xmm1, %%xmm1  \n\t
+
+/*
+xmm1 (weight)
+
+00  W 00  W 00  W 00  W 00  W 00  W 00  W 00  W
+*/
+loop_start:\n\t
+movq   (%1), %%xmm2\n\t   /* load source alpha */
+punpcklbw  %%xmm0, %%xmm2  \n\t   /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+/*
+xmm2 (src alpha)
+xmm3 (dst alpha)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+pmullw %%xmm1, %%xmm2  \n\t   /* premultiply source alpha */
+psrlw  $8, %%xmm2  \n\t
+
+/*
+xmm2 (premultiplied)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+
+
+/*
+DSTa = DSTa + ((SRCa * (0xFF - DSTa)) >> 8)
+*/
+movq   (%5), %%xmm3\n\t   /* load dst alpha */
+punpcklbw  %%xmm0, %%xmm3  \n\t   /* unpack dst 8 8-bits alphas to 8 16-bits values */
+movdqa %%xmm9, %%xmm4  \n\t
+psubw  %%xmm3, %%xmm4  \n\t
+pmullw %%xmm2, %%xmm4  \n\t
+psrlw  $8, %%xmm4  \n\t
+paddw  %%xmm4, %%xmm3  \n\t
+packuswb   %%xmm0, %%xmm3  \n\t
+movq   %%xmm3, (%5)\n\t   /* save dst alpha */
+
+movdqu (%2), %%xmm3\n\t   /* load src */
+movdqu (%3), %%xmm4\n\t   /* load dst */
+movdqa %%xmm3, %%xmm5  \n\t   /* dub src */
+movdqa %%xmm4, %%xmm6  \n\t   /* dub dst */
+
+/*
+xmm3 (src)
+xmm4 (dst)
+xmm5 (src)
+xmm6 (dst)
+
+U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+*/
+
+punpcklbw  %%xmm0, %%xmm5  \n\t   /* unpack src low */
+punpcklbw  %%xmm0, %%xmm6  \n\t   /* unpack dst low */
+punpckhbw  %%xmm0, %%xmm3  \n\t   /* unpack src high */
+punpckhbw  %%xmm0, %%xmm4  \n\t   /* unpack dst high */
+
+/*
+xmm5 (src_l)
+xmm6 (dst_l)
+
+00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+xmm3 (src_u)
+xmm4 (dst_u)
+
+00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+*/
+
+movdqa %%xmm2, %%xmm7  \n\t   /* dub alpha */
+movdqa %%xmm2, %%xmm8  \n\t   /* dub alpha */
+movlhps %%xmm7, %%xmm7  \n\t   /* dub low */
+movhlps %%xmm8, %%xmm8  \n\t   /* dub high */
+
+/*
+xmm7 (src alpha)
+
+00 A4 00 A3 00 A2 00 A1 00 A4 00 A3 00 A2 00 A1
+xmm8 (src alpha)
+
+00 A8 00 A7 00 A6 00 A5 00 A8 00 A7 00 A6 00 A5
+*/
+
+pshuflw $0x50, %%xmm7, %%xmm7 \n\t
+pshuflw $0x50, %%xmm8, %%xmm8 \n\t
+pshufhw $0xFA, 

Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-16 Thread Dan Dennedy
2012/2/16 Maksym Veremeyenko ve...@m1stereo.tv:
 15.02.12 20:33, Dan Dennedy wrote:
 [...]

 Look real closely - it occurs more at the beginning when the geometry
 is smaller. I can switch between the branch with this patch and master
 and see it is different.

 another one attempt.

this one works!

 the only thing i have doubts about is the xmm register clobber list, which is
 currently commented out...

Do you think it is OK to merge now, or does this mean I should wait?
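
On the clobber question: GCC's extended asm expects every register that the asm body overwrites, and that is not an output operand, to be named in the clobber list. Leaving the xmm registers out can appear to work until the compiler decides to keep a live value in one of them. A minimal sketch showing where the list goes, using a hypothetical add_sat16 helper that is not part of the patch:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): saturating add of 16 bytes,
 * dst[i] = min(255, dst[i] + src[i]). The asm overwrites xmm0 and xmm1 and
 * writes through dst, so all of that is declared in the clobber list.
 */
static void add_sat16(uint8_t *dst, const uint8_t *src)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile
    (
        "movdqu  (%0), %%xmm0   \n\t"   /* load dst bytes */
        "movdqu  (%1), %%xmm1   \n\t"   /* load src bytes */
        "paddusb %%xmm1, %%xmm0 \n\t"   /* unsigned saturating add */
        "movdqu  %%xmm0, (%0)   \n\t"   /* store the result */
        : /* no register outputs */
        : "r" (dst), "r" (src)
        : "xmm0", "xmm1", "memory"
    );
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 16; i++)
    {
        unsigned sum = (unsigned) dst[i] + src[i];
        dst[i] = (uint8_t) (sum > 255 ? 255 : sum);
    }
#endif
}
```

Under this reading, re-enabling the commented-out list (xmm0 through xmm9 for the patch, plus "memory") before merging would be the conservative choice.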

-- 
+-DRD-+



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-15 Thread Maksym Veremeyenko
15.02.12 05:33, Dan Dennedy wrote:
[...]
 OK, very close! But there is still one problem I noticed. On some
 geometry widths, the right edge of the B frame image is chopped off.
 This is reproduced in demo/mlt_my_name_is. On the first title that
 reads "My name is Inigo Montoya" notice how the right side of 'a' is
 cropped.


i can't reproduce it... did you apply the patch completely? because the newer
version has dropped these lines:

+   dest += j * 2;
+   src += j * 2;
+   alpha_a += j;
+   alpha_b += j;

because those values are updated in the asm code...

-- 

Maksym Veremeyenko



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-15 Thread Dan Dennedy
2012/2/15 Maksym Veremeyenko ve...@m1stereo.tv:
 15.02.12 05:33, Dan Dennedy wrote:
 [...]

 OK, very close! But there is still one problem I noticed. On some
 geometry widths, the right edge of the B frame image is chopped off.
 This is reproduced in demo/mlt_my_name_is. On the first title that
 reads "My name is Inigo Montoya" notice how the right side of 'a' is
 cropped.


 i can't reproduce it...

Look real closely - it occurs more at the beginning when the geometry
is smaller. I can switch between the branch with this patch and master
and see it is different.

 did you apply the patch completely? because the newer
 version has dropped these lines:

 +   dest += j * 2;
 +   src += j * 2;
 +   alpha_a += j;
 +   alpha_b += j;

 because those values are updated in the asm code...

yes, just double-checked. I will see if I can figure it out this
weekend because it is always nice to refresh myself on some simd asm.

-- 
+-DRD-+



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-14 Thread Maksym Veremeyenko

10.02.12 07:41, Dan Dennedy wrote:

2012/2/2 Maksym Veremeyenko ve...@m1stereo.tv:

Hi,

attached patch perform line compositing for SSE2+ARCH_X86_64 build. It works
for a case where luma is not defined...


Hi Maksym, did some more testing and ran into a couple of image
quality problems. First, alpha blending seems poor, mostly noticeable
with a text with curvy typeface over video:

melt clip1.dv -filter dynamictext:Hello size=200 outline=2
olcolour=white family=elegante bgcolor=0x0020

The first time you run that you will see that the alpha of bgcolour
(black with 12.5% opacity) is not honored and the background is black.
Set bgcolour=0 to make it completely transparent and look along curved
edges to see the poor blending.

The second problem is that key-framing opacity causes a repeating
cycle of 100% A frame, A+B blended, and 100% B frame. The below
reproduces it:

melt color:red -track color:blue -transition composite out=99
geometry=0=0/0:100%x100%:0; 99=0/0:100%x100%:100



i wrongly assumed the weight range was 0..255 - updated patch attached...
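
For reference, the scalar arithmetic behind this can be sketched as below. The 16-bit weight scale is an assumption inferred from the weight=65535 value in the gdb backtrace elsewhere in this thread, and all function names are hypothetical; the point is only that a weight on that scale has to be reduced before it is multiplied with 8-bit values in 16-bit SIMD lanes.

```c
#include <stdint.h>

/* Combine a 0..65535 weight with an 8-bit alpha into a 0..65535 mix factor. */
static int mix_factor(int weight, uint8_t alpha)
{
    return (weight * (alpha + 1)) >> 8;
}

/* Blend one byte: mix = 0 keeps dest, mix = 65535 gives almost exactly src. */
static uint8_t mix_sample(uint8_t dest, uint8_t src, int mix)
{
    return (uint8_t) ((src * mix + dest * ((1 << 16) - mix)) >> 16);
}

/*
 * Reduce a 16-bit weight to 0..255 so that a 16-bit lane multiply with an
 * 8-bit alpha cannot overflow (255 * 255 = 65025 still fits in 16 bits).
 */
static uint8_t weight_to_8bit(int weight)
{
    return (uint8_t) (weight >> 8);
}
```

With an 8-bit assumption and a 16-bit input, pmullw keeps only the low 16 bits of each product, which would be consistent with the repeating A / blended / B cycle reported above as the opacity is key-framed.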


--

Maksym Veremeyenko
From e8c8a1dde7883f203f609f364a27ea6c1a77104f Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Tue, 14 Feb 2012 13:34:12 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c |  164 +
 src/modules/core/transition_composite.c   |   12 ++-
 2 files changed, 174 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..f202828
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,164 @@
+const static unsigned char const1[] =
+{
+0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+};
+
+__asm__ volatile
+(
+pxor   %%xmm0, %%xmm0  \n\t   /* clear zero register */
+movdqu (%4), %%xmm9\n\t   /* load const1 */
+movd   %0, %%xmm1  \n\t   /* load weight and decompose */
+movlhps %%xmm1, %%xmm1  \n\t
+pshuflw $0, %%xmm1, %%xmm1  \n\t
+pshufhw $0, %%xmm1, %%xmm1  \n\t
+
+/*
+xmm1 (weight)
+
+00  W 00  W 00  W 00  W 00  W 00  W 00  W 00  W
+*/
+loop_start:\n\t
+movq   (%1), %%xmm2\n\t   /* load source alpha */
+punpcklbw  %%xmm0, %%xmm2  \n\t   /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+/*
+xmm2 (src alpha)
+xmm3 (dst alpha)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+pmullw %%xmm1, %%xmm2  \n\t   /* premultiply source alpha */
+psrlw  $8, %%xmm2  \n\t
+
+/*
+xmm2 (premultiplied)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+
+
+/*
+DSTa = DSTa + ((SRCa * (0xFF - DSTa)) >> 8)
+*/
+movq   (%5), %%xmm3\n\t   /* load dst alpha */
+punpcklbw  %%xmm0, %%xmm3  \n\t   /* unpack dst 8 8-bits alphas to 8 16-bits values */
+movdqa %%xmm9, %%xmm4  \n\t
+psubw  %%xmm3, %%xmm4  \n\t
+pmullw %%xmm2, %%xmm4  \n\t
+psrlw  $8, %%xmm4  \n\t
+paddw  %%xmm4, %%xmm3  \n\t
+packuswb   %%xmm0, %%xmm3  \n\t
+movq   %%xmm3, (%5)\n\t   /* save dst alpha */
+
+movdqu (%2), %%xmm3\n\t   /* load src */
+movdqu (%3), %%xmm4\n\t   /* load dst */
+movdqa %%xmm3, %%xmm5  \n\t   /* dub src */
+movdqa %%xmm4, %%xmm6  \n\t   /* dub dst */
+
+/*
+xmm3 (src)
+xmm4 (dst)
+xmm5 (src)
+xmm6 (dst)
+
+U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+*/
+
+punpcklbw  %%xmm0, %%xmm5  \n\t   /* unpack src low */
+punpcklbw  %%xmm0, %%xmm6  \n\t   /* unpack dst low */
+punpckhbw  %%xmm0, %%xmm3  \n\t   /* unpack src high */
+punpckhbw  %%xmm0, %%xmm4  \n\t   /* unpack dst high */
+
+/*
+xmm5 (src_l)
+xmm6 (dst_l)
+
+00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+xmm3 (src_u)
+xmm4 (dst_u)
+
+00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+*/
+
+movdqa 

Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-14 Thread Dan Dennedy
2012/2/14 Maksym Veremeyenko ve...@m1stereo.tv:
 10.02.12 07:41, Dan Dennedy wrote:

 2012/2/2 Maksym Veremeyenko ve...@m1stereo.tv:

 Hi,

 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
 works for a case where luma is not defined...


 Hi Maksym, did some more testing and ran into a couple of image
 quality problems. First, alpha blending seems poor, mostly noticeable
 with a text with curvy typeface over video:

 melt clip1.dv -filter dynamictext:Hello size=200 outline=2
 olcolour=white family=elegante bgcolor=0x0020

 The first time you run that you will see that the alpha of bgcolour
 (black with 12.5% opacity) is not honored and the background is black.
 Set bgcolour=0 to make it completely transparent and look along curved
 edges to see the poor blending.

 The second problem is that key-framing opacity causes a repeating
 cycle of 100% A frame, A+B blended, and 100% B frame. The below
 reproduces it:

 melt color:red -track color:blue -transition composite out=99
 geometry=0=0/0:100%x100%:0; 99=0/0:100%x100%:100


 i wrongly assumed the weight range was 0..255 - updated patch attached...

OK, very close! But there is still one problem I noticed. On some
geometry widths, the right edge of the B frame image is chopped off.
This is reproduced in demo/mlt_my_name_is. On the first title that
reads "My name is Inigo Montoya" notice how the right side of 'a' is
cropped.

-- 
+-DRD-+



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-09 Thread Dan Dennedy
2012/2/2 Maksym Veremeyenko ve...@m1stereo.tv:
 Hi,

 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It works
 for a case where luma is not defined...

Hi Maksym, did some more testing and ran into a couple of image
quality problems. First, alpha blending seems poor, mostly noticeable
with a text with curvy typeface over video:

melt clip1.dv -filter dynamictext:Hello size=200 outline=2
olcolour=white family=elegante bgcolor=0x0020

The first time you run that you will see that the alpha of bgcolour
(black with 12.5% opacity) is not honored and the background is black.
Set bgcolour=0 to make it completely transparent and look along curved
edges to see the poor blending.

The second problem is that key-framing opacity causes a repeating
cycle of 100% A frame, A+B blended, and 100% B frame. The below
reproduces it:

melt color:red -track color:blue -transition composite out=99
geometry="0=0/0:100%x100%:0; 99=0/0:100%x100%:100"

-- 
+-DRD-+



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-08 Thread Maksym Veremeyenko
08.02.12 06:45, Dan Dennedy wrote:
 On Tue, Feb 7, 2012 at 8:40 PM, Dan Dennedy d...@dennedy.org wrote:
 2012/2/6 Maksym Veremeyenko ve...@m1stereo.tv:
 04.02.12 22:25, Dan Dennedy wrote:

 2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:

 02.02.12 18:57, Maksym Veremeyenko wrote:

 Hi,

 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
 works for a case where luma is not defined...



 updated patch attached



 i am still testing it... maybe you can take a look to check that everything is processed fine.

 I started testing with the demos in the demo/ directory. It mostly
 works; however, mlt_my_name_is crashes on me (and one time
 mlt_squeeze_box):

 (gdb) bt
 #0  0x7fffdecced14 in composite_line_yuv (
 dest=0x25bae60  ~\037~
 ~\037~!~-\177\064~2\177\065}5\200\064}3\200\062}3\201\064}6\201\063|0\202-|+\202+|+\202,|-\202/{0\203\060{1\203\061{1\203\061{1\203\061|2\202\062|1\202/{/\202\060{1\202-{,\202-{1\202\062{2\202\065{:\202\062{3\203\063{3\203\063z6\203;z@\203=z@\203BzD\203JzR\204WzW\204^ya\202fyj\202myn\202nyn\202wyv\202yy~\202\177y|\202|y\177\202{y{\202|y~\202\177y}\201|y}\201{y{\201{yy\201xyw\201wyx\201w{w\201t{p\201...,
 src=0x2c621f0
 width=0, alpha_b=0x2c16e00 ,

 Expanding the test to include width fixed that for me.

   if ( !luma && width )

 Can you think of some other checks we should add?

seems i missed those checks, i think it should be:

 if ( !luma && width > 7 )

-- 

Maksym Veremeyenko



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-07 Thread Dan Dennedy
2012/2/6 Maksym Veremeyenko ve...@m1stereo.tv:
 04.02.12 22:25, Dan Dennedy wrote:

 2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:

 02.02.12 18:57, Maksym Veremeyenko wrote:

 Hi,

 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
 works for a case where luma is not defined...



 updated patch attached


 If I am not mistaken, this change reduces precision to 8 pixels. The
 existing transition is already limited to a 2 pixel precision, which I
 am not happy about. I do not want to further reduce the precision,
 give different results depending on CPU, and effectively introduce a
 regression, as far as the user is concerned. Maybe we should limit it
 to only
 apply when width is a multiple of 8. Then, it would still be used for
 fullscreen composite on most profiles' resolution.


 not exactly...

 the sse2 code processes groups of 8 pixels; the tail of 1...7 pixels is
 processed by native code - that is why i did not create a standalone
 function but put the code into the existing composite_yuv code...

 i am still testing it... maybe you can take a look to check that everything is processed fine.

I started testing with the demos in the demo/ directory. It mostly
works; however, mlt_my_name_is crashes on me (and one time
mlt_squeeze_box):

(gdb) bt
#0  0x7fffdecced14 in composite_line_yuv (
dest=0x25bae60  ~\037~
~\037~!~-\177\064~2\177\065}5\200\064}3\200\062}3\201\064}6\201\063|0\202-|+\202+|+\202,|-\202/{0\203\060{1\203\061{1\203\061{1\203\061|2\202\062|1\202/{/\202\060{1\202-{,\202-{1\202\062{2\202\065{:\202\062{3\203\063{3\203\063z6\203;z@\203=z@\203BzD\203JzR\204WzW\204^ya\202fyj\202myn\202nyn\202wyv\202yy~\202\177y|\202|y\177\202{y{\202|y~\202\177y}\201|y}\201{y{\201{yy\201xyw\201wyx\201w{w\201t{p\201...,
src=0x2c621f0
\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200\020\200...,
width=0, alpha_b=0x2c16e00 ,
alpha_a=0x1ee9a30
\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377...,
weight=65535, luma=0x0, soft=0,
step=65535) at composite_line_yuv_sse2_simple.c:6


-- 
+-DRD-+



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-06 Thread Dan Dennedy
2012/2/6 Maksym Veremeyenko ve...@m1stereo.tv:
 04.02.12 22:25, Dan Dennedy wrote:

 2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:

 02.02.12 18:57, Maksym Veremeyenko wrote:

 Hi,

 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
 works for a case where luma is not defined...



 updated patch attached


 If I am not mistaken, this change reduces precision to 8 pixels. The
 existing transition is already limited to a 2 pixel precision, which I
 am not happy about. I do not want to further reduce the precision,
 give different results depending on CPU, and effectively introduce a
 regression, as far as the user is concerned. Maybe we should limit it
 to only
 apply when width is a multiple of 8. Then, it would still be used for
 fullscreen composite on most profiles' resolution.


 not exactly...

 the sse2 code processes groups of 8 pixels; the tail of 1...7 pixels is
 processed by native code - that is why i did not create a standalone
 function but put the code into the existing composite_yuv code...

ah, by native code I think you mean C, and if so, sorry, I overlooked that
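
The split being described can be sketched in plain C as follows. This is an illustration only, not the code from the patch: blend_pixel, blend_block8 and composite_line_sketch are hypothetical names, the blend formula is a simple stand-in, and the 8-pixel block is written in C where the patch uses SSE2.

```c
#include <stdint.h>

/* Hypothetical scalar blend of one YUV 4:2:2 pixel (2 bytes per pixel). */
static void blend_pixel(uint8_t *dest, const uint8_t *src, uint8_t alpha)
{
    for (int k = 0; k < 2; k++)
        dest[k] = (uint8_t) ((src[k] * alpha + dest[k] * (255 - alpha)) / 255);
}

/* Stand-in for the SSE2 body: handles exactly 8 pixels per call. */
static void blend_block8(uint8_t *dest, const uint8_t *src, const uint8_t *alpha)
{
    for (int i = 0; i < 8; i++)
        blend_pixel(dest + i * 2, src + i * 2, alpha[i]);
}

static void composite_line_sketch(uint8_t *dest, const uint8_t *src,
                                  const uint8_t *alpha, int width)
{
    int j = 0;

    /* Whole groups of 8 pixels; skipped entirely when width < 8. */
    for (; j + 8 <= width; j += 8)
        blend_block8(dest + j * 2, src + j * 2, alpha + j);

    /* Tail of 0..7 pixels, handled one pixel at a time. */
    for (; j < width; j++)
        blend_pixel(dest + j * 2, src + j * 2, alpha[j]);
}
```

Because the tail loop resumes at j, every pixel up to width is still touched, so precision stays at one pixel. A width of 0 falls through both loops without reading or writing anything.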

 i am still testing it... maybe you can take a look to check that everything is processed fine.

I will give it another review and testing. Your other items are still
on my todo list, but have been busy lately dealing with libav- and
ffmpeg-integration bugs.

-- 
+-DRD-+



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-06 Thread Maksym Veremeyenko
06.02.12 18:47, Dan Dennedy wrote:
[...]
 I will give it another review and testing. Your other items are still
 on my todo list, but have been busy lately dealing with libav- and
 ffmpeg-integration bugs.


offtopic: why is an alpha plane created if one does not exist? why not pass
NULL into the composite_yuv_??? function? i think that for 1920x1080, creating
a non-transparent alpha plane could slow the system down a bit.

-- 

Maksym Veremeyenko



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-04 Thread Dan Dennedy
2012/2/3 Maksym Veremeyenko ve...@m1stereo.tv:
 02.02.12 18:57, Maksym Veremeyenko wrote:

 Hi,

 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
 works for a case where luma is not defined...


 updated patch attached

If I am not mistaken, this change reduces precision to 8 pixels. The
existing transition is already limited to a 2 pixel precision, which I
am not happy about. I do not want to further reduce the precision,
give different results depending on CPU, and effectively introduce a
regression, as far as the user is concerned. Maybe we should limit it
to only apply when width is a multiple of 8. Then, it would still be
used for fullscreen composite on most profiles' resolutions.

-- 
+-DRD-+



Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-03 Thread Maksym Veremeyenko

02.02.12 18:57, Maksym Veremeyenko wrote:

Hi,

attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
works for a case where luma is not defined...


updated patch attached

--

Maksym Veremeyenko
From d0a46a3308b390228e6d4337b24010ae3cecef7f Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Fri, 3 Feb 2012 13:19:12 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c |  164 +
 src/modules/core/transition_composite.c   |   16 ++-
 2 files changed, 178 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..bd977e1
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,164 @@
+const static unsigned char const1[] =
+{
+0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+};
+
+__asm__ volatile
+(
+pxor   %%xmm0, %%xmm0  \n\t   /* clear zero register */
+movdqu (%4), %%xmm9\n\t   /* load const1 */
+movd   %0, %%xmm1  \n\t   /* load weight and decompose */
+movlhps %%xmm1, %%xmm1  \n\t
+pshuflw $0, %%xmm1, %%xmm1  \n\t
+pshufhw $0, %%xmm1, %%xmm1  \n\t
+
+/*
+xmm1 (weight)
+
+00  W 00  W 00  W 00  W 00  W 00  W 00  W 00  W
+*/
+loop_start:\n\t
+movq   (%1), %%xmm2\n\t   /* load source alpha */
+punpcklbw  %%xmm0, %%xmm2  \n\t   /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+/*
+xmm2 (src alpha)
+xmm3 (dst alpha)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+pmullw %%xmm1, %%xmm2  \n\t   /* premultiply source alpha */
+psrlw  $8, %%xmm2  \n\t
+
+/*
+xmm2 (premultiplied)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+
+
+/*
+DSTa = DSTa + ((SRCa * (0xFF - DSTa)) >> 8)
+*/
+movq   (%5), %%xmm3\n\t   /* load dst alpha */
+punpcklbw  %%xmm0, %%xmm3  \n\t   /* unpack dst 8 8-bits alphas to 8 16-bits values */
+movdqa %%xmm9, %%xmm4  \n\t
+psubw  %%xmm3, %%xmm4  \n\t
+pmullw %%xmm2, %%xmm4  \n\t
+psrlw  $8, %%xmm4  \n\t
+paddw  %%xmm4, %%xmm3  \n\t
+packuswb   %%xmm0, %%xmm3  \n\t
+movq   %%xmm3, (%5)\n\t   /* load dst alpha */
+
+movdqu (%2), %%xmm3\n\t   /* load src */
+movdqu (%3), %%xmm4\n\t   /* load dst */
+movdqa %%xmm3, %%xmm5  \n\t   /* dub src */
+movdqa %%xmm4, %%xmm6  \n\t   /* dub dst */
+
+/*
+xmm3 (src)
+xmm4 (dst)
+xmm5 (src)
+xmm6 (dst)
+
+U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+*/
+
+punpcklbw  %%xmm0, %%xmm5  \n\t   /* unpack src low */
+punpcklbw  %%xmm0, %%xmm6  \n\t   /* unpack dst low */
+punpckhbw  %%xmm0, %%xmm3  \n\t   /* unpack src high */
+punpckhbw  %%xmm0, %%xmm4  \n\t   /* unpack dst high */
+
+/*
+xmm5 (src_l)
+xmm6 (dst_l)
+
+00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+xmm3 (src_u)
+xmm4 (dst_u)
+
+00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+*/
+
+movdqa %%xmm2, %%xmm7  \n\t   /* dub alpha */
+movdqa %%xmm2, %%xmm8  \n\t   /* dub alpha */
+movlhps %%xmm7, %%xmm7  \n\t   /* dub low */
+movhlps %%xmm8, %%xmm8  \n\t   /* dub high */
+
+/*
+xmm7 (src alpha)
+
+00 A4 00 A3 00 A2 00 A1 00 A4 00 A3 00 A2 00 A1
+xmm8 (src alpha)
+
+00 A8 00 A7 00 A6 00 A5 00 A8 00 A7 00 A6 00 A5
+*/
+
+pshuflw $0x50, %%xmm7, %%xmm7 \n\t
+pshuflw $0x50, %%xmm8, %%xmm8 \n\t
+pshufhw $0xFA, %%xmm7, %%xmm7 \n\t
+pshufhw $0xFA, %%xmm8, %%xmm8 \n\t
+
+/*
+xmm7 (src alpha lower)
+
+00 A4 00 A4 00 A3 00 A3 00 A2 00 A2 00 A1 00 A1
+
+xmm8 (src alpha upper)
+  

[Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-02 Thread Maksym Veremeyenko

Hi,

the attached patch performs line compositing for SSE2+ARCH_X86_64 builds. It
works for the case where luma is not defined...


--

Maksym Veremeyenko
From 73dca48f8e4a470140ab4d70d2002c6ff39017ef Mon Sep 17 00:00:00 2001
From: Maksym Veremeyenko ve...@m1stereo.tv
Date: Thu, 2 Feb 2012 18:03:07 +0200
Subject: [PATCH] use sse2 instruction for line compositing

---
 src/modules/core/composite_line_yuv_sse2_simple.c |  164 +
 src/modules/core/transition_composite.c   |   12 ++-
 2 files changed, 174 insertions(+), 2 deletions(-)
 create mode 100644 src/modules/core/composite_line_yuv_sse2_simple.c

diff --git a/src/modules/core/composite_line_yuv_sse2_simple.c b/src/modules/core/composite_line_yuv_sse2_simple.c
new file mode 100644
index 000..bd977e1
--- /dev/null
+++ b/src/modules/core/composite_line_yuv_sse2_simple.c
@@ -0,0 +1,164 @@
+const static unsigned char const1[] =
+{
+0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00
+};
+
+__asm__ volatile
+(
+pxor   %%xmm0, %%xmm0  \n\t   /* clear zero register */
+movdqu (%4), %%xmm9\n\t   /* load const1 */
+movd   %0, %%xmm1  \n\t   /* load weight and decompose */
+movlhps %%xmm1, %%xmm1  \n\t
+pshuflw $0, %%xmm1, %%xmm1  \n\t
+pshufhw $0, %%xmm1, %%xmm1  \n\t
+
+/*
+xmm1 (weight)
+
+00  W 00  W 00  W 00  W 00  W 00  W 00  W 00  W
+*/
+loop_start:\n\t
+movq   (%1), %%xmm2\n\t   /* load source alpha */
+punpcklbw  %%xmm0, %%xmm2  \n\t   /* unpack alpha 8 8-bits alphas to 8 16-bits values */
+
+/*
+xmm2 (src alpha)
+xmm3 (dst alpha)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+pmullw %%xmm1, %%xmm2  \n\t   /* premultiply source alpha */
+psrlw  $8, %%xmm2  \n\t
+
+/*
+xmm2 (premultiplied)
+
+00 A8 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1
+*/
+
+
+/*
+DSTa = DSTa + ((SRCa * (0xFF - DSTa)) >> 8)
+*/
+movq   (%5), %%xmm3\n\t   /* load dst alpha */
+punpcklbw  %%xmm0, %%xmm3  \n\t   /* unpack dst 8 8-bits alphas to 8 16-bits values */
+movdqa %%xmm9, %%xmm4  \n\t
+psubw  %%xmm3, %%xmm4  \n\t
+pmullw %%xmm2, %%xmm4  \n\t
+psrlw  $8, %%xmm4  \n\t
+paddw  %%xmm4, %%xmm3  \n\t
+packuswb   %%xmm0, %%xmm3  \n\t
+movq   %%xmm3, (%5)\n\t   /* load dst alpha */
+
+movdqu (%2), %%xmm3\n\t   /* load src */
+movdqu (%3), %%xmm4\n\t   /* load dst */
+movdqa %%xmm3, %%xmm5  \n\t   /* dub src */
+movdqa %%xmm4, %%xmm6  \n\t   /* dub dst */
+
+/*
+xmm3 (src)
+xmm4 (dst)
+xmm5 (src)
+xmm6 (dst)
+
+U8 V8 U7 V7 U6 V6 U5 V5 U4 V4 U3 V3 U2 V2 U1 V1
+*/
+
+punpcklbw  %%xmm0, %%xmm5  \n\t   /* unpack src low */
+punpcklbw  %%xmm0, %%xmm6  \n\t   /* unpack dst low */
+punpckhbw  %%xmm0, %%xmm3  \n\t   /* unpack src high */
+punpckhbw  %%xmm0, %%xmm4  \n\t   /* unpack dst high */
+
+/*
+xmm5 (src_l)
+xmm6 (dst_l)
+
+00 U4 00 V4 00 U3 00 V3 00 U2 00 V2 00 U1 00 V1
+
+xmm3 (src_u)
+xmm4 (dst_u)
+
+00 U8 00 V8 00 U7 00 V7 00 U6 00 V6 00 U5 00 V5
+*/
+
+movdqa %%xmm2, %%xmm7  \n\t   /* dub alpha */
+movdqa %%xmm2, %%xmm8  \n\t   /* dub alpha */
+movlhps %%xmm7, %%xmm7  \n\t   /* dub low */
+movhlps %%xmm8, %%xmm8  \n\t   /* dub high */
+
+/*
+xmm7 (src alpha)
+
+00 A4 00 A3 00 A2 00 A1 00 A4 00 A3 00 A2 00 A1
+xmm8 (src alpha)
+
+00 A8 00 A7 00 A6 00 A5 00 A8 00 A7 00 A6 00 A5
+*/
+
+pshuflw $0x50, %%xmm7, %%xmm7 \n\t
+pshuflw $0x50, %%xmm8, %%xmm8 \n\t
+pshufhw $0xFA, %%xmm7, %%xmm7 \n\t
+pshufhw $0xFA, %%xmm8, %%xmm8 \n\t
+
+/*
+xmm7 (src alpha lower)
+
+00 A4 00 A4 00 A3 00 A3 00 A2 00 A2 00 A1 00 A1
+
+xmm8 (src alpha upper)
+00 A8 00 A8 00 A7 00 A7 00 A6 00 A6 00 A5 00 A5
+*/
+
+
+ 

Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-02 Thread Patrick Matthäi
On 02.02.2012 17:57, Maksym Veremeyenko wrote:
 Hi,
 
 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
 works for a case where luma is not defined...
 

Is there a reason why it is only available on amd64?
Surely the code should be disabled if mlt is configured without SSE
support, but a user with a modern CPU but an i386 userland/kernel may
still want to benefit from it?


-- 
/*
Mit freundlichem Gruß / With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

E-Mail: pmatth...@debian.org
patr...@linux-dev.org
*/





Re: [Mlt-devel] [PATCH] use sse2 instruction for line compositing

2012-02-02 Thread Maksym Veremeyenko
02.02.12 19:01, Patrick Matthäi wrote:
 On 02.02.2012 17:57, Maksym Veremeyenko wrote:
 Hi,

 attached patch perform line compositing for SSE2+ARCH_X86_64 build. It
 works for a case where luma is not defined...


 Is there a reason why it is only available on amd64?
because it uses xmm8 and xmm9, which are available only in x64 mode...

 Surely the code should be disabled if mlt is configured without SSE
 support,
it is only used under:
[...]
#if defined(USE_SSE) && defined(ARCH_X86_64)

 but a user with a modern CPU but an i386 userland/kernel may
 still want to benefit from it?
the code could be optimized by no longer keeping two constants in registers...
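
Another way to sidestep the xmm8/xmm9 dependency, sketched here as an assumption and not something proposed in this thread, is to write the kernel with SSE2 intrinsics so that the compiler allocates registers itself; that also builds for 32-bit targets. The sketch covers only the destination-alpha update that the patch implements with psubw, pmullw, psrlw and paddw, for 8 pixels at a time, and leaves out the weight premultiply. blend_alpha8 is a hypothetical name.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: for 8 alphas,
 *   dst_a[i] = dst_a[i] + ((src_a[i] * (0xFF - dst_a[i])) >> 8)
 */
static void blend_alpha8(uint8_t *dst_a, const uint8_t *src_a)
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i ff   = _mm_set1_epi16(0x00FF);

    /* Load 8 bytes each and widen them to 8 unsigned 16-bit lanes. */
    __m128i s = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *) src_a), zero);
    __m128i d = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *) dst_a), zero);

    /* (src * (0xFF - dst)) >> 8; the product is at most 65025, so it fits. */
    __m128i t = _mm_srli_epi16(_mm_mullo_epi16(s, _mm_sub_epi16(ff, d)), 8);

    /* Add, pack back to bytes and store the low 8 of them. */
    d = _mm_packus_epi16(_mm_add_epi16(d, t), zero);
    _mm_storel_epi64((__m128i *) dst_a, d);
#else
    /* Portable fallback with the same arithmetic. */
    for (int i = 0; i < 8; i++)
        dst_a[i] = (uint8_t) (dst_a[i] + ((src_a[i] * (0xFF - dst_a[i])) >> 8));
#endif
}
```

The trade-off is less control over instruction scheduling than hand-written asm, in exchange for not having to maintain operand constraints and clobber lists by hand.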

-- 

Maksym Veremeyenko
