Re: [osg-users] [osg-submissions] Matrixf multiply Optimization

2008-07-27 Thread David Spilling
I think that this general question (of SSE integration) ought to be pushed
out onto the osg-users mailing list. For example, I can't see any reason why
all Vec4f and Matrix4f can't always be aligned anyway, although I realise
that my range of apps might be limited. Even Vec4d and Matrix4d might
benefit from SSE2, for example.

From my experience, SSE doesn't hurt performance. I agree with Robert that
the most benefit comes from array operations; using SSE to perform a single
vector x-product (i.e. horizontal operations) doesn't help _that_ much,
but it does help a bit. My main issue with going in the direction of array
operations is that I don't think we could offer sufficient operators to be
useful in the general case - people do all kinds of maths things specific to
their problem - but SSEing the simple operations where the maths is obvious,
e.g. James' attack on the Vec/Matrix libraries does seem to be appropriate.
Supporting the general SSE case with aligned vectors and things would be
good (e.g. the osgsharedarray example class is very useful to provide
aligned wrappers).
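
A minimal sketch of what an "array operation" could look like with SSE,
assuming a hypothetical helper addVec4Arrays (not an OSG function) that adds
two contiguous arrays of 4-float vectors. It uses unaligned loads and stores
so that it makes no assumption about how the arrays were allocated:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical array operator, not part of OSG:
// dst[i] = a[i] + b[i] for `count` vectors of 4 floats stored back to back.
inline void addVec4Arrays(const float* a, const float* b, float* dst, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        // One packed add handles all four components of vector i.
        const __m128 va = _mm_loadu_ps(a + 4 * i);
        const __m128 vb = _mm_loadu_ps(b + 4 * i);
        _mm_storeu_ps(dst + 4 * i, _mm_add_ps(va, vb));
    }
}
```

If the arrays are known to be 16-byte aligned, the aligned forms
_mm_load_ps and _mm_store_ps could be used instead.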

David
___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] [osg-submissions] Matrixf multiply Optimization

2008-07-27 Thread Philip Taylor
As I understand it, the pragma alignment only applies to predefined
objects and data.

Dynamically allocated objects are created at the mercy of the memory
allocator (via new) which uses byte alignment, as it does not know anything
about pragma definitions.

The only approach is to dynamically create data with extra padding, and then
do some alignment coding, something like:

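// Pad by the alignment (16 bytes assumed here, for SSE), then round the address up.
char* pBuffer = new char[ (200 * sizeof(int32)) + 16 ];
char* pAligned = pBuffer + ((16 - (reinterpret_cast<size_t>(pBuffer) & 15)) & 15);
int32* p = new (pAligned) int32[ 200 ];

  and then later

delete [] pBuffer;  // DO NOT delete p
pBuffer = 0;
p = 0;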

instead of the simpler but unaligned

int32* p = new int32[ 200 ];

You will have to experiment with the actual code required, as I am not too
certain, and my brain is still in bed.


PhilT


Re: [osg-users] [osg-submissions] Matrixf multiply Optimization

2008-07-27 Thread David Spilling
MS uses _aligned_malloc (and _aligned_free) and __declspec(align(16)).

I think gcc uses something like __attribute__((__aligned__(16))), but I'm
not sure whether that's OK for dynamic allocation.

Intel's MKL, and others, provide cross-platform aligned mallocs, so we might
be able to find something similar. Or just create a new Vec4f / Matrix4f
type with an overridden new operator.
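
A minimal sketch of the overridden-new idea, assuming a hypothetical
AlignedVec4f type and hypothetical helpers alignedAlloc16 / alignedFree16
(none of these are OSG names) and a 16-byte alignment for SSE. It uses
_aligned_malloc / _aligned_free under MSVC and posix_memalign / free
elsewhere:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#ifdef _MSC_VER
#include <malloc.h>
#endif

// Hypothetical helper: returns 16-byte aligned storage, or a null pointer on failure.
inline void* alignedAlloc16(std::size_t size)
{
#ifdef _MSC_VER
    return _aligned_malloc(size, 16);
#else
    void* p = 0;
    if (posix_memalign(&p, 16, size) != 0) return 0;
    return p;
#endif
}

// Hypothetical helper: releases storage obtained from alignedAlloc16.
inline void alignedFree16(void* p)
{
#ifdef _MSC_VER
    _aligned_free(p);
#else
    std::free(p);
#endif
}

// Hypothetical vector type whose heap instances are always 16-byte aligned.
struct AlignedVec4f
{
    float v[4];

    static void* operator new(std::size_t size)
    {
        void* p = alignedAlloc16(size);
        if (!p) throw std::bad_alloc();
        return p;
    }

    static void operator delete(void* p)
    {
        alignedFree16(p);
    }
};
```

This only covers single objects created with new; arrays (new[]) and
objects held by value inside containers would need the same treatment.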

David


Re: [osg-users] [osg-submissions] Matrixf multiply Optimization

2008-07-26 Thread Gordon Tomlinson

Can you not use an alignment #pragma around the struct to force alignment
size?


#pragma pack( push, 16 )

 union
 {
     struct
     {
         __m128 _R0,_R1,_R2,_R3;
     };
     value_type _mat[4][4];
 };

#pragma pack( pop )


__
Gordon Tomlinson 
__


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of James
Killian
Sent: Saturday, July 26, 2008 7:23 PM
To: OpenSceneGraph Submissions
Subject: Re: [osg-submissions] Matrixf multiply Optimization


That is cool if that is all that needs to be fixed... I'll make a generic
version of F32vec4 and include it in the next submission to see if it can
build on other platforms.

James Killian
- Original Message - 
From: David Guthrie [EMAIL PROTECTED]
To: OpenSceneGraph Submissions [EMAIL PROTECTED]
Sent: Friday, July 25, 2008 9:07 PM
Subject: Re: [osg-submissions] Matrixf multiply Optimization


I looked at the code, and it should work cross platform, at least for
Intel CPUs.  The fvec.h header doesn't seem to exist, but from what I can
tell, it doesn't have any magic in it.  The few types you used may be easy
to just replace.  They seemed just to be unions, anyway.

 David

 On Jul 25, 2008, at 5:49 PM, James Killian wrote:


 It is good to hold off as this is still a work in progress.  In the
 meantime, it would be cool for others to code review the work I've checked
 in thus far.  If I recall, the FFmpeg community has found a way to use
 intrinsics that is platform independent; once I get the win32 version
 polished I may research that.

 For anyone interested, the C version of the matrix multiply uses 64
 multiplies and adds, while the SSE version uses only 16 of each.
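
To illustrate where a count like that comes from, here is a minimal sketch
of one common SSE formulation of a row-major 4x4 float multiply. It is not
the submitted code, and multiply4x4SSE is a hypothetical name. Each result
row is built from 4 packed multiplies and 3 packed adds, so the whole
product takes 16 packed multiplies and 12 packed adds:

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical helper, not the submitted OSG code: r = a * b for row-major
// 4x4 float matrices.  Row i of r is the sum of the rows of b, each scaled
// by the matching element of row i of a.
inline void multiply4x4SSE(const float a[4][4], const float b[4][4], float r[4][4])
{
    // Unaligned loads, so the sketch needs no alignment guarantee.
    const __m128 b0 = _mm_loadu_ps(b[0]);
    const __m128 b1 = _mm_loadu_ps(b[1]);
    const __m128 b2 = _mm_loadu_ps(b[2]);
    const __m128 b3 = _mm_loadu_ps(b[3]);

    for (int i = 0; i < 4; ++i)
    {
        // Broadcast each scalar of row i across a register and accumulate.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[i][0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i][1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i][2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i][3]), b3));
        _mm_storeu_ps(r[i], row);
    }
}
```

With matrices known to be 16-byte aligned, _mm_load_ps and _mm_store_ps
could replace the unaligned forms.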

 In regards to going in and out of SSE I tried this:
 union
 {
     struct
     {
         __m128 _R0,_R1,_R2,_R3;
     };
     value_type _mat[4][4];
 };

 And this works, as it forces the array to be 16-byte aligned implicitly.
 Unfortunately I ran into problems where code that stored the matrix in a
 vector would throw compiler errors saying it can't align it.  (I may
 revisit that case and see why that is.)


 What I am hoping will happen is that this new code will work out, and we
 can gradually transition some of the most used pieces to take advantage of
 the instruction set (platform independent, of course).



 - Original Message -
 From: Robert Osfield [EMAIL PROTECTED]
 To: OpenSceneGraph Submissions 
 [EMAIL PROTECTED]
 
 Sent: Friday, July 25, 2008 3:09 PM
 Subject: Re: [osg-submissions] Matrixf multiply Optimization


 Hi James,

 I will put this submission on hold till after 2.6 as we are now at feature
 freeze.

 W.r.t. SSE optimizations, in the past I have considered the possibility,
 but haven't taken the step - there have always been bigger bottlenecks to
 address.  One concern I have is the cost of going in and out of SSE
 mode.  I suspect the most efficient way to do it would be to provide
 array operators.

 I think these types of optimizations would be worth raising on the
 mailing lists as there is a lot of knowledge out there and a whole range
 of topics.

 Robert.

 On Fri, Jul 25, 2008 at 8:55 PM, James Killian
 [EMAIL PROTECTED] wrote:

 Attached are the 3 matrix cpp files, merged with 8686.  For non-win32
 platforms there is no change; for win32 platforms I've added SSE
 optimization for Matrix::mult, premult and postmult.  This is currently the
 first draft, which will yield about 35-40% improvement over matrixf or
 matrixd.  I may pursue alignment strategies, which have yielded 50%
 improvement (this is yet to come).  I also may want to look to improve
 premult.

 Our game spends approximately 25% of all processing in these functions (the
 KBDtree optimization is enabled), so if anyone else is doing the same kind
 of stresses, hopefully you should see improvement as well.

 There may be a way to enable intrinsic code across all platforms; if so, we
 may want to pursue that.
 You should be able to drop these files right in and build.  (Win32 users, be
 sure to use matrix float in the cmake configuration.)
 I did not try to optimize Matrixd, as I don't think intrinsics can offer
 much improvement for it (yet), so it has not changed.

 ___
 osg-submissions mailing list
 [EMAIL PROTECTED]


http://lists.openscenegraph.org/listinfo.cgi/osg-submissions-openscenegraph.org

