Re: [osg-users] [osg-submissions] Matrixf multiply Optimization
I think that this general question (of SSE integration) ought to be pushed out onto the osg-users mailing list. For example, I can't see any reason why all Vec4f and Matrix4f can't always be aligned anyway, although I realise that my range of apps might be limited. Even Vec4d and Matrix4d might benefit from SSE2, for example. In my experience, SSE doesn't hurt performance.

I agree with Robert that the most benefit comes from array operations; using SSE to perform a single vector x-product (i.e. horizontal operations) doesn't help _that_ much, but it does help a bit. My main issue with going in the direction of array operations is that I don't think we could offer enough operators to be useful in the general case - people do all kinds of maths specific to their problem - but SSEing the simple operations where the maths is obvious, e.g. James' attack on the Vec/Matrix libraries, does seem appropriate. Supporting the general SSE case with aligned vectors would be good (e.g. the osgsharedarray example class is very useful for providing aligned wrappers).

David
_______________________________________________
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org
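A minimal sketch of the "always aligned Vec4f" idea mentioned above, assuming a C++11 compiler. `AlignedVec4f`, `add` and `isAligned16` are hypothetical names for illustration and are not part of OSG; the scalar branch is only there so the sketch also builds where SSE is unavailable.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define VEC_SKETCH_USE_SSE 1
#else
#define VEC_SKETCH_USE_SSE 0
#endif

// Hypothetical stand-in for an always-aligned Vec4f (not the real osg::Vec4f).
// alignas(16) covers stack, static and member storage of this type.
struct alignas(16) AlignedVec4f {
    float v[4];
};

// True when p sits on a 16-byte boundary.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Component-wise add: one aligned load per operand, one addps, one aligned store.
inline AlignedVec4f add(const AlignedVec4f& a, const AlignedVec4f& b) {
    assert(isAligned16(a.v) && isAligned16(b.v));
    AlignedVec4f r;
#if VEC_SKETCH_USE_SSE
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
#else
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];  // scalar fallback
#endif
    return r;
}
```

The aligned load and store intrinsics fault on misaligned addresses, which is why the type itself carries the alignment rather than leaving it to each call site.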
Re: [osg-users] [osg-submissions] Matrixf multiply Optimization
As I understand it, the alignment pragma only applies to predefined objects and data. Dynamically allocated objects are created at the mercy of the memory allocator (via new), which uses byte alignment, as it does not know anything about pragma definitions.

The only approach is to dynamically create the data with extra padding, and then do some alignment coding, something like:

    char* pBuffer = new char[ (200 * sizeof(int32)) + sizeof(int64) ];
    int32* p = new (pBuffer) int32[ 200 ];

and then later

    delete [] pBuffer; // DO NOT delete p
    pBuffer = 0;
    p = 0;

instead of the simpler but unaligned

    int32* p = new int32[ 200 ];

You will have to experiment with the actual code required, as I am not too certain, and my brain is still in bed.

PhilT

-----Original Message-----
From: Gordon Tomlinson
Sent: 27 July 2008 03:05
To: osg-users@lists.openscenegraph.org
Subject: Re: [osg-users] [osg-submissions] Matrixf multiply Optimization

Can you not use an alignment #pragma around the struct to force alignment size?

    #pragma pack( push, 16 )
    union
    {
        struct { __m128 _R0,_R1,_R2,_R3; };
        value_type _mat[4][4];
    };
    #pragma pack( pop )

Gordon Tomlinson
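The padding approach described in this message can be sketched as follows, assuming C++11. The snippet as posted pads the buffer but places the array at the unadjusted start; the step added here is rounding the pointer up with `std::align`. `AlignedInts`, `allocAlignedInts` and `freeAlignedInts` are hypothetical helper names, and `std::int32_t` stands in for the `int32` typedef.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical helper: keeps both the raw block and the aligned view of it.
struct AlignedInts {
    char* raw;            // pointer returned by new[]; the only one to delete
    std::int32_t* data;   // 16-byte aligned pointer somewhere inside raw
};

inline AlignedInts allocAlignedInts(std::size_t count, std::size_t alignment = 16) {
    const std::size_t bytes = count * sizeof(std::int32_t);
    std::size_t space = bytes + alignment;   // padding so an aligned start always fits
    char* raw = new char[space];

    void* cursor = raw;
    // std::align moves cursor up to the next multiple of alignment,
    // or returns nullptr if bytes no longer fit in the remaining space.
    void* aligned = std::align(alignment, bytes, cursor, space);
    assert(aligned != nullptr);

    // Construct each element in place, zero-initialised.
    std::int32_t* data = static_cast<std::int32_t*>(aligned);
    for (std::size_t i = 0; i < count; ++i) {
        new (data + i) std::int32_t(0);
    }
    AlignedInts result = { raw, data };
    return result;
}

inline void freeAlignedInts(AlignedInts& a) {
    delete[] a.raw;   // never delete a.data: it was not returned by new[]
    a.raw = nullptr;
    a.data = nullptr;
}
```

Keeping the raw pointer next to the aligned one is the same discipline as the "DO NOT delete p" comment: the block has to be released through the pointer that new[] handed out.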
Re: [osg-users] [osg-submissions] Matrixf multiply Optimization
MS uses _aligned_malloc (and _aligned_free) and __declspec(align(16)). I think gcc uses something like __attribute__((__aligned__(16))), but I'm not sure whether that's OK for dynamic allocation. Intel's MKL, and others, provide cross-platform aligned mallocs, so we might be able to find something similar. Or just create a new Vec4f / Matrix4f type with an overridden new operator.

David
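One way the "overridden new operator" suggestion might look, as a sketch under assumptions: `AlignedMatrix4f` is a hypothetical type, not an OSG class, and the allocation uses `_aligned_malloc` on Windows and `posix_memalign` elsewhere, which assumes a POSIX platform in the non-Windows branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>
#endif

// Hypothetical matrix type whose heap instances are always 16-byte aligned.
struct alignas(16) AlignedMatrix4f {
    float m[4][4];

    static void* allocate(std::size_t size) {
#if defined(_WIN32)
        void* p = _aligned_malloc(size, 16);
#else
        void* p = nullptr;
        if (posix_memalign(&p, 16, size) != 0) p = nullptr;
#endif
        if (!p) throw std::bad_alloc();
        return p;
    }

    static void release(void* p) noexcept {
#if defined(_WIN32)
        _aligned_free(p);
#else
        free(p);
#endif
    }

    // Class-specific forms: picked up by new/delete expressions on this type.
    static void* operator new(std::size_t size)   { return allocate(size); }
    static void* operator new[](std::size_t size) { return allocate(size); }
    static void operator delete(void* p) noexcept   { release(p); }
    static void operator delete[](void* p) noexcept { release(p); }
};
```

This only covers objects created directly with new. Instances stored inside standard containers go through the container's allocator instead, so those would need an aligned allocator as well.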
Re: [osg-users] [osg-submissions] Matrixf multiply Optimization
Can you not use an alignment #pragma around the struct to force alignment size?

    #pragma pack( push, 16 )
    union
    {
        struct { __m128 _R0,_R1,_R2,_R3; };
        value_type _mat[4][4];
    };
    #pragma pack( pop )

Gordon Tomlinson

-----Original Message-----
From: James Killian
Sent: Saturday, July 26, 2008 7:23 PM
To: OpenSceneGraph Submissions
Subject: Re: [osg-submissions] Matrixf multiply Optimization

That is cool if that is all that needs to be fixed... I'll make a generic version of F32vec4, and include it in the next submission to see if it can build on other platforms.

James Killian

----- Original Message -----
From: David Guthrie
To: OpenSceneGraph Submissions
Sent: Friday, July 25, 2008 9:07 PM
Subject: Re: [osg-submissions] Matrixf multiply Optimization

I looked at the code, and it should work cross-platform, at least for Intel CPUs. The fvec.h header doesn't seem to exist, but from what I can tell, it doesn't have any magic in it. The few types you used may be easy to just replace. They seemed just to be unions, anyway.

David

On Jul 25, 2008, at 5:49 PM, James Killian wrote:

It is good to hold off as this is still a work in progress. In the meantime, what would be cool is for others to code review the work I've checked in thus far. If I recall, the FFmpeg community has found a way to use intrinsics in a platform-independent way; once I get the win32 version polished I may research that.

For anyone interested, the C version of the matrix multiply uses 64 multiplies and adds, while the SSE version uses only 16 of each.

In regards to going in and out of SSE, I tried this:

    union
    {
        struct { __m128 _R0,_R1,_R2,_R3; };
        value_type _mat[4][4];
    };

And this works, as it forces the array to be 16-byte aligned implicitly... unfortunately I ran into problems where some code that was using the matrix in a vector would throw compiler errors saying it can't align it. (I may revisit that case and see why that is.)

What I am hoping will happen is that this new code will work out, and we can gradually transition some of the most used pieces to take advantage of the instruction set (platform independent, of course).

----- Original Message -----
From: Robert Osfield
To: OpenSceneGraph Submissions
Sent: Friday, July 25, 2008 3:09 PM
Subject: Re: [osg-submissions] Matrixf multiply Optimization

Hi James,

I will put this submission on hold till after 2.6 as we are now at feature freeze.

W.r.t. SSE optimizations, in the past I have considered the possibility, but haven't taken the step - there have always been bigger bottlenecks to address. One concern I have is the cost of going in and out of SSE mode. I suspect the most efficient way to do it would be to provide array operators.

I think these types of optimizations would be worth raising on the mailing lists, as there is a lot of knowledge out there and a whole range of topics.

Robert.

On Fri, Jul 25, 2008 at 8:55 PM, James Killian wrote:

Attached are the 3 matrix cpp files that are merged with 8686. For non-win32 platforms there is no change; for win32 platforms I've added SSE optimization for Matrix::mult, premult and postmult. This is currently the first draft, which will yield about 35-40% improvement over matrixf or matrixd. I may pursue alignment strategies which have yielded 50% improvement (this is yet to come). I also may want to look at improving premult.

Our game spends approximately 25% of all processing in these functions (the KBDtree optimization is enabled), so if anyone else is doing the same kind of stresses, hopefully you should see improvement as well. There may be a way to enable intrinsic code across all platforms; if so, we may want to pursue that.

You should be able to drop these files right in and build. (Win32 users, be sure to use matrix float in the cmake configuration.) I did not try to optimize Matrixd, as I don't think intrinsics can offer much improvement for it (yet), so it has not changed.
_______________________________________________
osg-submissions mailing list
[EMAIL PROTECTED]
http://lists.openscenegraph.org/listinfo.cgi/osg-submissions-openscenegraph.org
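For readers following the "64 versus 16 multiplies" remark in the thread, here is a rough sketch of a row-at-a-time SSE multiply. It is not the submitted code, which is not shown in these messages; `Mat4f` and `multiply` are hypothetical names, the layout is assumed to be row-major `float[4][4]`, and the scalar branch exists only so the sketch builds without SSE. Each output row takes 4 packed multiplies and 3 packed adds, giving 16 packed multiplies for the whole product.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MAT_SKETCH_USE_SSE 1
#else
#define MAT_SKETCH_USE_SSE 0
#endif

// Hypothetical 16-byte aligned, row-major 4x4 float matrix.
struct alignas(16) Mat4f {
    float m[4][4];
};

// out = a * b
inline void multiply(const Mat4f& a, const Mat4f& b, Mat4f& out) {
    assert(reinterpret_cast<std::uintptr_t>(&a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0);
#if MAT_SKETCH_USE_SSE
    // Load the four rows of b once; they are reused for every row of a.
    const __m128 b0 = _mm_load_ps(b.m[0]);
    const __m128 b1 = _mm_load_ps(b.m[1]);
    const __m128 b2 = _mm_load_ps(b.m[2]);
    const __m128 b3 = _mm_load_ps(b.m[3]);
    for (int i = 0; i < 4; ++i) {
        // Row i of the result is a linear combination of the rows of b,
        // weighted by the four scalars in row i of a.
        __m128 r = _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_store_ps(out.m[i], r);
    }
#else
    Mat4f tmp;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) sum += a.m[i][k] * b.m[k][j];
            tmp.m[i][j] = sum;
        }
    }
    out = tmp;
#endif
}
```

Because the rows are combined vertically, the sketch needs no horizontal adds or shuffles, which is one reason this formulation is a common starting point for SSE matrix code.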