Re: [osg-users] Using SSE within OSG
I would like to take a moment to show a snapshot of how these optimizations have impacted our game. The data are frames per second: the first column is an average of the lowest readings, the middle column is the overall average, and the right column is an average of the highest readings. We do at least 3 runs to get a good solid average.

Here is the fps without any of the SSE optimizations:

Framerates: (23.3, 41.3, 54.2)
Framerates: (27.5, 41.9, 50.4)
Framerates: (30.6, 41.8, 53.3)
AVERAGE:    (27.1, 41.7, 52.6)

Here is my submission with SSE optimizations:

Framerates: (30.2, 48.7, 58.1)
Framerates: (30.9, 49.6, 60.5)
Framerates: (36.8, 50.0, 60.5)
AVERAGE:    (32.6, 49.4, 59.7)

Here is a combination of my submission and Mathias's submission:

VS 9 (current)
"..\Game Scripts\Miramar_001.lua" -perf 0 60 0 -stats 10 60 VS9_Perf.txt
Framerates: (40.9, 53.2, 65.6)
Framerates: (34.5, 50.3, 60.9)
Framerates: (39.5, 49.9, 63.2)
AVERAGE:    (38.3, 51.1, 63.2)

So basically, in this test, both of our optimizations together have yielded a solid +10 fps on this machine.

___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org
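As a quick arithmetic check on the figures in this post, here is a minimal sketch that recomputes the column-wise averages from the individual runs. The names `Run`, `averageRuns`, `kBaseline` and `kCombined` are made up for this illustration; they are not part of the game or of OSG.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One benchmark run: {low, overall, high} frames per second.
using Run = std::array<double, 3>;

// Column-wise mean over all runs.
Run averageRuns(const std::vector<Run>& runs)
{
    assert(!runs.empty());
    Run sum{0.0, 0.0, 0.0};
    for (const Run& r : runs)
        for (std::size_t i = 0; i < 3; ++i)
            sum[i] += r[i];
    for (double& s : sum)
        s /= static_cast<double>(runs.size());
    return sum;
}

// Figures copied from the post: no SSE optimizations ...
const std::vector<Run> kBaseline = {
    {23.3, 41.3, 54.2}, {27.5, 41.9, 50.4}, {30.6, 41.8, 53.3}};

// ... and both submissions combined.
const std::vector<Run> kCombined = {
    {40.9, 53.2, 65.6}, {34.5, 50.3, 60.9}, {39.5, 49.9, 63.2}};
```

Calling `averageRuns(kBaseline)` and `averageRuns(kCombined)` gives overall averages of about 41.7 and 51.1 fps, a gain of roughly 9.5 fps, which is in line with the "+10 fps" summary.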
Re: [osg-users] Using SSE within OSG
Thanks for posting this link. I'll definitely want to look at this.

James Killian
Re: [osg-users] Using SSE within OSG
Hello,

some days ago I stumbled upon a library: liboil [1]. Maybe some of the routines implemented there could be used for OSG. The library contains various functions (e.g. arithmetic ones) that are optimized for different processor architectures (it uses SSE or AltiVec, for example). Using these functions might be easier than implementing them anew. Functions needed by OSG which are not yet part of liboil might be added to it.

Regards,
Benjamin

[1] http://liboil.freedesktop.org/wiki/
Re: [osg-users] Using SSE within OSG
Thanks for the reply. The delay in review will buy me some time to present the aligned matrixf.

It is true that these maths are not the largest bottleneck even for our game, but they are still significant, especially during heavy use of collision detection. I would like to know if Mathias's submission could be considered now, as 99% of it is a C solution that reduces the number of multiplies needed. It did bring the numbers down in our game too. If so, I would like to write SSE forms of these new functions (e.g. preMultTranslate) for the aligned matrixf and make them run even faster.

I would be interested in pursuing the traversal related methods, but I have a feeling the solution would entail a design change in C rather than an SSE one. However, even if a performance increase is not the top priority on anyone's list, I'd be willing to look into this and see if I can help.

James Killian
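To illustrate why a specialised function such as preMultTranslate saves multiplies, here is a minimal sketch. It assumes the row-vector convention with the translation stored in the last row; `Mat4`, `multiply`, `translation` and `nearlyEqual` are hypothetical stand-ins written for this example, and the `preMultTranslate` shown here is not the code from the patch under discussion.

```cpp
#include <cassert>
#include <cmath>

// Plain row-major 4x4 float matrix; a stand-in for a Matrixf-like class.
struct Mat4
{
    float m[4][4];
};

// General product r = a * b: 64 multiplies.
Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Translation matrix with the translation in the last row.
Mat4 translation(float x, float y, float z)
{
    return Mat4{{{1.0f, 0.0f, 0.0f, 0.0f},
                 {0.0f, 1.0f, 0.0f, 0.0f},
                 {0.0f, 0.0f, 1.0f, 0.0f},
                 {x, y, z, 1.0f}}};
}

// Specialised form of m = translation(x, y, z) * m.
// Rows 0..2 of the product equal rows 0..2 of m, so only the last
// row has to be updated: 12 multiplies instead of 64.
void preMultTranslate(Mat4& m, float x, float y, float z)
{
    const float t[3] = {x, y, z};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 4; ++j)
            m.m[3][j] += t[i] * m.m[i][j];
}

// Element-wise comparison with a small tolerance.
bool nearlyEqual(const Mat4& a, const Mat4& b, float eps = 1e-5f)
{
    assert(eps > 0.0f);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            if (std::fabs(a.m[i][j] - b.m[i][j]) > eps)
                return false;
    return true;
}
```

Comparing `preMultTranslate(m, x, y, z)` against `multiply(translation(x, y, z), m)` with `nearlyEqual` is a simple way to confirm that the shortcut matches the general product before trying an SSE form of it.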
Re: [osg-users] Using SSE within OSG
Hi Guys,

I've read through the correspondence on this issue, but won't dive into reviewing submissions on this topic till well after 2.6.0 is out the door.

As a general note, there seem to be two related topics - data alignment and SSE instructions. They are of course related, but I'd suggest we tackle them separately.

As another general note, in my experience the most common bottleneck of scene graph based applications is CPU memory bandwidth. Maths functions are much less of a bottleneck, and their cost is in fact largely hidden by the cost of waiting for the cache to be filled. The performance profiles provided in this thread suggest this as well - with the traversal related methods being the biggest bottleneck. How to address this bottleneck is a topic for another thread.

Robert.
Re: [osg-users] Using SSE within OSG
James,

The most obvious problem: Group::traverse ...

Is one of the visitors you use a TRAVERSE_ALL_CHILDREN visitor? If so, the Group::traverse profile makes sense. Make sure that you traverse only those subgraphs you need to traverse. You can then minimize those calls too. That will help overall!

Ok, so the PositionAttitudeTransform is the matrix multiplication problem. Try the specialized transform patch I have sent to you. That will help a bit here.

But you might do even better: You are talking about a game. So I expect that you have transform nodes to animate parts of the scenegraph. I agree that you will need the full PositionAttitudeTransform in some cases. But I can well imagine having special transforms in such a game where you can make use of specialized implementations. Specialized with respect to:

* The kind of the transform. Often you just have to rotate around the origin. Nothing more. Or you might have some linear transform to make something move, but no rotation and no scaling. For this case implement your own, say, LinearTransform or RotationTransform nodes derived from osg::Transform and reimplement the computeLocalToWorldMatrix, computeWorldToLocalMatrix and computeBound methods with something more optimized. Maybe use the specialized preMultTranslate or equivalent methods from the patch I sent. You can avoid many matrix multiplications that way.

* Recomputation of the bounding sphere. Sometimes with such special transforms, you do not need to dirty the bounding sphere. Take a rotation. Say you have a leg that can rotate around the knee. Just compute the bounding sphere for all possible rotation values of that rotation. With that you will have slightly worse bounding spheres, but you do not need to walk large scenegraphs to invalidate the bound and you do not need to recompute the bound for large parts of the scene again and again. Take for example a human body with many transform nodes for arms, legs, fingers and so on. Your human body bounding sphere will not be much larger with that kind of bounding sphere compared to the exact case. The interesting cull case is to cull away the *whole* human body, which will happen about as often as with the exact bounding spheres. Translations along an axis, for example, are a bit more difficult in this case since they would blow up the sphere to infinity if you want to catch any translation value. But if you have a translation axis, a maximum scalar value, a minimum scalar value and a current scalar translation value, you can do about the same.

Hope this helps.

Greetings
Mathias

--
Dr. Mathias Fröhlich, science + computing ag, Software Solutions
Hagellocher Weg 71-75, D-72070 Tuebingen, Germany
Phone: +49 7071 9457-268, Fax: +49 7071 9457-511
--
Vorstand/Board of Management: Dr. Bernd Finkbeiner, Dr. Florian Geyer, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: Prof. Dr. Hanns Ruder
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196
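The conservative bounding sphere idea from the second bullet can be sketched as follows. This is only an illustration under stated assumptions: `Vec3`, `Sphere`, `rotationInvariantBound` and `translationRangeBound` are hypothetical names, not OSG classes or methods, and in a real node the result would be returned from a reimplemented computeBound.

```cpp
#include <cassert>
#include <cmath>

struct Vec3
{
    float x, y, z;
};

struct Sphere
{
    Vec3 center;
    float radius;
};

float length(const Vec3& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Sphere that encloses `child` for every possible rotation about `pivot`.
// The child's center always stays at the same distance from the pivot,
// so a sphere around the pivot with that distance plus the child radius
// never needs to be recomputed when the rotation angle changes.
Sphere rotationInvariantBound(const Sphere& child, const Vec3& pivot)
{
    assert(child.radius >= 0.0f);
    const Vec3 d{child.center.x - pivot.x,
                 child.center.y - pivot.y,
                 child.center.z - pivot.z};
    return Sphere{pivot, length(d) + child.radius};
}

// Sphere that encloses `child` for every offset s * axis with s in [sMin, sMax].
// The child's center sweeps a line segment; the bound is centered on the
// segment midpoint and grows by half the segment length.
Sphere translationRangeBound(const Sphere& child, const Vec3& axis,
                             float sMin, float sMax)
{
    assert(sMin <= sMax);
    const float mid = 0.5f * (sMin + sMax);
    const float half = 0.5f * (sMax - sMin);
    const Vec3 c{child.center.x + mid * axis.x,
                 child.center.y + mid * axis.y,
                 child.center.z + mid * axis.z};
    return Sphere{c, child.radius + half * length(axis)};
}
```

Both bounds are looser than the exact ones, which is the trade-off described above: slightly larger spheres in exchange for not having to dirty and recompute bounds up the scenegraph on every animation step.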
Re: [osg-users] Using SSE within OSG
Ok, I thought it was the collision detection, but that is not the case. Here are some of the numbers with collision disabled:

CS:EIP      Symbol + Offset                                              64-bit Timer samples
0x10083cc0  osg::Group::traverse                                         1434
0x10083d60  osg::Group::computeBound                                     1391
0x10099ca0  osg::Matrixf::mult                                           833
0x1001a9d0  osg::PositionAttitudeTransform::accept                       409
0x10099370  osg::Matrixf::preMult                                        407
0x1000e840  osg::AnimationPathCallback::update                           352
0x1009bb50  osg::Node::dirtyBound                                        340
0x100dcee0  osg::Transform::computeBound                                 318
0x100a9df0  osg::PositionAttitudeTransform::computeLocalToWorldMatrix    294
0x100126f0  osg::AnimationPath::getInterpolatedControlPoint              285
0x10009c70  osg::AnimationPathCallback::setPause                         251
0x1000c8e0  osg::StateSet::requiresUpdateTraversal                       228

12 functions, 806 instructions, Total: 6542 samples, 50.85% of samples in the module, 16.36% of total session samples

Ok, here is with collision detection:

CS:EIP      Symbol + Offset                                              64-bit Timer samples
0x10083cc0  osg::Group::traverse                                         1382
0x10083d60  osg::Group::computeBound                                     1237
0x10099ca0  osg::Matrixf::mult                                           924
0x10099370  osg::Matrixf::preMult                                        600
0x1001a9d0  osg::PositionAttitudeTransform::accept                       394
0x1000e840  osg::AnimationPathCallback::update                           292
0x100dcee0  osg::Transform::computeBound                                 284
0x1009bb50  osg::Node::dirtyBound                                        280
0x100126f0  osg::AnimationPath::getInterpolatedControlPoint              274
0x100a9df0  osg::PositionAttitudeTransform::computeLocalToWorldMatrix    230
0x10009c70  osg::AnimationPathCallback::setPause                         225
0x10002e00  osg::Matrixf::preMult                                        210

12 functions, 846 instructions, Total: 6332 samples, 51.35% of samples in the module, 15.83% of total session samples

Here is with both matrixf and invert4x4 optimized:

CS:EIP      Symbol + Offset                                              64-bit Timer samples
0x10083cb0  osg::Group::traverse                                         1362
0x10083d50  osg::Group::computeBound                                     1142
0x1009a180  osg::Matrixf::mult                                           922
0x1001ac70  osg::PositionAttitudeTransform::accept                       381
0x1000e650  osg::AnimationPathCallback::update                           354
0x100dcf30  osg::Transform::computeBound                                 306
0x1009bcf0  osg::Node::dirtyBound                                        274
0x100124f0  osg::AnimationPath::getInterpolatedControlPoint              257
0x1009a340  osg::Matrixf::invert_4x3                                     252
0x10009bb0  osg::GraphicsContext::ScreenIdentifier::~ScreenIdentifier    248
0x100a9b20  osg::PositionAttitudeTransform::computeLocalToWorldMatrix    245
0x10002d00  osg::Matrixf::preMult                                        214
0x10002c70  osg::Matrixf::preMult                                        197
0x1000c6b0  osg::StateSet::requiresUpdateTraversal                       178

14 functions, 829 instructions, Total: 6332 samples, 54.18% of samples in the module, 15.84% of total session samples

For the optimized profile it did push the Invert4x4 way down to the bottom (I did not want to show that here). If you want the complete list let me know and I'll resend it as attachments.

Actually you cannot really use this to see how much better the performance is, because the Matrixf mult is still needed just as much. The actual way to tell would be to show the framerate of the game; however, here is where I can show the optimization:

Average time using the D3DXMATRIX class: 402.54
Average time using the SPMatrix class:   277.69
Average time using the Matrixf class:    297.40
Average time using the ScalarDP class:   400.21
Average time using the DPMatrix class:   1418.11
Average time using the Matrixd class:    471.69

The results above are for postMult, where Matrixf used to be the same as Matrixd. The 277.69 is what it would have been for Matrixf if it was aligned.

Average time using the D3DXMATRIX class: 1035.63
Average time using the SPMatrix class:   365.36
Average time using the Matrixf class:    706.09
Average time using the ScalarDP class:   664.13
Average time using the DPMatrix class:   2052.29
Average time using the Matrixd class:    2125.93

The results above are for Invert4x4, where Matrixf also was the same as Matrixd, and the 365 is what it would have been if the data was aligned.

This stress code is part of matlib2, with a little tweaking to add the osg code into the mix.

James Killian
Re: [osg-users] Using SSE within OSG
Thanks for the reply. We could resolve this argument if any one of the "low level masters" cares to email me offline [EMAIL PROTECTED], but I'd be open to believing that an argument could be made in the context of moving around large amounts of data.

With regard to moving data, SSE/SSE2 is really better suited to code which requires a lot of math, like 3D computations. Perhaps the heart of SSE is the packed multiply and add, where it can do 4 multiplies and 4 adds in one clock cycle (or a half cycle if paired properly). Thus, code which requires heavy math, like many of the OSG matrix computations, could really benefit from it. I would profile cases like this against hand-written assembly, since this is what OSG would care about. I looked at the assembly code produced by VS 9 for the optimized matrixf multiply, and I could not have scheduled it better myself by hand.
Re: [osg-users] Using SSE within OSG
Hi,

I can only go by our low-level masters, and their profiling shows that the hand-rolled asm'ed SSE code is significantly faster than MS VS compiled code.

Obviously this is our experience in our environments, where we are computationally heavy and are moving and editing terabytes of data around in real time.
Re: [osg-users] Using SSE within OSG
" Ok, you can do here much for the collision detection. I expect that you should optimize that algorithmically and gain magnitudes without sse. " You are probably right, but Rick is the OSG guru in the team, and I do not understand osg that well yet as most of my time has been spent developing other aspects of the game. My strength lies with general purpose optimizing code, and so this helps our game tremendously. The least I can do is contribute this to the community and hope others get a boost as well. " So the question is more if such optimizations will bring performance improvements for the usual scenegraph case. " When I get home tonight I'll disable the collision detection and run the profiler again, and post the results, hopefully that data should answer this question. - Original Message - From: "Mathias Fröhlich" <[EMAIL PROTECTED]> To: "OpenSceneGraph Users" Sent: Tuesday, July 29, 2008 10:14 AM Subject: Re: [osg-users] Using SSE within OSG James, On Tuesday 29 July 2008 16:59, James Killian wrote: > Paul asked me the same question a few days ago, and I just realized that we > took that offline so I'll repost here: > One of the things I should add is the actual profile dump, since that shows > a more comprehensive picture. The actual game demo is free to download and > play here: > http://www.fringe-online.com/ > > The current installer of the game does not have my optimization in it yet, > but it should be noted even with the optimization the postmult is still at > the top. The Invert4x4() however got pushed way down to the bottom (which > is great). I'll post my profiles when I get home. > > > -snip- - >- --- > That is a good question, and I believe the answer is collision detection. > I should disable it and run the numbers again to confirm. All ships fire > machine guns at a fast rate, and each bullet that gets close enough to a > bounding box/sphere region has to go through the osg code to get the > precise point where it hit. 
Rick would probably have a better explanation > of this and other factors since he coded the bulk of the collision > detection (and osg integration). Most of my time development in the game > has been spent on the physics and flight dynamics (and now optimization). > > It may turn out that we could find some caching technique to reduce the > collision stress (like the KBDtree), but in the mean time, matrix > optimizations can benefit the whole community if we do them right, and I > would like to make some contribution to the community. Ok, you can do here much for the collision detection. I expect that you should optimize that algorithmically and gain magnitudes without sse. So the question is more if such optimizations will bring performance improovements for the usual scenegraph case. Greetings Mathias -- Dr. Mathias Fröhlich, science + computing ag, Software Solutions Hagellocher Weg 71-75, D-72070 Tuebingen, Germany Phone: +49 7071 9457-268, Fax: +49 7071 9457-511 -- Vorstand/Board of Management: Dr. Bernd Finkbeiner, Dr. Florian Geyer, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Prof. Dr. Hanns Ruder Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 ___ osg-users mailing list osg-users@lists.openscenegraph.org http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org ___ osg-users mailing list osg-users@lists.openscenegraph.org http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org
Re: [osg-users] Using SSE within OSG
Sorry...

I interpreted Gordon's response as follows: MS does a poor job (insert here: with compiling SSE intrinsics), and as a result most of his SSE is asm'ed. The asm'ed approach is where you don't trust the compiler to do the right thing with intrinsics, where it has the flexibility of scheduling and assigning registers etc.

I disagree that "MS does a poor job compiling intrinsic code", and I believe that you should not *ever need to resort to __asm anymore.

*This is not absolute: there was once a rare case where we found a strange anomaly, but it was later solved by making an unintuitive C code change.

> Do you find that MS compilers will produce SSE vectorised code _without_ use of intrinsics or raw __asm?

Ah, this is a tricky question. There is in fact an option in the VS 8 and VS 9 project settings to generate SSE or SSE2 code. What this does is evaluate C code and try to use SSE for it. I was surprised to find that this actually lowered the performance of C code, especially the C code for matrixf. I'm so glad that the project settings for osg do not turn this on, and I'd recommend not using it, but instead writing intrinsics ourselves for the places that need it.

I hope this clears things up.
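For readers who have not used intrinsics, here is a minimal sketch of what "writing intrinsics ourselves" for a 4x4 float multiply can look like. It is not the submitted matrixf code: `AlignedMat4`, `multiplyScalar` and `multiplySSE` are hypothetical names for this example, and it only builds on x86 or x86-64 compilers that provide `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical 16-byte aligned, row-major 4x4 float matrix.
// A stand-in for an aligned Matrixf, not the OSG class.
struct alignas(16) AlignedMat4
{
    float m[4][4];
};

// Plain C++ reference: r = a * b.
void multiplyScalar(const AlignedMat4& a, const AlignedMat4& b, AlignedMat4& r)
{
    AlignedMat4 t{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                t.m[i][j] += a.m[i][k] * b.m[k][j];
    r = t;
}

// SSE version: row i of the result is a linear combination of the
// rows of b, weighted by the four elements of row i of a.
void multiplySSE(const AlignedMat4& a, const AlignedMat4& b, AlignedMat4& r)
{
    // _mm_load_ps and _mm_store_ps require 16-byte aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&r) % 16 == 0);

    const __m128 b0 = _mm_load_ps(b.m[0]);
    const __m128 b1 = _mm_load_ps(b.m[1]);
    const __m128 b2 = _mm_load_ps(b.m[2]);
    const __m128 b3 = _mm_load_ps(b.m[3]);

    for (int i = 0; i < 4; ++i)
    {
        // Broadcast each scalar of row i and do four multiplies at once.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_store_ps(r.m[i], row);
    }
}
```

Register allocation and instruction scheduling are left to the compiler here, which is the difference from the `__asm` approach discussed in this message. Whether this beats the plain loop on a given compiler and CPU has to be measured.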
Re: [osg-users] Using SSE within OSG
James,

> I have to disagree, using VS 7 and up to VS 9.

Just to clarify - what are you disagreeing with? Do you find that MS compilers will produce SSE vectorised code _without_ use of intrinsics or raw __asm?

David
Re: [osg-users] Using SSE within OSG
James,

Ok, you can do much here for the collision detection. I expect that you should optimize that algorithmically and gain magnitudes without SSE.

So the question is more whether such optimizations will bring performance improvements for the usual scenegraph case.

Greetings
Mathias
Re: [osg-users] Using SSE within OSG
I have to disagree, using VS 7 and up to VS 9. It has done a terrific job with the instruction scheduling. We used to use that asm technique back when the P3's MMX was around and we had VS 6. We had one engineer who would use DOS and MASM. Times have changed (we had to let him go). Intrinsics have proved to optimize quite well, as we use the AMD code analyzer to confirm that the U and V pipes remain full due to well-scheduled placement of the instructions.

I should add that we avoid using any MMX instructions like the plague nowadays.

----- Original Message -----
From: "Gordon Tomlinson" <[EMAIL PROTECTED]>
To: "'OpenSceneGraph Users'"
Sent: Tuesday, July 29, 2008 8:56 AM
Subject: Re: [osg-users] Using SSE within OSG

> MS does a very poor job,
>
> I know most of our SSE is asm'ed
>
> _
>
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Spilling
> Sent: Tuesday, July 29, 2008 9:11 AM
> To: OpenSceneGraph Users
> Subject: Re: [osg-users] Using SSE within OSG
>
> Benjamin,
>
> may I suggest that you check the assembler code that the compilers create when compiling the OSG code?
>
> ... g++ with -march=core2 -O3 (see man page for description of parameters) the compiler automatically uses SSE
>
> I don't have much recent Linux/gcc experience, but can certainly attest that the MS compilers don't do a good job of spotting SSE vectorisation possibilities, even when you tell them to optimise with them (and this is from reading the generated assembler). In MS you can insert SSE intrinsics, which still allow the compiler to optimise the execution order and memory/register usage, e.g. based on cycle counts.
>
> I understand (from other sources) that the Intel vectorising compilers are much better at this, naturally.
>
> Perhaps this is then all only a MS/Windows thing?
>
> David
Re: [osg-users] Using SSE within OSG
Hi All,

Regarding question 2: Wouldn't it be possible to dynamically link different versions of the OSG DLLs? There would be two versions of the DLLs, one with the SSE optimizations and one with the straightforward code. Some years ago I saw games that linked different versions of their DLLs depending on the machine the program was run on.

cheers
Sebastian

Dear All,

There's a discussion going on at the moment over in osg-submissions, and it has been raised that this ought to be opened up to the non-submissions community for feedback. Note that the following is my reading of the issues, and certainly doesn't represent the consensus view of the osg-submissions crowd, so feel free to challenge what I'm saying!

*Background*
Several people already use SSE instructions (http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) alongside OSG to obtain speed improvements through parallelising math operations. The general point that has been raised is that under-the-hood, OSG does quite a lot that could benefit from the potential performance boost given by SSE operations. Obvious targets include some of the Vec/Matrix routines, for example. SSE is now sufficiently mainstream that the risk of processor incompatibility is felt to be low.

*Question 1 : Where could the core OSG include SSE?*
Most people follow the sensible approach of profiling to determine their bottlenecks, and then optimising particular methods in order to gain speed-up. This would be a sensible approach to follow here too, as SSEing all methods would probably be a waste of effort. It would therefore be instructive firstly to know if anybody is using SSE with OSG, and where. Secondly, for those who have profiling data and know how much time they spend in Vec/Matrix/whatever methods, it would be useful to know which methods the community considers good targets for SSEing. Any other maths "heavy lifting" going on? (e.g. Intersection testing? Delaunay triangulation? etc.)

*Question 2 : How could the core OSG include SSE?*
SSE code benefits from aligned data. Hence there are several ways in which OSG could include SSE:
a) Provide an aligned Vec4f and aligned Matrix4f class, which support SSE operations. This would appear (to me) to be the least intrusive.
b) Provide branching code within the existing Vec4/Matrix4 methods for detecting whether data is aligned, and performing the correct operations. This would appear to me to be the most user-transparent. Although it would appear to be a performance hit, testing so far on some specific code would support the argument that the speed gains from SSE outweigh the branch cost; more testing needed, I guess.
c) Robert suggested that SSE enabled array operators (e.g. providing a cross-product operator for Vec3Array) might be appropriate and provide the best speed improvement for those who want it. Certainly using SSE on large array type data sets is where one gains the most performance improvement.
This question includes the possibility of linking out to, or pulling source code out of, an external optimised math library. Any other suggestions?

*Question 3 : (possibly the biggest) Should the core OSG include SSE?*
There are several downsides to including SSE. Firstly, x-platform provision of SSE may be tricky due to the way different compilers define aligned data, and how SSE instructions are used within the code. I personally don't have much experience here, so any feedback on x-platform issues is useful. Secondly, the code readability drops, and the "use the source" argument may be trickier when many might not know much SSE.

So - your opinion, experience and suggestions welcome!

David
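To make options (a) and (b) from Question 2 a bit more concrete, here is a minimal sketch. All names in it (`AlignedVec4f`, `isAligned16`, `add4`, `SKETCH_VEC_SSE`) are made up for illustration and are not part of OSG; it assumes a C++11 compiler for `alignas` and falls back to plain C++ when SSE is not detected at compile time.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_VEC_SSE 1
#endif

// Option (a): a hypothetical 16-byte aligned 4-float vector (not an OSG class).
struct alignas(16) AlignedVec4f {
    float v[4];
};

// True if p sits on a 16-byte boundary, which the aligned SSE loads require.
inline bool isAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Option (b): one entry point that branches on alignment at run time.
// out[i] = a[i] + b[i] for i in 0..3.
inline void add4(const float* a, const float* b, float* out) {
    assert(a && b && out);
#ifdef SKETCH_VEC_SSE
    if (isAligned16(a) && isAligned16(b) && isAligned16(out)) {
        // Aligned path: all three pointers are on 16-byte boundaries.
        _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    } else {
        // Unaligned path: same arithmetic, slower loads and stores.
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
#else
    // Plain C++ fallback for builds without SSE.
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}
```

Whether the extra branch in `add4` pays off for a single vector is exactly the kind of thing that needs measuring; the sketch only shows the shape of the two options.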
Re: [osg-users] Using SSE within OSG
Paul asked me the same question a few days ago, and I just realized that we took that offline, so I'll repost here.

One of the things I should add is the actual profile dump, since that shows a more comprehensive picture. The actual game demo is free to download and play here: http://www.fringe-online.com/ The current installer of the game does not have my optimization in it yet, but it should be noted that even with the optimization the postmult is still at the top. The Invert4x4(), however, got pushed way down to the bottom (which is great). I'll post my profiles when I get home.

--- snip ---

That is a good question, and I believe the answer is collision detection. I should disable it and run the numbers again to confirm. All ships fire machine guns at a fast rate, and each bullet that gets close enough to a bounding box/sphere region has to go through the osg code to get the precise point where it hit. Rick would probably have a better explanation of this and other factors, since he coded the bulk of the collision detection (and osg integration). Most of my development time on the game has been spent on the physics and flight dynamics (and now optimization). It may turn out that we could find some caching technique to reduce the collision stress (like the KBDtree), but in the meantime, matrix optimizations can benefit the whole community if we do them right, and I would like to make some contribution to the community.

----- Original Message -----
From: "Paul Melis" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, July 28, 2008 9:05 AM
Subject: [Fwd: Re: [osg-users] [osg-submissions] Matrixf multiply Optimization]

> Hi James,
>
> I noticed your posts on the osg-users list on the Matrix multiplication
> optimizations using SSE.
> You mention "Our game uses approximately 25% of all processing to these
> functions [...]". What on earth takes up so much matrix computing time
> in your game?
>
> Regards,
> Paul

--- snip ---

----- Original Message -----
From: "Mathias Fröhlich" <[EMAIL PROTECTED]>
To: "OpenSceneGraph Users"
Sent: Tuesday, July 29, 2008 9:31 AM
Subject: Re: [osg-users] Using SSE within OSG

Hi,

On Tuesday 29 July 2008 15:18, James Killian wrote:
> I 100% agree with that as that is the first thing I did. For the matrixf
> mult I got 50% improvement with aligned data and 35% with unaligned. For
> the Invert4x4 I got 80% improvement with aligned and 70% with
> unaligned. I've submitted this code in as it was the most time spent in
> the profiles of our game.

I wonder what your scenegraph looks like. Why do you have that many matrix operations? Where are they called from? Why do you need that many inverted matrices?

The invert method also makes me wonder. As far as I can tell, you do not need inverted matrices to do cull and draw. At least not in a magnitude that makes that method appear in profiles. Do you compute intersection tests where you need that inverse? And what kind of matrices are in your code that you really need the full 4x4 inverse? Almost always the cheaper 3x4 variant can be used for usual transforms.

Well, I ask because I get the impression that the real bottleneck - where you can gain much performance - is somewhere else.

Greetings
Mathias

--
Dr. Mathias Fröhlich, science + computing ag, Software Solutions
Hagellocher Weg 71-75, D-72070 Tuebingen, Germany
Phone: +49 7071 9457-268, Fax: +49 7071 9457-511
--
Board of Management: Dr. Bernd Finkbeiner, Dr. Florian Geyer, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Chairman of the Supervisory Board: Prof. Dr. Hanns Ruder
Registered Office: Tuebingen
Registration Court: Stuttgart
Commercial Register No.: HRB 382196
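Mathias's remark about the cheaper 3x4 variant can be illustrated with a small sketch. It assumes a row-major matrix with the row-vector convention (v' = v * M, translation in the last row) and a last column of (0,0,0,1); `Mat4` and `invertAffine` are hypothetical names, not OSG's own matrix class or its invert routines. Only the 3x3 block is inverted, and the translation is then recomputed as -t * R^-1.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical plain 4x4 float matrix, row-major, row-vector convention
// (v' = v * M), so the translation lives in row 3.
struct Mat4 { float m[4][4]; };

// Inverts an affine matrix whose last column is (0,0,0,1).
// Returns false if the upper-left 3x3 block is (nearly) singular.
inline bool invertAffine(const Mat4& in, Mat4& out) {
    assert(&in != &out);  // in-place inversion is not supported by this sketch
    const float (*a)[4] = in.m;

    // Cofactors of the first row of the 3x3 block, and its determinant.
    const float c00 = a[1][1] * a[2][2] - a[1][2] * a[2][1];
    const float c01 = a[1][2] * a[2][0] - a[1][0] * a[2][2];
    const float c02 = a[1][0] * a[2][1] - a[1][1] * a[2][0];
    const float det = a[0][0] * c00 + a[0][1] * c01 + a[0][2] * c02;
    if (std::fabs(det) < 1e-12f) return false;
    const float inv = 1.0f / det;

    // Inverse of the 3x3 block = adjugate / det (adjugate = transposed cofactors).
    out.m[0][0] = c00 * inv;
    out.m[1][0] = c01 * inv;
    out.m[2][0] = c02 * inv;
    out.m[0][1] = (a[0][2] * a[2][1] - a[0][1] * a[2][2]) * inv;
    out.m[1][1] = (a[0][0] * a[2][2] - a[0][2] * a[2][0]) * inv;
    out.m[2][1] = (a[0][1] * a[2][0] - a[0][0] * a[2][1]) * inv;
    out.m[0][2] = (a[0][1] * a[1][2] - a[0][2] * a[1][1]) * inv;
    out.m[1][2] = (a[0][2] * a[1][0] - a[0][0] * a[1][2]) * inv;
    out.m[2][2] = (a[0][0] * a[1][1] - a[0][1] * a[1][0]) * inv;

    // New translation row: -t * R^-1.
    for (int j = 0; j < 3; ++j) {
        out.m[3][j] = -(a[3][0] * out.m[0][j] +
                        a[3][1] * out.m[1][j] +
                        a[3][2] * out.m[2][j]);
        out.m[j][3] = 0.0f;  // last column stays (0,0,0,1)
    }
    out.m[3][3] = 1.0f;
    return true;
}
```

For a matrix that is known to be a pure rotation plus translation, the 3x3 inverse reduces further to a transpose, which is cheaper still.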
Re: [osg-users] Using SSE within OSG
MS does a very poor job; I know most of our SSE is asm'ed.

_
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Spilling
Sent: Tuesday, July 29, 2008 9:11 AM
To: OpenSceneGraph Users
Subject: Re: [osg-users] Using SSE within OSG

Benjamin,

> may I suggest that you check the assembler code that the compilers create
> when compiling the OSG code?
> ... g++ with -march=core2 -O3 (see man page for description
> of parameters) the compiler automatically uses SSE

I don't have much recent Linux/gcc experience, but can certainly attest that the MS compilers don't do a good job of spotting SSE vectorisation possibilities, even when you tell them to optimise with them (and this is from reading the generated assembler). In MS you can insert SSE intrinsics, which still allow the compiler to optimise the execution order and memory/register usage, e.g. based on cycle counts.

I understand (from other sources) that the Intel vectorising compilers are much better at this, naturally.

Perhaps this is then all only a MS/Windows thing?

David
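As an illustration of the intrinsics route David mentions (as opposed to hand-written asm), here is a rough sketch of a 4x4 float matrix multiply. It is not the code submitted to osg-submissions and not OSG's Matrixf; the function names and the `SKETCH_MAT_SSE` macro are invented for the example, and the matrices are assumed to be 16 row-major floats. The compiler stays free to schedule the loads, multiplies and adds as it sees fit.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_MAT_SSE 1
#endif

// Plain C++ reference: r = a * b for row-major 4x4 matrices (16 floats each).
inline void mul4x4_ref(const float* a, const float* b, float* r) {
    assert(a && b && r && r != a && r != b);
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a[4 * i + k] * b[4 * k + j];
            r[4 * i + j] = s;
        }
    }
}

// Intrinsics version: each result row is a linear combination of the rows of b,
// weighted by the four entries of the matching row of a.
inline void mul4x4_sse(const float* a, const float* b, float* r) {
#ifdef SKETCH_MAT_SSE
    assert(a && b && r && r != a && r != b);
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(r + 4 * i, row);
    }
#else
    mul4x4_ref(a, b, r);  // no SSE detected at compile time
#endif
}
```

The unaligned loads and stores keep the sketch safe for any pointer; with data known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could be used instead.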
Re: [osg-users] Using SSE within OSG
Benjamin,

> And please do not get me wrong. I do not want to stop your efforts to
> improve the performance of OSG; far from it!

Not necessarily my efforts - I'm just being the messenger...!

> But putting assembler code into the project decrease the readability and
> serviceability of the code.

Absolutely.

> Furthermore it might be that it does not improve the speed at all.

I agree, and this is an oft quoted issue. Here, I think, only testing (and experience) will help. For example, is it worth performing a single Vec3f cross product in SSE? Probably not. But as a counter example, over on osg-submissions (EDIT - and now here), one user (James) is getting large performance gains from having SSE'd the invert_4x4 function.

> I just want to suggest that you try to exhaust the possibility of modern
> compilers as much as possible. If you see any bottlenecks after that, it
> might make sense to include manual performance tuning.

I agree. This call-for-ideas was motivated by an understanding that several people are pushing in the same direction, and it would be perhaps beneficial to make use of this push.

David
Re: [osg-users] Using SSE within OSG
> I heard that the Intel C++ compiler is able to optimize even better.
> Furthermore the use of profiling first is a good approach. Maybe it would
> be reasonable to compare profiling data of the Math/Vector/Matrix classes
> with and without compiler optimizations and see if some bottlenecks
> disappear when using the optimizations.

I 100% agree with that, as that is the first thing I did. For the matrixf mult I got 50% improvement with aligned data and 35% with unaligned. For the Invert4x4 I got 80% improvement with aligned and 70% with unaligned. I've submitted this code as it was where the most time was spent in the profiles of our game.

While I am here: I think whatever we do, we should have CMake offer an option to compile using SSE, and provide alternative C code for those who do not want it. Actually, one of the techniques we used at work, back when SSE2 was only available on some machines, was to write an SSE2 main loop that does the bulk of the work and a remainder loop that finishes the work in C code. We could then macro out the main loop for those who didn't have SSE2, so everything fell through to the remainder code, which then did the entire loop. I believe the time has passed to make the SSE and SSE2 distinction, so either someone can support SSE2, or they use the C code version.

It should be implied that people who write SSE/SSE2 have tested against the C code and have seen a significant gain in performance before considering using it.

James Killian

----- Original Message -----
From: "Benjamin Eikel" <[EMAIL PROTECTED]>
To: "OpenSceneGraph Users"
Sent: Tuesday, July 29, 2008 7:28 AM
Subject: Re: [osg-users] Using SSE within OSG

On Tuesday, 29 July 2008 14:04:59, David Spilling wrote:
> Dear All,
> [...]
> Any other suggestions?
>
> *Question 3 : (possibly the biggest) Should the core OSG include SSE?*
> There are several downsides to including SSE. Firstly, x-platform provision
> of SSE may be tricky due to the way different compilers define aligned
> data, and how SSE instructions are used within the code. I personally don't
> have much experience here, so any feedback on x-platform issues is useful.
>
> Secondly, the code readability drops, and the "use the source" argument may
> be trickier when many might not know much SSE.

Hello David,

may I suggest that you check the assembler code that the compilers create when compiling the OSG code? I have not done it for the OSG code, but I did for another project some time ago. There I tried to optimize the performance of composing depth-buffer attached images for sort-last rendering. Somehow I was not able to do much better than the compiler. In some rare cases my procedures were faster, but most of the time the compiler was the winner. The code created by the compilers considers so many things - e.g. branch prediction by the processor, code reordering - that it is quite hard for a human programmer to beat them.

For example, if you use g++ with -march=core2 -O3 (see man page for description of parameters) the compiler automatically uses SSE or even SSE2, 3dNOW!, etc. instructions. In some cases the compiler generates much better assembler code than a normal programmer would. There are some cases, though, where manual improvements could yield better results.

I heard that the Intel C++ compiler is able to optimize even better. Furthermore the use of profiling first is a good approach. Maybe it would be reasonable to compare profiling data of the Math/Vector/Matrix classes with and without compiler optimizations and see if some bottlenecks disappear when using the optimizations.

Regards,
Benjamin

> So - your opinion, experience and suggestions welcome!
>
> David
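The main-loop/remainder-loop technique James describes could look roughly like the sketch below. The function `scaleAdd` and the macros `SKETCH_NO_SSE` / `SKETCH_LOOP_SSE` are invented for this example (James's actual code is not shown in the thread), and it uses the float SSE intrinsics rather than SSE2. Defining `SKETCH_NO_SSE`, for instance from a build option, compiles the main loop out so that the plain C++ remainder loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#if !defined(SKETCH_NO_SSE) && \
    (defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1))
#include <xmmintrin.h>
#define SKETCH_LOOP_SSE 1
#endif

// out[i] = a[i] * s + b[i] for n floats.
inline void scaleAdd(const float* a, const float* b, float s,
                     float* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;

#ifdef SKETCH_LOOP_SSE
    // Main loop: four floats per iteration.
    const __m128 vs = _mm_set1_ps(s);
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
    }
#endif

    // Remainder loop: finishes the last 0..3 elements, or does the
    // entire array when the main loop has been compiled out.
    for (; i < n; ++i) out[i] = a[i] * s + b[i];
}
```

A nice property of this layout is that the C++ loop is always compiled and exercised, so the fallback path cannot silently rot.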
Re: [osg-users] Using SSE within OSG
On Tuesday, 29 July 2008 14:28:18, Benjamin Eikel wrote:
> [...]

Hello,

I have an addition: With gcc/g++ you can use profiling (option -fprofile-generate) to help the compiler do better optimizations (option -fprofile-use, e.g. loop unrolling). Maybe this can improve the performance further. If you want performance and are willing to sacrifice safety and precision for it, you may even think about -ffast-math (may be dangerous). The options are explained on the gcc/g++ man page or in the online manual [1]. There may be similar options for other compilers.

And please do not get me wrong. I do not want to stop your efforts to improve the performance of OSG; far from it! But putting assembler code into the project decreases the readability and serviceability of the code. Furthermore it might be that it does not improve the speed at all. I just want to suggest that you try to exhaust the possibilities of modern compilers as much as possible. If you see any bottlenecks after that, it might make sense to include manual performance tuning.

Regards,
Benjamin

[1] http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Optimize-Options.html#Optimize-Options
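To try the compiler-only route Benjamin suggests, one could start from a plain loop like the sketch below and look at what the compiler makes of it. The file name `cross.cpp`, the `Vec3` struct and `crossArrays` are placeholders, not OSG code; the g++ options in the comments are the ones named in this thread plus the standard `-S` switch for emitting assembler. Whether this particular loop actually gets vectorised depends on the compiler and version, which is the point of checking.

```cpp
// Possible build steps (assuming this file is saved as cross.cpp and has a main()):
//   g++ -O3 -march=core2 -S cross.cpp
//       -> writes cross.s; look for packed SSE instructions such as mulps/subps
//   g++ -O3 -march=core2 -fprofile-generate cross.cpp -o cross && ./cross
//       -> instrumented run that records profile data
//   g++ -O3 -march=core2 -fprofile-use cross.cpp -o cross
//       -> rebuild using the recorded profile
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 3-float vector; not osg::Vec3f.
struct Vec3 { float x, y, z; };

// out[i] = a[i] x b[i] for n vectors, written as plain C++ so the
// optimiser is free to vectorise it. out must not alias a or b.
inline void crossArrays(const Vec3* a, const Vec3* b, Vec3* out, std::size_t n) {
    assert(a && b && out);
    for (std::size_t i = 0; i < n; ++i) {
        out[i].x = a[i].y * b[i].z - a[i].z * b[i].y;
        out[i].y = a[i].z * b[i].x - a[i].x * b[i].z;
        out[i].z = a[i].x * b[i].y - a[i].y * b[i].x;
    }
}
```

Comparing timings of such a loop with and without the options above, before writing any intrinsics, gives a baseline that hand-tuned code then has to beat.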
Re: [osg-users] Using SSE within OSG
Hi David,

My company makes very heavy use of SSE in our main products, and there are vast speed improvements to be gained. Sadly, I don't have permission to provide profiling data.

We use SSE for heavy matrix work outside of OSG, and we have added some to our OSG/OGL apps, such as for normal generation, fast square root routines and texture operations. The clock cycles saved can mount up quickly.

I would say adding SSE operations in the right places would be highly beneficial for the OSG core in terms of performance gains.

_
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Spilling
Sent: Tuesday, July 29, 2008 8:05 AM
To: OpenSceneGraph Users
Subject: [osg-users] Using SSE within OSG

Dear All,
[...]
[osg-users] Using SSE within OSG
Dear All,

There's a discussion going on at the moment over in osg-submissions, and it has been raised that this ought to be opened up to the non-submissions community for feedback. Note that the following is my reading of the issues, and certainly doesn't represent the consensus view of the osg-submissions crowd, so feel free to challenge what I'm saying!

*Background*
Several people already use SSE instructions (http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) alongside OSG to obtain speed improvements through parallelising math operations. The general point that has been raised is that under-the-hood, OSG does quite a lot that could benefit from the potential performance boost given by SSE operations. Obvious targets include some of the Vec/Matrix routines, for example. SSE is now sufficiently mainstream that the risk of processor incompatibility is felt to be low.

*Question 1 : Where could the core OSG include SSE?*
Most people follow the sensible approach of profiling to determine their bottlenecks, and then optimising particular methods in order to gain speed-up. This would be a sensible approach to follow here too, as SSEing all methods would probably be a waste of effort. It would therefore be instructive firstly to know if anybody is using SSE with OSG, and where. Secondly, for those who have profiling data and know how much time they spend in Vec/Matrix/whatever methods, it would be useful to know which methods the community considers good targets for SSEing. Any other maths "heavy lifting" going on? (e.g. Intersection testing? Delaunay triangulation? etc.)

*Question 2 : How could the core OSG include SSE?*
SSE code benefits from aligned data. Hence there are several ways in which OSG could include SSE:
a) Provide an aligned Vec4f and aligned Matrix4f class, which support SSE operations. This would appear (to me) to be the least intrusive.
b) Provide branching code within the existing Vec4/Matrix4 methods for detecting whether data is aligned, and performing the correct operations. This would appear to me to be the most user-transparent. Although it would appear to be a performance hit, testing so far on some specific code would support the argument that the speed gains from SSE outweigh the branch cost; more testing needed, I guess.
c) Robert suggested that SSE enabled array operators (e.g. providing a cross-product operator for Vec3Array) might be appropriate and provide the best speed improvement for those who want it. Certainly using SSE on large array type data sets is where one gains the most performance improvement.
This question includes the possibility of linking out to, or pulling source code out of, an external optimised math library. Any other suggestions?

*Question 3 : (possibly the biggest) Should the core OSG include SSE?*
There are several downsides to including SSE. Firstly, x-platform provision of SSE may be tricky due to the way different compilers define aligned data, and how SSE instructions are used within the code. I personally don't have much experience here, so any feedback on x-platform issues is useful. Secondly, the code readability drops, and the "use the source" argument may be trickier when many might not know much SSE.

So - your opinion, experience and suggestions welcome!

David