Re: [osg-users] Using SSE within OSG

2008-08-05 Thread James Killian


Thanks for posting this link.  I'll definitely want to look at this.

James Killian
- Original Message - 
From: Benjamin Eikel [EMAIL PROTECTED]

To: OpenSceneGraph Users osg-users@lists.openscenegraph.org
Sent: Tuesday, August 05, 2008 3:11 AM
Subject: Re: [osg-users] Using SSE within OSG



Hello,

some days ago I stumbled upon a library: liboil [1]. Maybe some of the
routines implemented there could be used for OSG.
The library contains different functions (e.g. arithmetic ones) that are
optimized for different processor architectures (it uses SSE or AltiVec,
for example). Maybe using these functions would be easier than implementing
them anew. Functions needed by OSG which are not yet part of liboil might be
added to it.

Regards,
Benjamin

[1] http://liboil.freedesktop.org/wiki/
___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org





Re: [osg-users] Using SSE within OSG

2008-08-05 Thread James Killian

I would like to take a moment to show a snapshot of how these optimizations
have affected our game.  The data are frames per second: the first column is
the average of the lowest readings, the middle column is the overall average,
and the right column is the average of the highest readings.
We do at least 3 runs to get a good solid average.

Here is the fps without any of the SSE optimizations:
Framerates: (23.3, 41.3, 54.2)
Framerates: (27.5, 41.9, 50.4)
Framerates: (30.6, 41.8, 53.3)
AVERAGE:(27.1, 41.7, 52.6)


Here are the results with my SSE optimization submission:
Framerates: (30.2, 48.7, 58.1)
Framerates: (30.9, 49.6, 60.5)
Framerates: (36.8, 50.0, 60.5)
AVERAGE:(32.6, 49.4, 59.7)

Here is a combination of my submission and Mathias's submission:
VS 9 (current) ..\Game Scripts\Miramar_001.lua -perf 0 60 0 -stats 10 60
VS9_Perf.txt
Framerates: (40.9, 53.2, 65.6)
Framerates: (34.5, 50.3, 60.9)
Framerates: (39.5, 49.9, 63.2)
AVERAGE:(38.3, 51.1, 63.2)


So in this test, both of our optimizations together have yielded a solid gain
of roughly +10 fps on this machine (overall average 41.7 fps to 51.1 fps).
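
The AVERAGE rows above are plain column-wise means of the three runs. A minimal sketch of that arithmetic, assuming each run is stored as a (lowest, overall, highest) triple; the names `Run` and `averageRuns` are made up for illustration and are not part of the game's benchmark tooling:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One benchmark run: (lowest, overall, highest) frame rate in fps.
using Run = std::array<double, 3>;

// Column-wise mean of several runs.
Run averageRuns(const std::vector<Run>& runs)
{
    assert(!runs.empty());
    Run mean = {0.0, 0.0, 0.0};
    for (const Run& run : runs)
        for (std::size_t i = 0; i < mean.size(); ++i)
            mean[i] += run[i];
    for (double& value : mean)
        value /= static_cast<double>(runs.size());
    return mean;
}
```

Feeding in the three baseline runs gives about (27.1, 41.7, 52.6), and the three combined runs give about (38.3, 51.1, 63.2), so the overall-average column moves by a little under 10 fps.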



Re: [osg-users] Using SSE within OSG

2008-08-04 Thread James Killian


Thanks for the reply.  The delay in review will buy me some time to present
the aligned Matrixf.  It is true that these maths functions are not the
largest bottleneck even for our game, but they are still significant,
especially during heavy use of collision detection.  I would like to know if
Mathias's submission would be considered now, as 99% of it is a C solution
that reduces the number of multiplies needed.  It did bring the numbers down
in our game too.  If so, I would like to add SSE forms of these new functions
(e.g. preMultTranslate) to the aligned Matrixf and make them run even faster.
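
To make the "fewer multiplies" point concrete, here is a minimal sketch of what a specialized pre-multiply by a translation can look like. It is not the submitted patch: `Mat4` and the helper names are hypothetical, and the layout assumed is row-major with the row-vector convention (v' = v * M), so the translation sits in the last row.

```cpp
#include <cassert>

// Hypothetical row-major 4x4 matrix, row-vector convention (v' = v * M).
struct Mat4 {
    float m[4][4];
};

Mat4 makeTranslation(float x, float y, float z)
{
    Mat4 t = {{{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {x, y, z, 1}}};
    return t;
}

// General product a * b: 64 multiplies.
Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            r.m[i][j] = 0.0f;
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
        }
    return r;
}

// Specialized mat = T(x, y, z) * mat. Only the last row changes,
// so 12 multiplies are enough.
void preMultTranslate(Mat4& mat, float x, float y, float z)
{
    const float t[3] = {x, y, z};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 4; ++j)
            mat.m[3][j] += t[i] * mat.m[i][j];
}

// Checks that the shortcut matches the full product. Exact comparison is
// only safe for small integer-valued entries; use a tolerance otherwise.
void checkAgainstFullMultiply(const Mat4& mat, float x, float y, float z)
{
    Mat4 fast = mat;
    preMultTranslate(fast, x, y, z);
    const Mat4 full = multiply(makeTranslation(x, y, z), mat);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            assert(fast.m[i][j] == full.m[i][j]);
}
```

The inner loop over `j` touches four contiguous floats per row, which is the shape an SSE version of such a function would presumably exploit.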


I would be interested in pursuing the traversal related methods, but I have
a feeling the solution would entail a design solution in C and not an SSE
one. However, if this performance work is not at the top of anyone else's
list, I'd be willing to look into it and see if I can help.




James Killian
- Original Message - 
From: Robert Osfield

To: OpenSceneGraph Users osg-users@lists.openscenegraph.org
Sent: Sunday, Aug 3, 2008 05:58 AM
Subject: Re: [osg-users] Using SSE within OSG






Re: [osg-users] Using SSE within OSG

2008-08-03 Thread Robert Osfield
Hi Guys,

I've read through the correspondence on this issue, but won't dive in
with reviewing submissions on this topic till well after 2.6.0 is out
the door.

As a general note, there seem to be two related topics - data
alignment and then SSE instructions. They are of course related, but
I'd suggest we tackle them separately.

As another general note, in my experience the most common bottleneck
of scene graph based applications is CPU memory bandwidth;
maths functions are much less of a bottleneck, and their cost is in fact
largely hidden by the cost of waiting for the cache to be filled.  The
performance profiles provided in this thread suggest this as well,
with the traversal related methods being the biggest bottleneck.  How
to address this bottleneck is a topic for another thread.

Robert.


Re: [osg-users] Using SSE within OSG

2008-07-30 Thread Mathias Fröhlich

James,

The most obvious problem: Group::traverse ...
Is one of the visitors you use a TRAVERSE_ALL_CHILDREN visitor? If so, the
Group::traverse profile makes sense. Make sure that you traverse only the
subgraphs you need to traverse. You can then minimize those calls too. That
will help overall!

Ok, so the PositionAttitudeTransform is the matrix multiplication problem.
Try that specialized transform patch I have sent to you.
That will help a bit here. But you might do even better:

You are talking about a game. So I expect that you have transform nodes to 
animate parts of the scenegraph.
I agree that you will need the full PositionAttitudeTransform in some cases. 

But I can well imagine special transforms in such a game where you can
make use of specialized implementations.
Specialized with respect to:

* The kind of transform.
Often you just have to rotate around the origin. Nothing more. Or you might
have some linear motion that makes something move, with no rotation and no
scaling.
For this case, implement your own, say, LinearTransform or RotationTransform
nodes derived from osg::Transform and reimplement the
computeLocalToWorldMatrix, computeWorldToLocalMatrix and computeBound methods
with something more optimized. Maybe use the specialized preMultTranslate or
equivalent methods from the patch I sent. You can avoid many matrix
multiplications that way.

* Recomputation of the bounding sphere.
Sometimes with such special transforms, you do not need to dirty the bounding
sphere.
Take a rotation. Say you have a leg that can rotate around the knee. Just
compute the bounding sphere for all possible values of that rotation. With
that you will have slightly worse bounding spheres, but you do not need to
walk large scenegraphs to invalidate the bound, and you do not need to
recompute the bound for large parts of the scene again and again.
Take, for example, a human body with many transform nodes for arms, legs,
fingers and so on. Your human body bounding sphere will not be much larger
with that kind of bounding volume compared to the exact case. The
interesting cull case is to cull away the *whole* human body, which will
happen about as often as with the exact bounding spheres.
Translations along an axis, for example, are a bit more difficult in this
case, since they would blow up the sphere to infinity if you wanted to cover
any translation value. But if you have a translation axis, a maximum scalar
value, a minimum scalar value and a current scalar translation value, you can
do about the same.
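
The two bounding-sphere cases above can be sketched as follows. This is a standalone illustration under stated assumptions, not OSG code: `Vec3`, `Sphere` and the function names are hypothetical, the rotation case assumes the pivot point is fixed, and the translation case assumes the offset is t * axis with t limited to [tMin, tMax].

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Sphere { Vec3 center; double radius; };

double dist(const Vec3& a, const Vec3& b)
{
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// Bound that stays valid for every rotation of `child` about `pivot`:
// the child's center always keeps the same distance to the pivot.
Sphere boundForAnyRotation(const Sphere& child, const Vec3& pivot)
{
    return Sphere{pivot, dist(child.center, pivot) + child.radius};
}

// Bound that stays valid for every offset t * axis with tMin <= t <= tMax.
Sphere boundForTranslationRange(const Sphere& child, const Vec3& axis,
                                double tMin, double tMax)
{
    assert(tMin <= tMax);
    const double mid = 0.5 * (tMin + tMax);
    const double axisLength = dist(axis, Vec3{0.0, 0.0, 0.0});
    const Vec3 center = {child.center.x + mid * axis.x,
                         child.center.y + mid * axis.y,
                         child.center.z + mid * axis.z};
    return Sphere{center, child.radius + 0.5 * (tMax - tMin) * axisLength};
}

// True when `inner` lies completely inside `outer` (small tolerance).
bool contains(const Sphere& outer, const Sphere& inner)
{
    return dist(outer.center, inner.center) + inner.radius
           <= outer.radius + 1e-9;
}
```

A transform node that returns such a precomputed sphere from its bound computation would never have to dirty the bound when only the angle or the scalar offset changes, at the price of a slightly larger sphere.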

Hope this helps.

Greetings

Mathias

-- 
Dr. Mathias Fröhlich, science + computing ag, Software Solutions
Hagellocher Weg 71-75, D-72070 Tuebingen, Germany
Phone: +49 7071 9457-268, Fax: +49 7071 9457-511
-- 
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Florian Geyer,
Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Prof. Dr. Hanns Ruder
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196 




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Gordon Tomlinson
Hi David
 
My company makes very heavy use of SSE in our main products, and there are
vast speed improvements to be gained. Sadly, I don't have permission to
provide profiling data.

We use SSE for very heavy matrix work outside of OSG, and we have added some
to our OSG/OGL apps, for example for normal generation, fast square root
routines and texture operations. The clock cycles saved can mount up quickly.

I would say adding SSE operations in the right places would be highly
beneficial for the OSG core in performance gains.
 
 

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of David
Spilling
Sent: Tuesday, July 29, 2008 8:05 AM
To: OpenSceneGraph Users
Subject: [osg-users] Using SSE within OSG


Dear All,

There's a discussion going on at the moment over in osg-submissions, and it
has been raised that this ought to be opened up to the non-submissions
community for feedback. Note that the following is my reading of the issues,
and certainly doesn't represent the consensus view of the osg-submissions
crowd, so feel free to challenge what I'm saying!

Background
Several people already use SSE instructions
(http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) alongside OSG to
obtain speed improvements through parallelising math operations. The general
point that has been raised is that under-the-hood, OSG does quite a lot that
could benefit from the potential performance boost given by SSE operations.
Obvious targets include some of the Vec/Matrix routines, for example. SSE is
now sufficiently mainstream that the risk of processor incompatibility is
felt to be low.

Question 1 : Where could the core OSG include SSE?
Most people follow the sensible approach of profiling to determine their
bottlenecks, and then optimising particular methods in order to gain
speed-up. This would be a sensible approach to follow, as SSEing all methods
would probably be a waste of effort.  It would therefore be instructive
firstly to know if anybody is using SSE with OSG, and where. Secondly, for
those who have profiling data and know how much time they spend in
Vec/Matrix/whatever methods, it would be useful to know which methods the
community considered good targets for SSEing. Is any other heavy maths
lifting going on (e.g. intersection testing, Delaunay triangulation, etc.)?

Question 2 : How could the core OSG include SSE?
SSE code benefits from aligned data.  Hence there are several ways in which
OSG could include SSE:

a) Provide an aligned Vec4f and aligned Matrix4f class, which support SSE
operations. This would appear (to me) to be the least intrusive.

b) Provide branching code within the existing Vec4/Matrix4 methods for
detecting whether data is aligned, and performing the correct operations.
This would appear to me to be the most user-transparent. Although it would
appear to be a performance hit, testing so far on some specific code would
support the argument that the speed gains from SSE outweigh the branch cost;
more testing needed, I guess.

c) Robert suggested that SSE enabled array operators (e.g. providing a
cross-product operator for Vec3Array) might be appropriate and provide the
best speed improvement for those who want it. Certainly using SSE on large
array type data sets is where one gains the most performance improvement.

This question includes the possibility of linking out to, or pulling source
code out of, an external optimised math library.
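
Options (a) and (b) can be sketched side by side. This is a minimal illustration and not a proposed OSG interface: `AlignedVec4f`, `add4`, `isAligned16` and the `SKETCH_HAVE_SSE` macro are hypothetical names, and a plain C++ loop is used when SSE is not available at compile time.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Option (a): a hypothetical 16-byte aligned 4-float vector.
struct alignas(16) AlignedVec4f {
    float v[4];
};

inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Option (b): one entry point that branches on alignment at run time.
void add4(const float* a, const float* b, float* out)
{
#if SKETCH_HAVE_SSE
    if (isAligned16(a) && isAligned16(b) && isAligned16(out)) {
        _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    } else {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}

// With the aligned type no branch is needed: the type guarantees alignment
// for objects with automatic or static storage.
AlignedVec4f operator+(const AlignedVec4f& a, const AlignedVec4f& b)
{
    AlignedVec4f r;
    assert(isAligned16(&a) && isAligned16(&b) && isAligned16(&r));
#if SKETCH_HAVE_SSE
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
#else
    add4(a.v, b.v, r.v);
#endif
    return r;
}
```

One caveat with option (a): before C++17, `new` is not required to honour over-aligned types, so heap-allocated arrays of such a class may need an aligned allocator.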

Any other suggestions?

Question 3 : (possibly the biggest) Should the core OSG include SSE?
There are several downsides to including SSE. Firstly, cross-platform
provision of SSE may be tricky due to the way different compilers define
aligned data, and how SSE instructions are used within the code. I personally
don't have much experience here, so any feedback on cross-platform issues is
useful.

Secondly, the code readability drops, and the "use the source" argument may
be trickier when many might not know much SSE.


So - your opinion, experience and suggestions welcome!

David









Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Benjamin Eikel
On Tuesday, 29 July 2008 at 14:04:59, David Spilling wrote:
 Dear All,
[...]
 Any other suggestions?

 *Question 3 : (possibly the biggest) Should the core OSG include SSE?*
 There are several downsides to including SSE. Firstly, cross-platform
 provision of SSE may be tricky due to the way different compilers define
 aligned data, and how SSE instructions are used within the code. I
 personally don't have much experience here, so any feedback on
 cross-platform issues is useful.

 Secondly, the code readability drops, and the "use the source" argument may
 be trickier when many might not know much SSE.
Hello David,

may I suggest that you check the assembler code that the compilers create when
compiling the OSG code? I have not done it for the OSG code, but I did for
another project some time ago. There I tried to optimize the performance of
compositing images with attached depth buffers for sort-last rendering.
Somehow I was not able to do much better than the compiler. In some rare cases
my procedures were faster, but most of the time the compiler was the winner.
The code created by compilers takes so many things into account (e.g. branch
prediction by the processor, code reordering) that it is quite hard for a
human programmer to beat them.
For example, if you use g++ with -march=core2 -O3 (see the man page for a
description of the parameters), the compiler can automatically use SSE, SSE2,
SSE3, etc. instructions (3DNow! only applies when targeting AMD processors).
In some cases the compiler generates much better assembler code than a normal
programmer would. There are some cases, though, where manual improvements
could yield better results.
I heard that the Intel C++ compiler is able to optimize even better.
Furthermore, profiling first is a good approach. Maybe it would be
reasonable to compare profiling data of the Math/Vector/Matrix classes with
and without compiler optimizations and see if some bottlenecks disappear when
using the optimizations.
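
A small way to try this out, assuming a hypothetical file `saxpy.cpp` containing the loop below: compile it with `g++ -O3 -march=core2 -S saxpy.cpp` and look in the generated `saxpy.s` for packed instructions such as `mulps` and `addps`. Whether the loop is actually vectorized depends on the compiler version and flags, so treat this as an experiment rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>

// y[i] += a * x[i] for i in [0, n). A simple, dependency-free loop of the
// kind compilers can often vectorize on their own.
void scaleAndAdd(float a, const float* x, float* y, std::size_t n)
{
    assert((x != nullptr && y != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Comparing the assembler output at `-O2` and `-O3` for the same function is a cheap first check before writing any SSE by hand.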

Regards,
Benjamin


 So - your opinion, experience and suggestions welcome!

 David




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Benjamin Eikel
On Tuesday, 29 July 2008 at 14:28:18, Benjamin Eikel wrote:
 [...]
Hello,

I have an addition:
With gcc/g++ you can use profiling (option -fprofile-generate) to help the
compiler do better optimizations (option -fprofile-use, e.g. loop
unrolling). Maybe this can improve the performance further.
If you want performance and are willing to sacrifice safety and precision for
it, you may even think about -ffast-math (may be dangerous).
The options are explained on the gcc/g++ man page or in the online manual [1].
There may be similar options for other compilers.
And please do not get me wrong. I do not want to stop your efforts to improve
the performance of OSG; far from it! But putting assembler code into the
project decreases the readability and serviceability of the code. Furthermore,
it might not improve the speed at all. I just want to suggest that you try to
exhaust the possibilities of modern compilers as much as possible. If you
still see bottlenecks after that, it might make sense to include manual
performance tuning.
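
One reason -ffast-math can be dangerous is that it lets the compiler reorder floating-point additions, and floating-point addition is not associative. A minimal sketch of the effect, compiled without -ffast-math so that both orders are evaluated exactly as written (the function names are made up for this example):

```cpp
#include <cassert>

// Same three operands, two different evaluation orders.
float sumLeftFirst(float a, float b, float c)  { return (a + b) + c; }
float sumRightFirst(float a, float b, float c) { return a + (b + c); }

// With a = 1e20, b = -1e20, c = 1 the two orders disagree:
// (a + b) + c keeps the 1, while a + (b + c) loses it because
// 1 is far below the rounding step of a float near 1e20.
void checkReassociationChangesResult()
{
    const float big = 1.0e20f;
    assert(sumLeftFirst(big, -big, 1.0f) == 1.0f);
    assert(sumRightFirst(big, -big, 1.0f) == 0.0f);
}
```

Code that relies on a particular summation order, for example accumulating many small terms into a large total, is the kind that deserves a second look before such a flag is enabled.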

Regards,
Benjamin

[1] 
http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Optimize-Options.html#Optimize-Options




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread David Spilling
Benjamin,


 may I suggest that you check the assembler code that the compilers create
 when
 compiling the OSG code?



 ... g++ with -march=core2 -O3 (see man page for description
 of parameters) the compiler automatically uses SSE


I don't have much recent Linux/gcc experience, but can certainly attest that
the MS compilers don't do a good job of spotting SSE vectorisation
possibilities, even when you tell them to optimise with them (and this is
from reading the generated assembler). With the MS compilers you can insert
SSE intrinsics, which still allow the compiler to optimise the execution
order and memory/register usage, e.g. based on cycle counts.
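
As an illustration of what intrinsics look like in practice, here is a minimal sketch of a 4-component vector times a 4x4 matrix. The function name and the `SKETCH_HAVE_SSE` macro are hypothetical, the matrix is assumed to be row-major with the row-vector convention (out = v * m), and unaligned loads are used so that no alignment is assumed.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// out = v * m, where m holds 16 floats in row-major order.
void transformVec4(const float* v, const float* m, float* out)
{
    assert(v != nullptr && m != nullptr && out != nullptr);
#if SKETCH_HAVE_SSE
    // Broadcast each component of v and scale the matching matrix row.
    __m128 r = _mm_mul_ps(_mm_set1_ps(v[0]), _mm_loadu_ps(m + 0));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[1]), _mm_loadu_ps(m + 4)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[2]), _mm_loadu_ps(m + 8)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[3]), _mm_loadu_ps(m + 12)));
    _mm_storeu_ps(out, r);
#else
    // Plain C++ fallback with the same result.
    for (int j = 0; j < 4; ++j)
        out[j] = v[0] * m[j] + v[1] * m[4 + j] + v[2] * m[8 + j] + v[3] * m[12 + j];
#endif
}
```

Because these are ordinary function-like calls rather than an inline assembler block, the compiler remains free to schedule them and to allocate registers.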

I understand (from other sources) that the Intel vectorising compilers are
much better at this, naturally.

Perhaps this is then all only a MS/Windows thing?

David


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

I heard that the Intel C++ compiler is able to optimize even better.
Furthermore the use of profiling first is a good approach. Maybe it would be
reasonable to compare profiling data of the Math/Vector/Matrix classes with
and without compiler optimizations and see if some bottlenecks disappear when
using the optimizations.


I 100% agree with that, as that is the first thing I did.  For the Matrixf
mult I got a 50% improvement with aligned data and 35% with unaligned.  For
Invert4x4 I got an 80% improvement with aligned data and 70% with unaligned.
I've submitted this code, as these functions took the most time in the
profiles of our game.


While I am here, I think that whatever we do, CMake should have an option to
compile using SSE, with alternative C code provided for those who do not
want it.  Actually, one of the techniques we used at work dates from the time
when SSE2 was only available on some machines: we wrote a main loop to do the
bulk of the work and a remainder loop to finish the work in C code.  We could
then macro out the main loop for those who didn't have SSE2, so everything
fell to the remainder code, which then did the entire loop.
I believe the time has passed to make the SSE and SSE2 distinction, so either
someone can support SSE2, or they use the C code version.  It should be
implied that people who write SSE/SSE2 code have tested it against the C code
and have seen a significant gain in performance before considering its use.
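
The main loop plus remainder loop technique can be sketched like this. It is an illustration only: `SKETCH_USE_SSE` is a hypothetical macro standing in for whatever a CMake option would define, the function name is made up, and plain SSE on floats is used here rather than SSE2.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical switch; a build option could define it as 0 or 1.
#ifndef SKETCH_USE_SSE
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define SKETCH_USE_SSE 1
#else
#define SKETCH_USE_SSE 0
#endif
#endif

#if SKETCH_USE_SSE
#include <xmmintrin.h>
#endif

// out[i] = a[i] * b[i] for i in [0, n).
void multiplyArrays(const float* a, const float* b, float* out, std::size_t n)
{
    assert((a != nullptr && b != nullptr && out != nullptr) || n == 0);
    std::size_t i = 0;
#if SKETCH_USE_SSE
    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
#endif
    // Remainder loop in plain C. When the main loop is compiled out,
    // i is still 0 here and this loop does all of the work.
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```

The same source then serves both builds, and the plain loop doubles as the reference implementation that the SSE path can be tested against.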





James Killian
- Original Message - 
From: Benjamin Eikel [EMAIL PROTECTED]

To: OpenSceneGraph Users osg-users@lists.openscenegraph.org
Sent: Tuesday, July 29, 2008 7:28 AM
Subject: Re: [osg-users] Using SSE within OSG





Re: [osg-users] Using SSE within OSG

2008-07-29 Thread David Spilling
Benjamin,


 And please do not get me wrong. I do not want to stop your efforts to
 improve
 the performance of OSG; far from it!


Not necessarily my efforts - I'm just being the messenger...!

But putting assembler code into the
 project decrease the readability and serviceability of the code.


Absolutely.


 Furthermore
 it might be that it does not improve the speed at all.


I agree, and this is an oft quoted issue. Here, I think, only testing (and
experience) will help. For example, is it worth performing a single Vec3f
cross product in SSE? Probably not. But as a counter example, over on
osg-submissions (EDIT - and now here), one user (James) is getting large
performance gains from having SSE'd the invert_4x4 function.

I just want to suggest
 that you try to exhaust the possibility of modern compilers as much as
 possible. If you see any bottlenecks after that, it might make sense to
 include manual performance tuning.


I agree. This call-for-ideas was motivated by an understanding that several
people are pushing in the same direction, and it would be perhaps beneficial
to make use of this push.

David


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Gordon Tomlinson
MS does a very poor job.

I know most of our SSE is hand-coded in asm.
 
 


From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of David
Spilling
Sent: Tuesday, July 29, 2008 9:11 AM
To: OpenSceneGraph Users
Subject: Re: [osg-users] Using SSE within OSG




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Mathias Fröhlich

Hi,

On Tuesday 29 July 2008 15:18, James Killian wrote:
 I 100% agree with that as that is the first thing I did.  For the matrixf
 mult I got 50% improvement with aligned data and 35% with unaligned.  For
 the Invert4x4 I got 80% improvement with aligned and 70% aligned with
 unaligned.  I've submitted this code in as it was the most time spent in
 the profiles of our game.
I wonder what your scenegraph looks like.
Why do you have that many matrix operations?
Where are they called from?
Why do you need that many inverted matrices?

The invert method also makes me wonder. As far as I can tell, you do not need
inverted matrices to do cull and draw, at least not in quantities that make
that method appear in profiles.

Do you compute intersection tests where you need that inverse?
And what kind of matrices are in your code that you really need the full 4x4
inverse? Almost always the cheaper 3x4 variant can be used for usual
transforms.
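
For reference, a minimal sketch of such a cheaper inverse for an affine transform p' = L * p + t, where L is the 3x3 linear part and t the translation. The inverse is p = L^-1 * p' - L^-1 * t, so only a 3x3 inverse is needed. The `Affine` type and function names are hypothetical, and the column-vector form is used here purely for readability; it is not OSG's storage layout.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical affine transform: p' = l * p + t.
struct Affine {
    double l[3][3];
    double t[3];
};

// Writes the inverse of `in` to `out`. Returns false if the linear part
// is (numerically) singular. `in` and `out` must be different objects.
bool invertAffine(const Affine& in, Affine& out)
{
    assert(&in != &out);
    const double (*a)[3] = in.l;
    const double c00 = a[1][1] * a[2][2] - a[1][2] * a[2][1];
    const double c01 = a[1][2] * a[2][0] - a[1][0] * a[2][2];
    const double c02 = a[1][0] * a[2][1] - a[1][1] * a[2][0];
    const double det = a[0][0] * c00 + a[0][1] * c01 + a[0][2] * c02;
    if (std::fabs(det) < 1e-12)
        return false;
    const double s = 1.0 / det;
    // Inverse of the 3x3 part: transposed cofactor matrix divided by det.
    out.l[0][0] = c00 * s;
    out.l[0][1] = (a[0][2] * a[2][1] - a[0][1] * a[2][2]) * s;
    out.l[0][2] = (a[0][1] * a[1][2] - a[0][2] * a[1][1]) * s;
    out.l[1][0] = c01 * s;
    out.l[1][1] = (a[0][0] * a[2][2] - a[0][2] * a[2][0]) * s;
    out.l[1][2] = (a[0][2] * a[1][0] - a[0][0] * a[1][2]) * s;
    out.l[2][0] = c02 * s;
    out.l[2][1] = (a[0][1] * a[2][0] - a[0][0] * a[2][1]) * s;
    out.l[2][2] = (a[0][0] * a[1][1] - a[0][1] * a[1][0]) * s;
    // Inverse translation: -L^-1 * t.
    for (int i = 0; i < 3; ++i)
        out.t[i] = -(out.l[i][0] * in.t[0] + out.l[i][1] * in.t[1] +
                     out.l[i][2] * in.t[2]);
    return true;
}

// Applies the transform to the point p and stores the result in r.
void applyAffine(const Affine& m, const double p[3], double r[3])
{
    for (int i = 0; i < 3; ++i)
        r[i] = m.l[i][0] * p[0] + m.l[i][1] * p[1] + m.l[i][2] * p[2] + m.t[i];
}
```

For a pure rotation plus translation the 3x3 inverse reduces further to a transpose, which is cheaper still.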

Well, I ask because I get the impression that the real bottleneck - where
you can gain much performance - is somewhere else.

Greetings

Mathias



Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

Paul asked me the same question a few days ago, and I just realized that we
took that offline so I'll repost here:
One of the things I should add is the actual profile dump, since that shows
a more comprehensive picture.  The actual game demo is free to download and
play here:
http://www.fringe-online.com/

The current installer of the game does not have my optimization in it yet,
but it should be noted even with the optimization the postmult is still at
the top.  The Invert4x4() however got pushed way down to the bottom (which
is great).  I'll post my profiles when I get home.


---snip---
That is a good question, and I believe the answer is collision detection.  I
should disable it and run the numbers again to confirm.  All ships fire
machine guns at a fast rate, and each bullet that gets close enough to a
bounding box/sphere region has to go through the osg code to get the precise
point where it hit.  Rick would probably have a better explanation of this
and other factors since he coded the bulk of the collision detection (and
osg integration).  Most of my development time on the game has been spent on
the physics and flight dynamics (and now optimization).
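
The "close enough to a bounding sphere" step is the cheap filter that decides whether the precise test runs at all. A minimal sketch of such a broad-phase check, assuming a bullet's path during one frame is the straight segment p0 -> p1; the types and names are hypothetical and this is not the game's actual collision code:

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };

double dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// True when the segment p0 -> p1 comes within `radius` of `center`.
// Only candidates that pass would go on to the precise intersection test.
bool segmentTouchesSphere(const Vec3& p0, const Vec3& p1,
                          const Vec3& center, double radius)
{
    assert(radius >= 0.0);
    const Vec3 d = {p1.x - p0.x, p1.y - p0.y, p1.z - p0.z};
    const Vec3 m = {center.x - p0.x, center.y - p0.y, center.z - p0.z};
    const double len2 = dot(d, d);
    // Parameter of the point on the segment closest to the center.
    double t = (len2 > 0.0) ? dot(m, d) / len2 : 0.0;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    const Vec3 gap = {m.x - t * d.x, m.y - t * d.y, m.z - t * d.z};
    return dot(gap, gap) <= radius * radius;
}
```

The check needs no square root and no matrix work, so the more bullets it rejects, the fewer inversions and multiplications the precise path has to perform.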

It may turn out that we could find some caching technique to reduce the
collision stress (like the KdTree), but in the meantime, matrix
optimizations can benefit the whole community if we do them right, and I
would like to make some contribution to the community.


- Original Message - 
From: Paul Melis [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, July 28, 2008 9:05 AM
Subject: [Fwd: Re: [osg-users] [osg-submissions] Matrixf multiply
Optimization]


 Hi James,

 I noted your posts on the osg-users list on the Matrix multiplication
 optimizations using SSE.
 You mention "Our game uses approximately 25% of all processing to these
 functions [...]". What on earth takes up so much matrix computing time
 in your game?

 Regards,
 Paul

---snip---

- Original Message - 
From: Mathias Fröhlich [EMAIL PROTECTED]
To: OpenSceneGraph Users osg-users@lists.openscenegraph.org
Sent: Tuesday, July 29, 2008 9:31 AM
Subject: Re: [osg-users] Using SSE within OSG



Hi,

On Tuesday 29 July 2008 15:18, James Killian wrote:
 I 100% agree with that as that is the first thing I did.  For the matrixf
 mult I got 50% improvement with aligned data and 35% with unaligned.  For
 the Invert4x4 I got 80% improvement with aligned and 70% aligned with
 unaligned.  I've submitted this code in as it was the most time spent in
 the profiles of our game.
I wonder what your scenegraph looks like.
Why do you have that much matrix operations?
Where are they called from?
Why do you need that many inverted matrices?

Also the invert method makes me wonder. As far as I can tell, you do not
need
inverted matrices to do cull and draw. At least not in a magnitude that
makes
that method appear in profiles.

Do you compute intersection tests where you need that inverse?
And what kind of matrices are in your code that you really need the full 4x4
inverse? Almost alway the cheaper 3x4 variant can be used for usual
transforms.

Well, I ask that because I get the impression that the real botteneck -
where
you can gain much performance - is somwhere different.

Greetings

Mathias

-- 
Dr. Mathias Fröhlich, science + computing ag, Software Solutions
Hagellocher Weg 71-75, D-72070 Tuebingen, Germany
Phone: +49 7071 9457-268, Fax: +49 7071 9457-511
-- 
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Florian Geyer,
Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Prof. Dr. Hanns Ruder
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196


___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org



Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Sebastian Messerschmidt

Hi All,

Regarding question 2:
Wouldn't it be possible to dynamically link different versions of the 
OSG DLLs?
So there would be two versions of the DLLs, one with the 
SSE optimizations and one with the straightforward code.
I've seen examples of games some years ago that linked different 
versions of DLLs depending on the machine the program was run on.
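
A minimal sketch of how such a run-time choice could look, assuming an x86 target. The function names and the two library names are hypothetical and only illustrate the idea; `__cpuid` (MSVC) and `__builtin_cpu_supports` (GCC/Clang) are the compiler facilities used to query the CPU.

```cpp
#include <string>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Returns true if the CPU reports SSE2 support (x86 only; false elsewhere).
inline bool cpuHasSSE2()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4] = {0, 0, 0, 0};
    __cpuid(info, 1);                    // CPUID leaf 1: feature flags
    return (info[3] & (1 << 26)) != 0;   // EDX bit 26 = SSE2
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

// Hypothetical library names, for illustration only.
inline std::string pickMathLibrary(bool hasSSE2)
{
    return hasSSE2 ? "osgmath_sse2" : "osgmath_generic";
}
```

The selected name would then be handed to the platform loader (for example `LoadLibrary` on Windows or `dlopen` on POSIX systems), so only one of the two builds is mapped into the process.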


cheers
Sebastian

Dear All,

There's a discussion going on at the moment over in osg-submissions, 
and it has been raised that this ought to be opened up to the 
non-submissions community for feedback. Note that the following is my 
reading of the issues, and certainly doesn't represent the consensus 
view of the osg-submissions crowd, so feel free to challenge what I'm 
saying!


*Background*
Several people already use SSE instructions 
(http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) alongside OSG 
to obtain speed improvements through parallelising math operations. 
The general point that has been raised is that under-the-hood, OSG 
does quite a lot that could benefit from the potential performance 
boost given by SSE operations. Obvious targets include some of the 
Vec/Matrix routines, for example. SSE is now sufficiently mainstream 
that the risk of processor incompatibility is felt to be low.


*Question 1 : Where could the core OSG include SSE?*
Most people follow the sensible approach of profiling to determine 
their bottlenecks, and then optimising particular methods in order to 
gain speed-up. This would be a sensible approach to follow, as SSEing 
all methods would probably be a waste of effort.  It would therefore 
be instructive firstly to know if anybody is using SSE with OSG, and 
where. Secondly, for those who have profiling data and know how much 
time they spend in Vec/Matrix/whatever methods, it would be useful to 
know which methods the community considered good targets for SSEing. 
Any other maths heavy lifting going on? (e.g. Intersection testing? 
Delaunay triangulation? etc.)


*Question 2 : How could the core OSG include SSE?*
SSE code benefits from aligned data.  Hence there are several ways in 
which OSG could include SSE:


a) Provide an aligned Vec4f and aligned Matrix4f class, which support 
SSE operations. This would appear (to me) to be the least intrusive.


b) Provide branching code within the existing Vec4/Matrix4 methods for 
detecting whether data is aligned, and performing the correct 
operations. This would appear to me to be the most user-transparent. 
Although it would appear to be a performance hit, testing so far on 
some specific code would support the argument that the speed gains 
from SSE outweigh the branch cost; more testing needed, I guess.


c) Robert suggested that SSE enabled array operators (e.g. providing a 
cross-product operator for Vec3Array) might be appropriate and provide 
the best speed improvement for those who want it. Certainly using SSE 
on large array type data sets is where one gains the most performance 
improvement.
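
To make options (a) and (b) a little more concrete, here is a minimal sketch of a run-time alignment check that picks between aligned and unaligned SSE loads. The names `isAligned16` and `add4` are hypothetical and not part of OSG; the sketch assumes an x86 compiler that provides `<xmmintrin.h>`.

```cpp
#include <cstdint>
#include <xmmintrin.h>

// True if the pointer sits on a 16-byte boundary, as the aligned SSE
// load/store instructions require.
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Adds two 4-float vectors. Takes the aligned path only when all three
// pointers are 16-byte aligned, otherwise falls back to unaligned access.
inline void add4(const float* a, const float* b, float* out)
{
    if (isAligned16(a) && isAligned16(b) && isAligned16(out)) {
        _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    } else {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
}
```

With a dedicated aligned class as in (a), the check disappears because the type guarantees the alignment; with (b), the branch is paid on every call, which is the cost that needs measuring.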


This question includes the possibility of linking out to, or pulling 
source code out of, an external optimised math library.


Any other suggestions?

*Question 3 : (possibly the biggest) Should the core OSG include SSE?*
There are several downsides to including SSE. Firstly, x-platform 
provision of SSE may be tricky due to the way different compilers 
define aligned data, and how SSE instructions are used within the 
code. I personally don't have much experience here, so any feedback on 
x-platform issues is useful.
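
As a rough illustration of the compiler differences in declaring aligned data, the sketch below wraps the MSVC and GCC/Clang spellings in one macro. The macro name `OSG_SKETCH_ALIGN16` and the struct `AlignedVec4f` are hypothetical; the last branch assumes a compiler with C++11 `alignas`.

```cpp
// Each compiler family spells "align this declaration to 16 bytes" differently.
#if defined(_MSC_VER)
  #define OSG_SKETCH_ALIGN16(decl) __declspec(align(16)) decl
#elif defined(__GNUC__)   // also defined by Clang
  #define OSG_SKETCH_ALIGN16(decl) decl __attribute__((aligned(16)))
#else
  #define OSG_SKETCH_ALIGN16(decl) alignas(16) decl
#endif

// A 4-float vector whose storage is 16-byte aligned, so it can be used
// with the aligned SSE load/store instructions.
struct AlignedVec4f
{
    OSG_SKETCH_ALIGN16(float v[4]);
};
```

Note that this only covers objects with static or automatic storage; heap allocations need an aligned allocator as well, which is a separate portability question.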


Secondly, the code readability drops, and the "use the source" 
argument may be trickier when many might not know much SSE.



So - your opinion, experience and suggestions welcome!

David










Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

I have to disagree, using VS 7 and up to VS 9.  It has done a terrific job
with the instruction scheduling.  We used to use that technique of asm back
when P3's MMX were around and we had VS 6.  We had one engineer who would
use DOS and MASM.  Times have changed (we had to let him go); intrinsics
have proved to optimize quite well, as we use the AMD code analyzer to
confirm that the U and V pipes remain full due to well scheduled placement
of the instructions.

I should add that we avoid using any MMX instructions like the plague
nowadays.

- Original Message - 
From: Gordon Tomlinson [EMAIL PROTECTED]
To: 'OpenSceneGraph Users' osg-users@lists.openscenegraph.org
Sent: Tuesday, July 29, 2008 8:56 AM
Subject: Re: [osg-users] Using SSE within OSG


 MS does a very poor job,

 I know most of our SSE is asm'ed



   _

 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of David
 Spilling
 Sent: Tuesday, July 29, 2008 9:11 AM
 To: OpenSceneGraph Users
 Subject: Re: [osg-users] Using SSE within OSG


 Benjamin,




 may I suggest that you check the assembler code that the compilers create
 when compiling the OSG code?



 ... g++ with -march=core2 -O3 (see man page for description
 of parameters) the compiler automatically uses SSE


 I don't have much recent Linux/gcc experience, but can certainly attest that
 the MS compilers don't do a good job of spotting SSE vectorisation
 possibilities, even when you tell them to optimise with them (and this is
 from reading the generated assembler). In MS you can insert SSE intrinsics,
 which still allow the compiler to optimise the execution order and
 memory/register usage e.g. based on cycle counts.

 I understand (from other sources) that the Intel vectorising compilers are
 much better at this, naturally.

 Perhaps this is then all only a MS/Windows thing?

 David













Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Mathias Fröhlich

James,

On Tuesday 29 July 2008 16:59, James Killian wrote:
 Paul asked me the same question a few days ago, and I just realized that we
 took that offline so I'll repost here:
 One of the things I should add is the actual profile dump, since that shows
 a more comprehensive picture.  The actual game demo is free to download and
 play here:
 http://www.fringe-online.com/

 The current installer of the game does not have my optimization in it yet,
 but it should be noted even with the optimization the postmult is still at
 the top.  The Invert4x4() however got pushed way down to the bottom (which
 is great).  I'll post my profiles when I get home.


 -snip--
- ---
 That is a good question, and I believe the answer is collision detection. 
 I should disable it and run the numbers again to confirm.  All ships fire
 machine guns at a fast rate, and each bullet that gets close enough to a
 bounding box/sphere region has to go through the osg code to get the
 precise point where it hit.  Rick would probably have a better explanation
 of this and other factors since he coded the bulk of the collision
 detection (and osg integration).  Most of my time development in the game
 has been spent on the physics and flight dynamics (and now optimization).

 It may turn out that we could find some caching technique to reduce the
 collision stress (like the KBDtree), but in the mean time, matrix
 optimizations can benefit the whole community if we do them right, and I
 would like to make some contribution to the community.

OK, there is much you can do here for the collision detection.
I expect that you should optimize that algorithmically and gain orders of
magnitude without SSE.

So the question is more whether such optimizations will bring performance 
improvements for the usual scenegraph case.

Greetings

Mathias





Re: [osg-users] Using SSE within OSG

2008-07-29 Thread David Spilling
James,


 I have to disagree, using VS 7 and up to VS 9.


Just to clarify - what are you disagreeing with? Do you find that MS
compilers will produce SSE vectorised code _without_ use of intrinsics or
raw __asm?

David


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Gordon Tomlinson
Hi

I can only go by our low-level masters, and their profiling shows that the
hand-rolled asm'ed SSE code is significantly faster than MS VS compiled code.

Obviously this is our experience in our environments, and we are
computationally heavy, moving and editing terabytes of data around in real time.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of James
Killian
Sent: Tuesday, July 29, 2008 11:38 AM
To: OpenSceneGraph Users
Subject: Re: [osg-users] Using SSE within OSG


Sorry...

I interpreted Gordon's response as follows:
MS does a poor job (insert here "with compiling SSE intrinsics"); as a result
most of his SSE is asm'ed.
The asm'ed approach is where you don't trust the compiler to do the right
thing with intrinsics, where it has the flexibility of scheduling and
assigning registers etc.

I disagree that MS does a poor job compiling intrinsic code, and I think you
should not *ever need to resort to __asm anymore.
*this is not absolute; there was once a rare case where we found a strange
anomaly, later solved by an unintuitive C code change

 Do you find that MS compilers will produce SSE vectorised code
 _without_ use of intrinsics or raw __asm?
Ah, this is a tricky question.  There is in fact an option in the VS 8 and
VS 9 project settings to generate SSE or SSE2 code.  What this does is
evaluate C code and try to use SSE for it.  I was surprised to find
that this actually lowered the performance of C code, especially the C code
for matrixf.  I'm so glad that the project settings for osg do not turn this
on, and I'd recommend not using it, but instead writing intrinsics ourselves
for places that need it.

I hope this clears things up.




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

Thanks for the reply.  We could resolve this argument if any one of the low
level masters cares to email me offline [EMAIL PROTECTED], but I'd
be open to believe an argument could be made for the context of moving
around large amounts of data.

In regards to moving data, SSE/SSE2 is really better suited for code which
requires a lot of math like 3d computations.  Perhaps the heart of SSE would
be the packed multiply and add, where it can do 4 multiplies and 4 adds in
one clock cycle (or a half cycle if paired properly).  Thus, code which
requires heavy math like many of the OSG matrix computations could really
benefit from it.  I would profile cases like this against hand written
assembly since this is what OSG would care about.

I looked at the assembly code produced by VS 9 for the optimized matrixf
multiply, and I could not have scheduled it better myself by hand.
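
For readers who have not used the intrinsics, here is a minimal sketch of the packed multiply and add applied to a 4x4 float matrix product. This is not the code submitted to osg-submissions: `mult4x4SSE` is a hypothetical name, the matrices are assumed to be 16 contiguous floats in row-major order, and unaligned loads are used so that no alignment is assumed. It requires an x86 compiler that provides `<xmmintrin.h>`.

```cpp
#include <xmmintrin.h>

// out = a * b for row-major 4x4 float matrices (16 contiguous floats each).
// Row i of the result is a[i][0]*b.row0 + a[i][1]*b.row1 + a[i][2]*b.row2
// + a[i][3]*b.row3, so each step is one packed multiply and one packed add
// over four floats.
inline void mult4x4SSE(const float* a, const float* b, float* out)
{
    // Load the four rows of b once; they are reused for every row of a.
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);

    for (int i = 0; i < 4; ++i) {
        // Broadcast each scalar of row i of a and accumulate the scaled rows.
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

Because this is written with intrinsics rather than inline assembly, register allocation and instruction scheduling are left to the compiler, which is the property being argued about in this thread.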

