Re: [osg-users] Using SSE within OSG

2008-08-05 Thread James Killian

I would like to take a moment to show a snapshot of how these optimizations
have impacted our game.  The data are frames per second: the first column is
an average of the lowest readings, the middle column is the overall average,
and the right column tracks the highest readings.  We do at least 3 runs to
get a good, solid average.

Here is the fps without any of the SSE optimizations:
Framerates: (23.3, 41.3, 54.2)
Framerates: (27.5, 41.9, 50.4)
Framerates: (30.6, 41.8, 53.3)
AVERAGE:(27.1, 41.7, 52.6)


Here is my submission with SSE optimizations:
Framerates: (30.2, 48.7, 58.1)
Framerates: (30.9, 49.6, 60.5)
Framerates: (36.8, 50.0, 60.5)
AVERAGE:(32.6, 49.4, 59.7)

Here is a combination of my submission and Mathias's submission:
VS 9 (current) "..\Game Scripts\Miramar_001.lua" -perf 0 60 0 -stats 10 60
VS9_Perf.txt
Framerates: (40.9, 53.2, 65.6)
Framerates: (34.5, 50.3, 60.9)
Framerates: (39.5, 49.9, 63.2)
AVERAGE:(38.3, 51.1, 63.2)


So in this test, our two sets of optimizations together have yielded a solid
gain of roughly 10 fps on this machine (the overall average rose from 41.7
to 51.1).
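
For anyone who wants to redo the arithmetic, here is a minimal sketch of how
the AVERAGE rows can be reproduced from the individual runs. The type and
function names are made up for this example; only the numbers come from the
figures above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One benchmark run: (low, overall average, high) in frames per second.
typedef std::array<double, 3> Run;

// Column-wise mean over a set of runs.
Run averageRuns(const std::vector<Run>& runs)
{
    assert(!runs.empty());
    Run mean = {{0.0, 0.0, 0.0}};
    for (std::size_t i = 0; i < runs.size(); ++i)
        for (std::size_t c = 0; c < 3; ++c)
            mean[c] += runs[i][c];
    for (std::size_t c = 0; c < 3; ++c)
        mean[c] /= static_cast<double>(runs.size());
    return mean;
}

// Runs without any SSE optimizations (copied from the post).
std::vector<Run> baselineRuns()
{
    std::vector<Run> r;
    r.push_back(Run{{23.3, 41.3, 54.2}});
    r.push_back(Run{{27.5, 41.9, 50.4}});
    r.push_back(Run{{30.6, 41.8, 53.3}});
    return r;
}

// Runs with both submissions combined (copied from the post).
std::vector<Run> combinedRuns()
{
    std::vector<Run> r;
    r.push_back(Run{{40.9, 53.2, 65.6}});
    r.push_back(Run{{34.5, 50.3, 60.9}});
    r.push_back(Run{{39.5, 49.9, 63.2}});
    return r;
}

// Difference of the overall-average column, combined minus baseline.
double overallGain()
{
    return averageRuns(combinedRuns())[1] - averageRuns(baselineRuns())[1];
}
```

The middle column is the one the "+10 fps" remark is based on; the low and
high columns can be compared the same way by changing the index.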

___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-08-05 Thread James Killian


Thanks for posting this link.  I'll definitely want to look at this.

James Killian


Re: [osg-users] Using SSE within OSG

2008-08-05 Thread Benjamin Eikel
Hello,

Some days ago I stumbled upon a library: liboil [1]. Maybe some of the 
routines implemented there could be used for OSG.
The library contains various functions (e.g. arithmetic ones) that are 
optimized for different processor architectures (it uses SSE or AltiVec, for 
example). Using these functions might be easier than implementing them 
anew. Functions needed by OSG that are not yet part of liboil could be added 
to it.

Regards,
Benjamin

[1] http://liboil.freedesktop.org/wiki/


Re: [osg-users] Using SSE within OSG

2008-08-04 Thread James Killian


Thanks for the reply.  The delay in review will buy me some time to present 
the aligned Matrixf.  It is true that these maths are not the largest 
bottleneck even for our game, but they are still significant, especially 
during heavy use of collision detection.  I would like to know whether 
Mathias's submission could be considered now, as 99% of it is a C solution 
that reduces the number of multiplies needed.  It did bring the numbers down 
in our game too.  If so, I would like to write SSE forms of these new 
functions (e.g. preMultTranslate) for the aligned Matrixf and make them run 
even faster.
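
To illustrate what "reducing the number of multiplies" means for something
like preMultTranslate, here is a minimal sketch. It is not the code from
Mathias's patch: Mat4 is a stand-in type, and I am assuming a row-major
matrix used with row vectors (v' = v * M), so that the translation lives in
the last row.

```cpp
#include <cassert>

// Plain 4x4 float matrix, row-major, used with row vectors (v' = v * M).
// A stand-in for illustration, not osg::Matrixf itself.
struct Mat4
{
    float m[4][4];
};

Mat4 identity()
{
    Mat4 r;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            r.m[i][j] = (i == j) ? 1.0f : 0.0f;
    return r;
}

// General product a * b: 64 multiplies.
Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
        {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a.m[i][k] * b.m[k][j];
            r.m[i][j] = s;
        }
    return r;
}

// Specialised M = T(t) * M. T differs from the identity only in its last
// row, so only the last row of M changes: at most 12 multiplies.
void preMultTranslate(Mat4& mat, const float t[3])
{
    for (int i = 0; i < 3; ++i)
    {
        if (t[i] == 0.0f) continue;  // nothing to add for this axis
        for (int j = 0; j < 4; ++j)
            mat.m[3][j] += t[i] * mat.m[i][j];
    }
}

// Self-check: the shortcut must agree with the general product.
void checkPreMultTranslate()
{
    Mat4 m = identity();
    m.m[0][1] = 2.0f; m.m[1][2] = 3.0f; m.m[2][3] = 5.0f; m.m[3][0] = 4.0f;
    const float t[3] = {1.0f, 2.0f, 3.0f};

    Mat4 T = identity();
    T.m[3][0] = t[0]; T.m[3][1] = t[1]; T.m[3][2] = t[2];
    const Mat4 expected = multiply(T, m);

    preMultTranslate(m, t);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            assert(m.m[i][j] == expected.m[i][j]);
}
```

The inner loop over j touches one contiguous row of four floats, which is
the part an SSE version on an aligned matrix could handle with packed
instructions.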


I would be interested in pursuing the traversal-related methods, but I have 
a feeling the fix would be a design change in C rather than an SSE 
one. However, if this performance increase is not the top priority on anyone's 
list, I'd be willing to look into it and see if I can help.




James Killian


Re: [osg-users] Using SSE within OSG

2008-08-03 Thread Robert Osfield
Hi Guys,

I've read through the correspondence on this issue, but won't dive in
with reviewing submissions on this topic till well after 2.6.0 is out
the door.

As a general note, there seem to be two related topics here: data
alignment and SSE instructions. They are of course related, but
I'd suggest we tackle them separately.
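
As a rough illustration of the alignment half on its own, here is a sketch
that uses no SSE instructions at all. The type names are hypothetical and
this is not a proposal for how osg::Matrixf should be declared; it only
shows what a 16-byte alignment guarantee looks like in C++11.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical aligned matrix type; the name is made up for this sketch.
struct alignas(16) AlignedMatrixf
{
    float m[4][4];
};

// Same data without the alignment request.
struct PlainMatrixf
{
    float m[4][4];  // only guaranteed to be aligned like a float
};

// True when p sits on a 16-byte boundary, which is what aligned SSE
// loads and stores require.
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) % 16) == 0;
}

static_assert(alignof(AlignedMatrixf) == 16, "type carries its alignment");
static_assert(sizeof(AlignedMatrixf) == 64, "16 floats, no extra padding");
static_assert(alignof(PlainMatrixf) == alignof(float), "no stronger guarantee");

// Objects on the stack and in arrays honour the requested alignment.
void alignmentDemo()
{
    AlignedMatrixf onStack;
    AlignedMatrixf inArray[3];
    assert(isAligned16(&onStack));
    for (int i = 0; i < 3; ++i)
        assert(isAligned16(&inArray[i]));
}
```

Heap allocations are a separate question: before C++17, operator new is not
required to honour alignas, so an aligned type may also need its own
allocation path.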

As another general note, in my experience the most common bottleneck
of scene graph based applications is CPU memory bandwidth. Maths
functions are much less of a bottleneck, and their cost is in fact
largely hidden by the cost of waiting for the cache to be filled.  The
performance profiles provided in this thread suggest this as well,
with the traversal related methods being the biggest bottleneck.  How
to address this bottleneck is a topic for another thread.

Robert.


Re: [osg-users] Using SSE within OSG

2008-07-30 Thread Mathias Fröhlich

James,

The most obvious problem: Group::traverse ...
Is one of the visitors you use a TRAVERSE_ALL_CHILDREN visitor? If so, the 
Group::traverse profile makes sense. Make sure that you traverse only those 
subgraphs you need to traverse. That minimizes those calls too and will 
help overall!
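
To make the pruning idea concrete, here is a toy sketch. It deliberately
does not use the OSG classes: Node, CountingVisitor and the mask bits are
invented for this example, and the mask test is just one assumed way of
deciding which subgraphs a visitor may skip.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy scene graph node (not osg::Node): a mask plus owned children.
struct Node
{
    unsigned mask;
    std::vector<std::unique_ptr<Node> > children;

    explicit Node(unsigned m) : mask(m) {}

    Node* add(unsigned m)
    {
        children.push_back(std::unique_ptr<Node>(new Node(m)));
        return children.back().get();
    }
};

// Counts visited nodes. A subtree is skipped as soon as its root's mask
// shares no bit with the visitor's traversal mask.
struct CountingVisitor
{
    unsigned traversalMask;
    int visited;

    explicit CountingVisitor(unsigned m) : traversalMask(m), visited(0) {}

    void apply(const Node& n)
    {
        if ((n.mask & traversalMask) == 0) return;  // prune whole subtree
        ++visited;
        for (std::size_t i = 0; i < n.children.size(); ++i)
            apply(*n.children[i]);
    }
};

const unsigned ANIMATED = 0x1;  // hypothetical mask bits
const unsigned STATIC   = 0x2;

void traversalDemo()
{
    Node root(ANIMATED | STATIC);
    Node* ship = root.add(ANIMATED);
    ship->add(ANIMATED);
    ship->add(ANIMATED);
    Node* terrain = root.add(STATIC);
    for (int i = 0; i < 100; ++i) terrain->add(STATIC);

    CountingVisitor all(ANIMATED | STATIC);
    all.apply(root);
    assert(all.visited == 105);   // root + ship + 2 + terrain + 100

    CountingVisitor update(ANIMATED);
    update.apply(root);
    assert(update.visited == 4);  // root, ship and its two children
}
```

In this toy case the update visitor touches 4 nodes instead of 105, which is
the kind of saving meant by traversing only the subgraphs you need.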

Ok, so the PositionAttitudeTransform is the matrix multiplication problem.
Try that specialized transform patch I have sent to you.
That will help a bit here. But you might do even better:

You are talking about a game. So I expect that you have transform nodes to 
animate parts of the scenegraph.
I agree that you will need the full PositionAttitudeTransform in some cases. 

But I can well imagine such a game having special transforms where you can 
make use of specialized implementations.
Specialized with respect to:

* The kind of the transform.
Often you just have to rotate around the origin, nothing more. Or you might 
have some linear transform to make something move, but no rotation and no 
scaling.
For this case, implement your own, say, LinearTransform or RotationTransform nodes 
derived from osg::Transform, and reimplement the computeLocalToWorldMatrix, 
computeWorldToLocalMatrix and computeBound methods with something more 
optimized. Maybe use the specialized preMultTranslate or equivalent methods 
from the patch I sent. You can avoid many matrix multiplications that way.

* Recomputation of the bounding sphere.
Sometimes with such special transforms, you do not need to dirty the bounding 
sphere.
Take a rotation. Say you have a leg that can rotate around the knee. Just 
compute the bounding sphere for all possible values of that 
rotation. With that you will have slightly worse bounding spheres, but you do 
not need to walk large scenegraphs to invalidate the bound, and you do not 
need to recompute the bound for large parts of the scene again and again.
Take, for example, a human body with many transform nodes for arms, legs, 
fingers and so on. The bounding sphere of the body will not be much 
larger with that kind of bounding sphere than in the exact case. The 
interesting cull case is culling away the *whole* human body, which will 
happen about as often as with the exact bounding spheres.
Translations along an axis, for example, are a bit more difficult in this case, 
since they would blow up the sphere to infinity if you wanted to cover any 
translation value. But if you have a translation axis, a maximum scalar value, 
a minimum scalar value and a current scalar translation value, you can do 
about the same.
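
The two conservative bounds described above could be sketched roughly as
follows. Vec3, Sphere and the function names are stand-ins made up for this
example (not the OSG types), and the translation axis is assumed to have
unit length.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Sphere { Vec3 center; float radius; };

inline float length(const Vec3& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

inline Vec3 sub(const Vec3& a, const Vec3& b)
{
    return Vec3{a.x - b.x, a.y - b.y, a.z - b.z};
}

// a + d * s
inline Vec3 madd(const Vec3& a, const Vec3& d, float s)
{
    return Vec3{a.x + d.x * s, a.y + d.y * s, a.z + d.z * s};
}

// Sphere that contains `child` for every rotation about `pivot`: the
// child's center always stays at the same distance from the pivot.
Sphere boundForAnyRotation(const Sphere& child, const Vec3& pivot)
{
    return Sphere{pivot, length(sub(child.center, pivot)) + child.radius};
}

// Sphere that contains `child` for every translation s * axis with
// tMin <= s <= tMax (axis is assumed to be a unit vector).
Sphere boundForTranslationRange(const Sphere& child, const Vec3& axis,
                                float tMin, float tMax)
{
    assert(tMin <= tMax);
    const float mid  = 0.5f * (tMin + tMax);
    const float half = 0.5f * (tMax - tMin);
    return Sphere{madd(child.center, axis, mid), child.radius + half};
}

// True when `inner` lies completely inside `outer` (small tolerance).
inline bool contains(const Sphere& outer, const Sphere& inner)
{
    return length(sub(inner.center, outer.center)) + inner.radius
           <= outer.radius + 1e-5f;
}

void boundDemo()
{
    const Sphere leg = {Vec3{1.0f, 0.0f, 0.0f}, 0.5f};
    const Vec3 knee = {0.0f, 0.0f, 0.0f};
    const Sphere rot = boundForAnyRotation(leg, knee);
    const Sphere legRotated = {Vec3{0.0f, 1.0f, 0.0f}, 0.5f};  // 90 degrees
    assert(contains(rot, leg) && contains(rot, legRotated));

    const Vec3 zAxis = {0.0f, 0.0f, 1.0f};
    const Sphere slide = boundForTranslationRange(leg, zAxis, -1.0f, 3.0f);
    const Sphere atMin = {Vec3{1.0f, 0.0f, -1.0f}, 0.5f};
    const Sphere atMax = {Vec3{1.0f, 0.0f, 3.0f}, 0.5f};
    assert(contains(slide, atMin) && contains(slide, atMax));
}
```

Both bounds are computed once from the allowed range, so changing the
current rotation or translation value never needs to dirty them.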

Hope this helps.

Greetings

Mathias

-- 
Dr. Mathias Fröhlich, science + computing ag, Software Solutions
Hagellocher Weg 71-75, D-72070 Tuebingen, Germany
Phone: +49 7071 9457-268, Fax: +49 7071 9457-511
-- 
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Florian Geyer,
Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Prof. Dr. Hanns Ruder
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196 




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian
OK, I thought it was the collision detection, but that is not the case. Here 
are some of the numbers with collision disabled:


CS:EIP      Symbol + Offset                                            64-bit  Timer samples
0x10083cc0  osg::Group::traverse                                               1434
0x10083d60  osg::Group::computeBound                                           1391
0x10099ca0  osg::Matrixf::mult                                                 833
0x1001a9d0  osg::PositionAttitudeTransform::accept                             409
0x10099370  osg::Matrixf::preMult                                              407
0x1000e840  osg::AnimationPathCallback::update                                 352
0x1009bb50  osg::Node::dirtyBound                                              340
0x100dcee0  osg::Transform::computeBound                                       318
0x100a9df0  osg::PositionAttitudeTransform::computeLocalToWorldMatrix          294
0x100126f0  osg::AnimationPath::getInterpolatedControlPoint                    285
0x10009c70  osg::AnimationPathCallback::setPause                               251
0x1000c8e0  osg::StateSet::requiresUpdateTraversal                             228


12 functions, 806 instructions, Total: 6542 samples, 50.85% of samples in 
the module, 16.36% of total session samples


OK, here it is with collision detection:
===
CS:EIP      Symbol + Offset                                            64-bit  Timer samples
0x10083cc0  osg::Group::traverse                                               1382
0x10083d60  osg::Group::computeBound                                           1237
0x10099ca0  osg::Matrixf::mult                                                 924
0x10099370  osg::Matrixf::preMult                                              600
0x1001a9d0  osg::PositionAttitudeTransform::accept                             394
0x1000e840  osg::AnimationPathCallback::update                                 292
0x100dcee0  osg::Transform::computeBound                                       284
0x1009bb50  osg::Node::dirtyBound                                              280
0x100126f0  osg::AnimationPath::getInterpolatedControlPoint                    274
0x100a9df0  osg::PositionAttitudeTransform::computeLocalToWorldMatrix          230
0x10009c70  osg::AnimationPathCallback::setPause                               225
0x10002e00  osg::Matrixf::preMult                                              210


12 functions, 846 instructions, Total: 6332 samples, 51.35% of samples in 
the module, 15.83% of total session samples



Here it is with both Matrixf and invert4x4 optimized:
=
CS:EIP      Symbol + Offset                                            64-bit  Timer samples
0x10083cb0  osg::Group::traverse                                               1362
0x10083d50  osg::Group::computeBound                                           1142
0x1009a180  osg::Matrixf::mult                                                 922
0x1001ac70  osg::PositionAttitudeTransform::accept                             381
0x1000e650  osg::AnimationPathCallback::update                                 354
0x100dcf30  osg::Transform::computeBound                                       306
0x1009bcf0  osg::Node::dirtyBound                                              274
0x100124f0  osg::AnimationPath::getInterpolatedControlPoint                    257
0x1009a340  osg::Matrixf::invert_4x3                                           252
0x10009bb0  osg::GraphicsContext::ScreenIdentifier::~ScreenIdentifier          248
0x100a9b20  osg::PositionAttitudeTransform::computeLocalToWorldMatrix          245
0x10002d00  osg::Matrixf::preMult                                              214
0x10002c70  osg::Matrixf::preMult                                              197
0x1000c6b0  osg::StateSet::requiresUpdateTraversal                             178


14 functions, 829 instructions, Total: 6332 samples, 54.18% of samples in 
the module, 15.84% of total session samples


For the optimized profile, it did push Invert4x4 way down to the bottom 
(I did not want to show that here).  If you want the complete list, let me 
know and I'll resend it as attachments.  Actually, you cannot really use this to 
see how much better the performance is, because the Matrixf mult is still 
needed just as much; the real way to tell would be to show the framerate 
of the game. However, here is where I can show the optimization:

Average time using the D3DXMATRIX class:   402.54
Average time using the SPMatrix class:     277.69
Average time using the Matrixf class:      297.40
Average time using the ScalarDP class:     400.21
Average time using the DPMatrix class:    1418.11
Average time using the Matrixd class:      471.69

These are the results for postMult, where Matrixf used to be the same as Matrixd. 
The 277.69 is what Matrixf would have achieved if it were aligned.


Average time using the D3DXMATRIX class:  1035.63
Average time using the SPMatrix class:     365.36
Average time using the Matrixf class:      706.09
Average time using the ScalarDP class:     664.13
Average time using the DPMatrix class:    2052.29
Average time using the Matrixd class:     2125.93

These are the results for Invert4x4, where Matrixf also used to be the same as 
Matrixd, and the 365 is what it would have been if the data were aligned.


This stress code is part of matlib2, with a little tweaking to add the OSG 
code into the mix.









James Killian

Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

Thanks for the reply.  We could resolve this argument if any one of the "low
level masters" cares to email me offline [EMAIL PROTECTED], but I'd
be open to believing that a case could be made in the context of moving
around large amounts of data.

As for moving data, SSE/SSE2 is really better suited to code that
requires a lot of math, like 3D computations.  Perhaps the heart of SSE is
the packed multiply and add, where it can do 4 multiplies and 4 adds in
one clock cycle (or a half cycle if paired properly).  Thus, code that
requires heavy math, like many of the OSG matrix computations, could really
benefit from it.  I would profile cases like this against hand-written
assembly, since this is what OSG would care about.
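
For readers who have not seen the packed multiply and add in intrinsic form,
here is a minimal sketch of a 4x4 float multiply. It is not the code from my
submission: Mat4 is a stand-in type, the SKETCH_HAS_SSE macro is invented for
this example, and the scalar fallback is only there so the sketch also builds
on targets without SSE.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// 16-byte aligned, row-major 4x4 float matrix (a stand-in, not osg::Matrixf).
struct alignas(16) Mat4
{
    float m[4][4];
};

// Reference scalar product r = a * b (r must not alias a or b).
void multScalar(Mat4& r, const Mat4& a, const Mat4& b)
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
        {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a.m[i][k] * b.m[k][j];
            r.m[i][j] = s;
        }
}

// Same product with SSE: each output row is a linear combination of the
// four rows of b, so one row costs 4 packed multiplies and 3 packed adds.
void multSSE(Mat4& r, const Mat4& a, const Mat4& b)
{
#if SKETCH_HAS_SSE
    const __m128 b0 = _mm_load_ps(b.m[0]);  // aligned loads: needs alignas(16)
    const __m128 b1 = _mm_load_ps(b.m[1]);
    const __m128 b2 = _mm_load_ps(b.m[2]);
    const __m128 b3 = _mm_load_ps(b.m[3]);
    for (int i = 0; i < 4; ++i)
    {
        __m128 row = _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_store_ps(r.m[i], row);
    }
#else
    multScalar(r, a, b);  // no SSE on this target
#endif
}

// Self-check with small integer values, so both versions are exact.
void checkMult()
{
    Mat4 a, b, expected, actual;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
        {
            a.m[i][j] = static_cast<float>(i + 2 * j);
            b.m[i][j] = static_cast<float>(3 * i - j);
        }
    multScalar(expected, a, b);
    multSSE(actual, a, b);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            assert(actual.m[i][j] == expected.m[i][j]);
}
```

Note that the aligned loads and stores are exactly why the alignment of the
matrix type matters: with unaligned data one would have to fall back to
_mm_loadu_ps and _mm_storeu_ps.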

I looked at the assembly code produced by VS 9 for the optimized matrixf
multiply, and I could not have scheduled it better myself by hand.



Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Gordon Tomlinson
Hi,

I can only go by our low level masters, and their profiling shows that the
hand-rolled asm'ed SSE code is significantly faster than MS VS compiled code.

Obviously this is our experience in our environments, where we are
computationally heavy and are moving and editing terabytes of data around in
real time.



Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian
"
Ok, you can do here much for the collision detection.
I expect that you should optimize that algorithmically and gain magnitudes
without sse.
"

You are probably right, but Rick is the OSG guru in the team, and I do not
understand osg that well yet as most of my time has been spent developing
other aspects of the game.  My strength lies with general purpose optimizing
code, and so this helps our game tremendously.  The least I can do is
contribute this to the community and hope others get a boost as well.

"
So the question is more if such optimizations will bring performance
improvements for the usual scenegraph case.
"

When I get home tonight I'll disable the collision detection and run the
profiler again, and post the results, hopefully that data should answer this
question.




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

Sorry...

I interpreted Gordon's response as follows:
MS does a poor job (insert here: of compiling SSE intrinsics), and as a result
most of his SSE is asm'ed.
The asm'ed approach is for when you don't trust the compiler to do the right
thing with intrinsics, where it has the flexibility of scheduling and
assigning registers etc.

I disagree with "MS does a poor job compiling intrinsic code", and I think you
should not *ever need to resort to __asm anymore.
*This is not absolute: there was once a rare case where we found a strange
anomaly, but we later solved it with an unintuitive C code change.

> Do you find that MS compilers will produce SSE vectorised code _without_
> use of intrinsics or raw __asm?
Ah, this is a tricky question.  There is in fact an option in the VS 8 and VS 9
project settings to generate SSE or SSE2 code.  What this does is
evaluate C code and try to use SSE for it.  I was surprised to find
that this actually lowered the performance of C code, especially the C code for
Matrixf.  I'm so glad that the project settings for OSG do not turn this on,
and I'd recommend not using it, but instead writing intrinsics ourselves for
the places that need them.

I hope this clears things up.




Re: [osg-users] Using SSE within OSG

2008-07-29 Thread David Spilling
James,


> I have to disagree, using VS 7 and up to VS 9.


Just to clarify - what are you disagreeing with? Do you find that MS
compilers will produce SSE vectorised code _without_ use of intrinsics or
raw __asm?

David


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Mathias Fröhlich

James,

On Tuesday 29 July 2008 16:59, James Killian wrote:
> Paul asked me the same question a few days ago, and I just realized that we
> took that offline so I'll repost here:
> One of the things I should add is the actual profile dump, since that shows
> a more comprehensive picture.  The actual game demo is free to download and
> play here:
> http://www.fringe-online.com/
>
> The current installer of the game does not have my optimization in it yet,
> but it should be noted even with the optimization the postmult is still at
> the top.  The Invert4x4() however got pushed way down to the bottom (which
> is great).  I'll post my profiles when I get home.
>
>
> -snip-
> That is a good question, and I believe the answer is collision detection. 
> I should disable it and run the numbers again to confirm.  All ships fire
> machine guns at a fast rate, and each bullet that gets close enough to a
> bounding box/sphere region has to go through the osg code to get the
> precise point where it hit.  Rick would probably have a better explanation
> of this and other factors since he coded the bulk of the collision
> detection (and osg integration).  Most of my time development in the game
> has been spent on the physics and flight dynamics (and now optimization).
>
> It may turn out that we could find some caching technique to reduce the
> collision stress (like the KBDtree), but in the mean time, matrix
> optimizations can benefit the whole community if we do them right, and I
> would like to make some contribution to the community.

OK, you can do a lot here for the collision detection.
I expect that you should optimize that algorithmically and gain orders of 
magnitude without SSE.

So the question is rather whether such optimizations will bring performance 
improvements for the usual scenegraph case.

Greetings

Mathias



Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

I have to disagree, having used VS 7 up through VS 9.  It has done a terrific
job with the instruction scheduling.  We used to use that asm technique back
when P3s and MMX were around and we had VS 6.  We had one engineer who would
use DOS and MASM.  Times have changed (we had to let him go); intrinsics
have proved to optimize quite well, as we use the AMD code analyzer to
confirm that the U and V pipes remain full thanks to well scheduled placement
of the instructions.

I should add that we avoid using any MMX instructions like the plague
nowadays.

- Original Message - 
From: "Gordon Tomlinson" <[EMAIL PROTECTED]>
To: "'OpenSceneGraph Users'" 
Sent: Tuesday, July 29, 2008 8:56 AM
Subject: Re: [osg-users] Using SSE within OSG


> MS does a very poor job,
>
> I know most of our SSE is asm'ed
>
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of David Spilling
> Sent: Tuesday, July 29, 2008 9:11 AM
> To: OpenSceneGraph Users
> Subject: Re: [osg-users] Using SSE within OSG
>
> Benjamin,
>
> > may I suggest that you check the assembler code that the compilers
> > create when compiling the OSG code?
> >
> > ... g++ with -march=core2 -O3 (see man page for description
> > of parameters) the compiler automatically uses SSE
>
> I don't have much recent Linux/gcc experience, but can certainly attest
> that the MS compilers don't do a good job of spotting SSE vectorisation
> possibilities, even when you tell them to optimise with them (and this is
> from reading the generated assembler). In MS you can insert SSE intrinsics,
> which still allow the compiler to optimise the execution order and
> memory/register usage, e.g. based on cycle counts.
>
> I understand (from other sources) that the Intel vectorising compilers are
> much better at this, naturally.
>
> Perhaps this is then all only an MS/Windows thing?
>
> David








Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Sebastian Messerschmidt

Hi All,

Regarding question 2:
Wouldn't it be possible to dynamically link different versions of the 
OSG DLLs?
There would then be two versions of the DLLs, one with the 
SSE optimizations and one with the straightforward code.
Some years ago I saw examples of games that loaded different 
versions of their DLLs depending on the machine the program was run on.
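
A rough sketch of how such a choice could be made at start-up, assuming
GCC or Clang on x86 (which provide __builtin_cpu_supports). The library
names and both function names are hypothetical and only serve the example:

```cpp
#include <string>

// Returns true when the running CPU reports SSE2 support.
// Assumption: GCC/Clang on x86 offer __builtin_cpu_supports; other
// toolchains fall back to a compile-time answer.
inline bool cpuHasSSE2()
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#elif defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return true;   // built for a target that already requires SSE2
#else
    return false;  // unknown platform: stay on the plain code path
#endif
}

// Picks which build of the maths library to load.
// The names are hypothetical, for illustration only.
inline std::string chooseMathLibrary(bool hasSSE2)
{
    return hasSSE2 ? "osgmath_sse2" : "osgmath_plain";
}
```

The application would then hand chooseMathLibrary(cpuHasSSE2()) to the
platform loader (LoadLibrary on Windows, dlopen on POSIX) before any of
the maths routines are called.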


cheers
Sebastian

Dear All,

There's a discussion going on at the moment over in osg-submissions, 
and it has been raised that this ought to be opened up to the 
non-submissions community for feedback. Note that the following is my 
reading of the issues, and certainly doesn't represent the consensus 
view of the osg-submissions crowd, so feel free to challenge what I'm 
saying!


*Background*
Several people already use SSE instructions 
(http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) alongside OSG 
to obtain speed improvements through parallelising math operations. 
The general point that has been raised is that under-the-hood, OSG 
does quite a lot that could benefit from the potential performance 
boost given by SSE operations. Obvious targets include some of the 
Vec/Matrix routines, for example. SSE is now sufficiently mainstream 
that the risk of processor incompatibility is felt to be low.


*Question 1 : Where could the core OSG include SSE?*
Most people follow the sensible approach of profiling to determine 
their bottlenecks, and then optimising particular methods in order to 
gain speed-up. This would be a sensible approach to follow, as SSEing 
all methods would probably be a waste of effort.  It would therefore 
be instructive firstly to know if anybody is using SSE with OSG, and 
where. Secondly, for those who have profiling data and know how much 
time they spend in Vec/Matrix/whatever methods, it would be useful to 
know which methods the community considered good targets for SSEing. 
Any other maths "heavy lifting" going on? (e.g. Intersection testing? 
Delaunay triangulation? etc.)


*Question 2 : How could the core OSG include SSE?*
SSE code benefits from aligned data.  Hence there are several ways in 
which OSG could include SSE:


a) Provide an aligned Vec4f and aligned Matrix4f class, which support 
SSE operations. This would appear (to me) to be the least intrusive.


b) Provide branching code within the existing Vec4/Matrix4 methods for 
detecting whether data is aligned, and performing the correct 
operations. This would appear to me to be the most user-transparent. 
Although it would appear to be a performance hit, testing so far on 
some specific code would support the argument that the speed gains 
from SSE outweigh the branch cost; more testing needed, I guess.


c) Robert suggested that SSE enabled array operators (e.g. providing a 
cross-product operator for Vec3Array) might be appropriate and provide 
the best speed improvement for those who want it. Certainly using SSE 
on large array type data sets is where one gains the most performance 
improvement.


This question includes the possibility of linking out to, or pulling 
source code out of, an external optimised math library.


Any other suggestions?

*Question 3 : (possibly the biggest) Should the core OSG include SSE?*
There are several downsides to including SSE. Firstly, x-platform 
provision of SSE may be tricky due to the way different compilers 
define aligned data, and how SSE instructions are used within the 
code. I personally don't have much experience here, so any feedback on 
x-platform issues is useful.


Secondly, the code readability drops, and the "use the source" 
argument may be trickier when many might not know much SSE.



So - your opinion, experience and suggestions welcome!

David

___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

Paul asked me the same question a few days ago, and I just realized that we
took that offline so I'll repost here:
One of the things I should add is the actual profile dump, since that shows
a more comprehensive picture.  The actual game demo is free to download and
play here:
http://www.fringe-online.com/

The current installer of the game does not have my optimization in it yet,
but it should be noted even with the optimization the postmult is still at
the top.  The Invert4x4() however got pushed way down to the bottom (which
is great).  I'll post my profiles when I get home.


-snip---
---
That is a good question, and I believe the answer is collision detection.  I
should disable it and run the numbers again to confirm.  All ships fire
machine guns at a fast rate, and each bullet that gets close enough to a
bounding box/sphere region has to go through the osg code to get the precise
point where it hit.  Rick would probably have a better explanation of this
and other factors since he coded the bulk of the collision detection (and
osg integration).  Most of my development time on the game has been spent on
the physics and flight dynamics (and now optimization).

It may turn out that we could find some caching technique to reduce the
collision stress (like the KBDtree), but in the meantime, matrix
optimizations can benefit the whole community if we do them right, and I
would like to make some contribution to the community.


- Original Message - 
From: "Paul Melis" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, July 28, 2008 9:05 AM
Subject: [Fwd: Re: [osg-users] [osg-submissions] Matrixf multiply
Optimization]


> Hi James,
>
> I noted your posts on the osg-users list on the Matrix multiplication
> optimizations using SSE.
> You mention "Our game uses approximately 25% of all processing to these
> functions [...]". What on earth takes up so much matrix computing time
> in your game?
>
> Regards,
> Paul
>
-snip---
---

- Original Message - 
From: "Mathias Fröhlich" <[EMAIL PROTECTED]>
To: "OpenSceneGraph Users" 
Sent: Tuesday, July 29, 2008 9:31 AM
Subject: Re: [osg-users] Using SSE within OSG



Hi,

On Tuesday 29 July 2008 15:18, James Killian wrote:
> I 100% agree with that as that is the first thing I did.  For the matrixf
> mult I got 50% improvement with aligned data and 35% with unaligned.  For
> the Invert4x4 I got 80% improvement with aligned and 70% with
> unaligned.  I've submitted this code in as it was the most time spent in
> the profiles of our game.
I wonder what your scenegraph looks like.
Why do you have that many matrix operations?
Where are they called from?
Why do you need that many inverted matrices?

Also the invert method makes me wonder. As far as I can tell, you do not
need inverted matrices to do cull and draw. At least not in a magnitude that
makes that method appear in profiles.

Do you compute intersection tests where you need that inverse?
And what kind of matrices are in your code that you really need the full 4x4
inverse? Almost always the cheaper 3x4 variant can be used for usual
transforms.

Well, I ask that because I get the impression that the real bottleneck -
where you can gain much performance - is somewhere different.

Greetings

Mathias

-- 
Dr. Mathias Fröhlich, science + computing ag, Software Solutions
Hagellocher Weg 71-75, D-72070 Tuebingen, Germany
Phone: +49 7071 9457-268, Fax: +49 7071 9457-511
-- 
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Florian Geyer,
Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Prof. Dr. Hanns Ruder
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196


___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org



Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Mathias Fröhlich

Hi,

On Tuesday 29 July 2008 15:18, James Killian wrote:
> I 100% agree with that as that is the first thing I did.  For the matrixf
> mult I got 50% improvement with aligned data and 35% with unaligned.  For
> the Invert4x4 I got 80% improvement with aligned and 70% with
> unaligned.  I've submitted this code in as it was the most time spent in
> the profiles of our game.
I wonder what your scenegraph looks like.
Why do you have that many matrix operations?
Where are they called from?
Why do you need that many inverted matrices?

Also the invert method makes me wonder. As far as I can tell, you do not need 
inverted matrices to do cull and draw. At least not in a magnitude that makes 
that method appear in profiles.

Do you compute intersection tests where you need that inverse?
And what kind of matrices are in your code that you really need the full 4x4 
inverse? Almost always the cheaper 3x4 variant can be used for usual 
transforms.
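
To make the "cheaper variant" concrete, here is a minimal sketch of an
affine-only inverse. It assumes a row-major matrix with the row-vector
convention (v' = v * M), so the translation sits in the last row and the
last column is (0,0,0,1). Mat4, invertAffine and multiply are names made
up for this example and are not OSG's API:

```cpp
#include <cmath>

// Plain 4x4 float matrix, row-major. Assumption: row-vector convention
// (v' = v * M), translation in row 3, last column equal to (0,0,0,1).
struct Mat4 { float m[4][4]; };

// Inverts an affine matrix using only its 3x3 block and translation row.
// Returns false if the 3x3 block is (near) singular. out must not alias a.
inline bool invertAffine(const Mat4& a, Mat4& out)
{
    const float (*m)[4] = a.m;
    // Cofactors of the first row of the 3x3 block, reused for the determinant.
    const float c00 = m[1][1]*m[2][2] - m[1][2]*m[2][1];
    const float c01 = m[1][2]*m[2][0] - m[1][0]*m[2][2];
    const float c02 = m[1][0]*m[2][1] - m[1][1]*m[2][0];
    const float det = m[0][0]*c00 + m[0][1]*c01 + m[0][2]*c02;
    if (std::fabs(det) < 1e-12f) return false;
    const float inv = 1.0f / det;

    // Inverse of the 3x3 block = transposed cofactor matrix / determinant.
    out.m[0][0] = c00*inv;
    out.m[1][0] = c01*inv;
    out.m[2][0] = c02*inv;
    out.m[0][1] = (m[0][2]*m[2][1] - m[0][1]*m[2][2])*inv;
    out.m[1][1] = (m[0][0]*m[2][2] - m[0][2]*m[2][0])*inv;
    out.m[2][1] = (m[0][1]*m[2][0] - m[0][0]*m[2][1])*inv;
    out.m[0][2] = (m[0][1]*m[1][2] - m[0][2]*m[1][1])*inv;
    out.m[1][2] = (m[0][2]*m[1][0] - m[0][0]*m[1][2])*inv;
    out.m[2][2] = (m[0][0]*m[1][1] - m[0][1]*m[1][0])*inv;

    // New translation row is -t * inverse3x3; last column stays (0,0,0,1).
    for (int j = 0; j < 3; ++j)
    {
        out.m[j][3] = 0.0f;
        out.m[3][j] = -(m[3][0]*out.m[0][j] + m[3][1]*out.m[1][j]
                        + m[3][2]*out.m[2][j]);
    }
    out.m[3][3] = 1.0f;
    return true;
}

// Reference product p = a * b, used only to check the result.
inline Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 p;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
        {
            p.m[i][j] = 0.0f;
            for (int k = 0; k < 4; ++k)
                p.m[i][j] += a.m[i][k] * b.m[k][j];
        }
    return p;
}
```

This only applies when the matrix really is affine; a projection matrix
still needs the general 4x4 inverse.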

Well, I ask that because I get the impression that the real bottleneck - where 
you can gain much performance - is somewhere different.

Greetings

Mathias

-- 
Dr. Mathias Fröhlich, science + computing ag, Software Solutions
Hagellocher Weg 71-75, D-72070 Tuebingen, Germany
Phone: +49 7071 9457-268, Fax: +49 7071 9457-511
-- 
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Florian Geyer,
Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Prof. Dr. Hanns Ruder
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196 


___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Gordon Tomlinson
MS does a very poor job, 
 
I know most of our SSE is asm'ed 
 
 

  _  

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of David
Spilling
Sent: Tuesday, July 29, 2008 9:11 AM
To: OpenSceneGraph Users
Subject: Re: [osg-users] Using SSE within OSG


Benjamin,




may I suggest that you check the assembler code that the compilers create
when compiling the OSG code?

 

... g++ with -march=core2 -O3 (see man page for description
of parameters) the compiler automatically uses SSE


I don't have much recent Linux/gcc experience, but can certainly attest that
the MS compilers don't do a good job of spotting SSE vectorisation
possibilities, even when you tell them to optimise with them (and this is
from reading the generated assembler). In MS you can insert SSE intrinsics,
which still allow the compiler to optimise the execution order and
memory/register usage e.g. based on cycle counts.

I understand (from other sources) that the Intel vectorising compilers are
much better at this, naturally.

Perhaps this is then all only a MS/Windows thing?

David




___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread David Spilling
Benjamin,

> And please do not get me wrong. I do not want to stop your efforts to
> improve the performance of OSG; far from it!


Not necessarily my efforts - I'm just being the messenger...!

> But putting assembler code into the
> project decreases the readability and serviceability of the code.


Absolutely.


> Furthermore
> it might be that it does not improve the speed at all.


I agree, and this is an oft quoted issue. Here, I think, only testing (and
experience) will help. For example, is it worth performing a single Vec3f
cross product in SSE? Probably not. But as a counter example, over on
osg-submissions (EDIT - and now here), one user (James) is getting large
performance gains from having SSE'd the invert_4x4 function.

> I just want to suggest
> that you try to exhaust the possibility of modern compilers as much as
> possible. If you see any bottlenecks after that, it might make sense to
> include manual performance tuning.


I agree. This call-for-ideas was motivated by an understanding that several
people are pushing in the same direction, and it would be perhaps beneficial
to make use of this push.

David
___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread James Killian

> I heard that the Intel C++ compiler is able to optimize even better.
> Furthermore the use of profiling first is a good approach. Maybe it would
> be reasonable to compare profiling data of the Math/Vector/Matrix classes
> with and without compiler optimizations and see if some bottlenecks
> disappear when using the optimizations.


I 100% agree with that as that is the first thing I did.  For the matrixf 
mult I got 50% improvement with aligned data and 35% with unaligned.  For 
the Invert4x4 I got 80% improvement with aligned and 70% with 
unaligned.  I've submitted this code as these functions took the most time 
in the profiles of our game.


While I am here, I think whatever we do we should have CMake provide an 
option to compile using SSE, and provide alternative C code for those who do 
not want it.  One of the techniques we use at work dates from the time when 
SSE2 was only available on some machines: we wrote a main loop to do the 
bulk of the work with SSE2 and a remainder loop to finish the work in C 
code.  We could then macro out the main loop for those who didn't have SSE2, 
so that everything fell to the remainder code, which then did the entire 
loop.  I believe the time has passed to make the SSE and SSE2 distinction, 
so either someone can support SSE2, or they use the C code version.  It 
should be implied that people who write SSE/SSE2 code have tested it against 
the C code and have seen a significant gain in performance before 
considering its use.
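
The main-loop / remainder-loop idea can be sketched roughly as follows.
This is not our production code: SKETCH_USE_SSE is a hypothetical switch
(a CMake option could define it instead of the compiler test used here),
scaleArray is a made-up example routine, and it only uses SSE1 intrinsics
with unaligned loads to stay short:

```cpp
#include <cstddef>

// Assumption: SKETCH_USE_SSE is a hypothetical switch. Here it is derived
// from compiler macros; a build system option could set it instead.
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define SKETCH_USE_SSE 1
#include <xmmintrin.h>
#endif

// Multiplies every element of data[0..count) by factor.
inline void scaleArray(float* data, std::size_t count, float factor)
{
    std::size_t i = 0;
#ifdef SKETCH_USE_SSE
    // Main loop: four floats per iteration.
    const __m128 f = _mm_set1_ps(factor);
    for (; i + 4 <= count; i += 4)
    {
        const __m128 v = _mm_loadu_ps(data + i);   // unaligned load
        _mm_storeu_ps(data + i, _mm_mul_ps(v, f)); // unaligned store
    }
#endif
    // Remainder loop: finishes the tail, or does all of the work when the
    // main loop has been compiled out.
    for (; i < count; ++i)
        data[i] *= factor;
}
```

With the switch off, the same function still gives the same results, which
also makes it easy to test the SSE path against the C path.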





James Killian
- Original Message - 
From: "Benjamin Eikel" <[EMAIL PROTECTED]>

To: "OpenSceneGraph Users" 
Sent: Tuesday, July 29, 2008 7:28 AM
Subject: Re: [osg-users] Using SSE within OSG



Am Dienstag, 29. Juli 2008 14:04:59 schrieb David Spilling:

Dear All,

[...]

Any other suggestions?

*Question 3 : (possibly the biggest) Should the core OSG include SSE?*
There are several downsides to including SSE. Firstly, x-platform provision
of SSE may be tricky due to the way different compilers define aligned
data, and how SSE instructions are used within the code. I personally don't
have much experience here, so any feedback on x-platform issues is useful.

Secondly, the code readability drops, and the "use the source" argument may
be trickier when many might not know much SSE.

Hello David,

may I suggest that you check the assembler code that the compilers create
when compiling the OSG code? I have not done it for the OSG code, but for
another project I have done some time ago. There I tried to optimize the
performance for composing depth-buffer attached images for sort-last
rendering. Somehow I was not able to be much better than the compiler was.
In some rare cases my procedures were faster, but most of the time the
compiler was the winner. The code created by the compilers considers so many
things - e. g. branch prediction by the processor, code reordering - that
it is quite hard for a human programmer to beat them.
For example if you use g++ with -march=core2 -O3 (see man page for
description of parameters) the compiler automatically uses SSE or even
SSE2, 3dNOW!, etc. instructions. In some cases the compiler generates much
better assembler code than a normal programmer would do. There are some
cases, though, where manual improvements could yield better results.
I heard that the Intel C++ compiler is able to optimize even better.
Furthermore the use of profiling first is a good approach. Maybe it would
be reasonable to compare profiling data of the Math/Vector/Matrix classes
with and without compiler optimizations and see if some bottlenecks
disappear when using the optimizations.

Regards,
Benjamin



So - your opinion, experience and suggestions welcome!

David



___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread David Spilling
Benjamin,


> may I suggest that you check the assembler code that the compilers create
> when compiling the OSG code?



> ... g++ with -march=core2 -O3 (see man page for description
> of parameters) the compiler automatically uses SSE


I don't have much recent Linux/gcc experience, but can certainly attest that
the MS compilers don't do a good job of spotting SSE vectorisation
possibilities, even when you tell them to optimise with them (and this is
from reading the generated assembler). In MS you can insert SSE intrinsics,
which still allow the compiler to optimise the execution order and
memory/register usage e.g. based on cycle counts.
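
For anyone who has not seen intrinsics, a small sketch of what this looks
like, with a plain C++ fallback for builds without SSE. It assumes a
row-major 4x4 matrix stored as 16 floats and a row vector (out = v * m);
transformRowVector and SKETCH_HAVE_SSE are hypothetical names, not
anything that exists in OSG:

```cpp
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define SKETCH_HAVE_SSE 1
#include <xmmintrin.h>
#endif

// out = v * m, with v a row vector and m a row-major 4x4 matrix.
// out must not overlap v or m.
inline void transformRowVector(const float v[4], const float m[16],
                               float out[4])
{
#ifdef SKETCH_HAVE_SSE
    // Each matrix row is scaled by one component of v, then the four
    // scaled rows are summed. The compiler schedules these operations.
    __m128 r = _mm_mul_ps(_mm_set1_ps(v[0]), _mm_loadu_ps(m));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[1]), _mm_loadu_ps(m + 4)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[2]), _mm_loadu_ps(m + 8)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[3]), _mm_loadu_ps(m + 12)));
    _mm_storeu_ps(out, r);
#else
    // Plain C++ fallback with the same result.
    for (int c = 0; c < 4; ++c)
        out[c] = v[0]*m[c] + v[1]*m[4 + c] + v[2]*m[8 + c] + v[3]*m[12 + c];
#endif
}
```

Unlike inline assembler, the intrinsics leave register allocation and
instruction ordering to the compiler.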

I understand (from other sources) that the Intel vectorising compilers are
much better at this, naturally.

Perhaps this is then all only a MS/Windows thing?

David
___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Benjamin Eikel
Am Dienstag, 29. Juli 2008 14:28:18 schrieb Benjamin Eikel:
> Am Dienstag, 29. Juli 2008 14:04:59 schrieb David Spilling:
> > Dear All,
>
> [...]
>
> > Any other suggestions?
> >
> > *Question 3 : (possibly the biggest) Should the core OSG include SSE?*
> > There are several downsides to including SSE. Firstly, x-platform
> > provision of SSE may be tricky due to the way different compilers define
> > aligned data, and how SSE instructions are used within the code. I
> > personally don't have much experience here, so any feedback on x-platform
> > issues is useful.
> >
> > Secondly, the code readability drops, and the "use the source" argument
> > may be trickier when many might not know much SSE.
>
> Hello David,
>
> may I suggest that you check the assembler code that the compilers create
> when compiling the OSG code? I have not done it for the OSG code, but for
> another project I have done some time ago. There I tried to optimize the
> performance for composing depth-buffer attached images for sort-last
> rendering. Somehow I was not able to be much better than the compiler was.
> In some rare cases my procedures were faster, but most of the time the
> compiler was the winner. The code created by the compilers considers so many
> things - e. g. branch prediction by the processor, code reordering - that
> it is quite hard for a human programmer to beat them.
> For example if you use g++ with -march=core2 -O3 (see man page for
> description of parameters) the compiler automatically uses SSE or even
> SSE2, 3dNOW!, etc. instructions. In some cases the compiler generates much
> better assembler code than a normal programmer would do. There are some
> cases, though, where manual improvements could yield better results.
> I heard that the Intel C++ compiler is able to optimize even better.
> Furthermore the use of profiling first is a good approach. Maybe it would
> be reasonable to compare profiling data of the Math/Vector/Matrix classes
> with and without compiler optimizations and see if some bottlenecks
> disappear when using the optimizations.
>
> Regards,
> Benjamin
Hello,

I have an addition:
With gcc/g++ you can use profiling (option -fprofile-generate) to help the 
compiler to do better optimizations (option -fprofile-use, e. g. loop 
unrolling). Maybe this can improve the performance further.
If you want performance and are willing to sacrifice safety and precision for 
it, you may even think about -ffast-math (may be dangerous).
The options are explained on the gcc/g++ man page or in the online manual [1].
There may be similar options for other compilers.
And please do not get me wrong. I do not want to stop your efforts to improve 
the performance of OSG; far from it! But putting assembler code into the 
project decreases the readability and serviceability of the code. Furthermore 
it might be that it does not improve the speed at all. I just want to suggest 
that you try to exhaust the possibility of modern compilers as much as 
possible. If you see any bottlenecks after that, it might make sense to 
include manual performance tuning.

Regards,
Benjamin

[1] 
http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Optimize-Options.html#Optimize-Options

>
> > So - your opinion, experience and suggestions welcome!
> >
> > David


___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Benjamin Eikel
Am Dienstag, 29. Juli 2008 14:04:59 schrieb David Spilling:
> Dear All,
[...]
> Any other suggestions?
>
> *Question 3 : (possibly the biggest) Should the core OSG include SSE?*
> There are several downsides to including SSE. Firstly, x-platform provision
> of SSE may be tricky due to the way different compilers define aligned
> data, and how SSE instructions are used within the code. I personally don't
> have much experience here, so any feedback on x-platform issues is useful.
>
> Secondly, the code readability drops, and the "use the source" argument may
> be trickier when many might not know much SSE.
Hello David,

may I suggest that you check the assembler code that the compilers create when 
compiling the OSG code? I have not done it for the OSG code, but for another 
project I have done some time ago. There I tried to optimize the performance 
for composing depth-buffer attached images for sort-last rendering. Somehow I 
was not able to be much better than the compiler was. In some rare cases my 
procedures were faster, but most of the time the compiler was the winner. The 
code created by the compilers considers so many things - e. g. branch 
prediction by the processor, code reordering - that it is quite hard for a 
human programmer to beat them.
For example if you use g++ with -march=core2 -O3 (see man page for description 
of parameters) the compiler automatically uses SSE or even SSE2, 3dNOW!, etc. 
instructions. In some cases the compiler generates much better assembler code 
than a normal programmer would do. There are some cases, though, where manual 
improvements could yield better results.
I heard that the Intel C++ compiler is able to optimize even better.
Furthermore the use of profiling first is a good approach. Maybe it would be 
reasonable to compare profiling data of the Math/Vector/Matrix classes with 
and without compiler optimizations and see if some bottlenecks disappear when 
using the optimizations.
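
As a small illustration of the kind of loop meant here, the following is a
sketch only. addArrays and the file name example.cpp are made up, and
whether a given compiler version actually vectorises the loop has to be
checked in the generated assembler:

```cpp
#include <cstddef>

// out[i] = a[i] + b[i] for i in [0, n).
// __restrict (a compiler extension accepted by GCC, Clang and MSVC)
// promises that the arrays do not overlap, which removes one common
// obstacle to automatic vectorisation.
void addArrays(const float* __restrict a,
               const float* __restrict b,
               float* __restrict out,
               std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// To inspect what g++ makes of it, something like
//   g++ -O3 -march=core2 -S example.cpp
// writes the assembler to example.s. Packed instructions such as addps
// in the loop body indicate that it was vectorised.
```

Comparing that output with and without -O3 shows quickly whether hand-written
SSE would add anything for a given routine.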

Regards,
Benjamin
>
>
> So - your opinion, experience and suggestions welcome!
>
> David


___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


Re: [osg-users] Using SSE within OSG

2008-07-29 Thread Gordon Tomlinson
Hi David
 
My company makes very heavy use of SSE in our main products, and there are
vast speed improvements to be gained. Sadly, I don't have permission to
provide profiling data.
 
We use SSE for very heavy matrix work outside of OSG, and we have also
added some to our OSG/OGL apps, for example for normal generation, fast
square root routines and texture operations. The clock cycles saved can
mount up quickly.
 
I would say adding SSE operations in the right places would be highly
beneficial for the OSG core in performance gains.
 
 
  _  

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of David
Spilling
Sent: Tuesday, July 29, 2008 8:05 AM
To: OpenSceneGraph Users
Subject: [osg-users] Using SSE within OSG


Dear All,

There's a discussion going on at the moment over in osg-submissions, and it
has been raised that this ought to be opened up to the non-submissions
community for feedback. Note that the following is my reading of the issues,
and certainly doesn't represent the consensus view of the osg-submissions
crowd, so feel free to challenge what I'm saying!

Background
Several people already use SSE instructions
(http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) alongside OSG to
obtain speed improvements through parallelising math operations. The general
point that has been raised is that under-the-hood, OSG does quite a lot that
could benefit from the potential performance boost given by SSE operations.
Obvious targets include some of the Vec/Matrix routines, for example. SSE is
now sufficiently mainstream that the risk of processor incompatibility is
felt to be low.

Question 1 : Where could the core OSG include SSE?
Most people follow the sensible approach of profiling to determine their
bottlenecks, and then optimising particular methods in order to gain
speed-up. This would be a sensible approach to follow, as SSEing all methods
would probably be a waste of effort.  It would therefore be instructive
firstly to know if anybody is using SSE with OSG, and where. Secondly, for
those who have profiling data and know how much time they spend in
Vec/Matrix/whatever methods, it would be useful to know which methods the
community considered good targets for SSEing. Any other maths "heavy
lifting" going on? (e.g. Intersection testing? Delaunay triangulation? etc.)

Question 2 : How could the core OSG include SSE?
SSE code benefits from aligned data.  Hence there are several ways in which
OSG could include SSE:

a) Provide an aligned Vec4f and aligned Matrix4f class, which support SSE
operations. This would appear (to me) to be the least intrusive.

b) Provide branching code within the existing Vec4/Matrix4 methods for
detecting whether data is aligned, and performing the correct operations.
This would appear to me to be the most user-transparent. Although it would
appear to be a performance hit, testing so far on some specific code would
support the argument that the speed gains from SSE outweigh the branch cost;
more testing needed, I guess.

c) Robert suggested that SSE enabled array operators (e.g. providing a
cross-product operator for Vec3Array) might be appropriate and provide the
best speed improvement for those who want it. Certainly using SSE on large
array type data sets is where one gains the most performance improvement.

This question includes the possibility of linking out to, or pulling source
code out of, an external optimised math library.

Any other suggestions?

Question 3 : (possibly the biggest) Should the core OSG include SSE?
There are several downsides to including SSE. Firstly, x-platform provision
of SSE may be tricky due to the way different compilers define aligned data,
and how SSE instructions are used within the code. I personally don't have
much experience here, so any feedback on x-platform issues is useful.

Secondly, the code readability drops, and the "use the source" argument may
be trickier when many might not know much SSE.


So - your opinion, experience and suggestions welcome!

David







___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org


[osg-users] Using SSE within OSG

2008-07-29 Thread David Spilling
Dear All,

There's a discussion going on at the moment over in osg-submissions, and it
has been raised that this ought to be opened up to the non-submissions
community for feedback. Note that the following is my reading of the issues,
and certainly doesn't represent the consensus view of the osg-submissions
crowd, so feel free to challenge what I'm saying!

*Background*
Several people already use SSE instructions (
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) alongside OSG to
obtain speed improvements through parallelising math operations. The general
point that has been raised is that under-the-hood, OSG does quite a lot that
could benefit from the potential performance boost given by SSE operations.
Obvious targets include some of the Vec/Matrix routines, for example. SSE is
now sufficiently mainstream that the risk of processor incompatibility is
felt to be low.

*Question 1 : Where could the core OSG include SSE?*
Most people follow the sensible approach of profiling to determine their
bottlenecks, and then optimising particular methods in order to gain
speed-up. This would be a sensible approach to follow, as SSEing all methods
would probably be a waste of effort.  It would therefore be instructive
firstly to know if anybody is using SSE with OSG, and where. Secondly, for
those who have profiling data and know how much time they spend in
Vec/Matrix/whatever methods, it would be useful to know which methods the
community considered good targets for SSEing. Any other maths "heavy
lifting" going on? (e.g. Intersection testing? Delaunay triangulation? etc.)

*Question 2 : How could the core OSG include SSE?*
SSE code benefits from aligned data.  Hence there are several ways in which
OSG could include SSE:

a) Provide an aligned Vec4f and aligned Matrix4f class, which support SSE
operations. This would appear (to me) to be the least intrusive.

b) Provide branching code within the existing Vec4/Matrix4 methods for
detecting whether data is aligned, and performing the correct operations.
This would appear to me to be the most user-transparent. Although it would
appear to be a performance hit, testing so far on some specific code would
support the argument that the speed gains from SSE outweigh the branch cost;
more testing needed, I guess.

c) Robert suggested that SSE enabled array operators (e.g. providing a
cross-product operator for Vec3Array) might be appropriate and provide the
best speed improvement for those who want it. Certainly using SSE on large
array type data sets is where one gains the most performance improvement.

This question includes the possibility of linking out to, or pulling source
code out of, an external optimised math library.
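
To make options (a) and (b) a little more concrete, here is a minimal
sketch. AlignedVec4f, isAligned16, add4 and SKETCH_SSE are hypothetical
names, not existing OSG types. alignas is the C++11 spelling; older
compilers would need __declspec(align(16)) or __attribute__((aligned(16)))
instead, which is part of the x-platform concern raised under Question 3:

```cpp
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define SKETCH_SSE 1
#include <xmmintrin.h>
#endif

// Option (a): a 16-byte aligned four-float type (hypothetical name).
struct alignas(16) AlignedVec4f { float v[4]; };

// True when p sits on a 16-byte boundary.
inline bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Option (b): branch on alignment at run time.
// out[i] = a[i] + b[i] for four floats.
inline void add4(const float* a, const float* b, float* out)
{
#ifdef SKETCH_SSE
    if (isAligned16(a) && isAligned16(b) && isAligned16(out))
    {
        // Aligned loads and stores are only valid on 16-byte boundaries.
        _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    }
    else
    {
        // Unaligned variants work for any address.
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Whether the branch in option (b) pays for itself is exactly the kind of
thing that would need measuring on real data.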

Any other suggestions?

*Question 3 : (possibly the biggest) Should the core OSG include SSE?*
There are several downsides to including SSE. Firstly, x-platform provision
of SSE may be tricky due to the way different compilers define aligned data,
and how SSE instructions are used within the code. I personally don't have
much experience here, so any feedback on x-platform issues is useful.

Secondly, the code readability drops, and the "use the source" argument may
be trickier when many might not know much SSE.


So - your opinion, experience and suggestions welcome!

David
___
osg-users mailing list
osg-users@lists.openscenegraph.org
http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org