Re: performance work

2008-12-31 Thread Neil Graham
On Tue, 2008-12-30 at 20:41 -0700, Jordan Crouse wrote:
  I'm curious as to why reads from video memory are so slow,  On standard
  video cards it's slow because there is quite a division between the CPU
  and the video memory,  but on the geode isn't the video memory shared in
  the same SDRAM as Main memory. 
 
 It is, in that they share the same physical RAM chips, but they are 
 controlled by different entities - one is managed by the system memory 
 controller and the other is handled by the GPU.   At start up time, the 
 memory is carved up by the firmware, and after the top of system RAM is 
 established, video and system memory behave for all intents and purposes 
 like separate components.  Put simply, there is no way to directly 
 address video memory from the system memory.  Access to the video memory 
 has to happen via PCI cycles, and for obvious reasons the active video 
 region has the cache disabled, accounting for relatively slow readback.

That makes my brain melt, you can't address it even though it's on the
same chip!?!  Even as far back as the PCjr the deal was that sharing
video memory cost some performance due to taking turns with cycles but
it gave some back with easy access to the memory for all.   Has the
geode cunningly managed to provide a system that combines all the
disadvantages of separate memory with all the disadvantages of shared?

One wonders what would happen if you wired some lines to the chips so
that the memory appeared in two places,  would you get access to the ram
(with the usual 'you pays your money, you takes your chances' caveats
about coherency)?

I'm not a hardware person, but that all just seems odd.

 That said, the read from memory performance is still worse than you
 might expect - I never really got a good answer from
 the silicon guys as to why. 
 
being hit with the full sdram latency every access maybe?

Is it feasible to try with caches enabled and require the software to
flush as needed?
 


___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: performance work

2008-12-31 Thread Jordan Crouse
Neil Graham wrote:
 On Tue, 2008-12-30 at 20:41 -0700, Jordan Crouse wrote:
 I'm curious as to why reads from video memory are so slow,  On standard
 video cards it's slow because there is quite a division between the CPU
 and the video memory,  but on the geode isn't the video memory shared in
 the same SDRAM as Main memory. 
 It is, in that they share the same physical RAM chips, but they are 
 controlled by different entities - one is managed by the system memory 
 controller and the other is handled by the GPU.   At start up time, the 
 memory is carved up by the firmware, and after the top of system RAM is 
 established, video and system memory behave for all intents and purposes 
 like separate components.  Put simply, there is no way to directly 
 address video memory from the system memory.  Access to the video memory 
 has to happen via PCI cycles, and for obvious reasons the active video 
 region has the cache disabled, accounting for relatively slow readback.
 
 That makes my brain melt, you can't address it even though it's on the
 same chip!?!  Even as far back as the PCjr the deal was that sharing
 video memory cost some performance due to taking turns with cycles but
 it gave some back with easy access to the memory for all.   Has the
 geode cunningly managed to provide a system that combines all the
 disadvantages of separate memory with all the disadvantages of shared?
 
 One wonders what would happen if you wired some lines to the chips so
 that the memory appeared in two places,  would you get access to the ram
 (with the usual 'you pays your money, you takes your chances' caveats
 about coherency)
 
 I'm not a hardware person, but that all just seems odd.

You are missing the point - this model wasn't designed so that the 
system could somehow sneakily address video memory, it was designed so 
that the system designer could eliminate the need for the added cost, 
expense and real estate for a separate bank of memory chips.  See also
http://en.wikipedia.org/wiki/Shared_Memory_Architecture.

 That said, the read from memory performance is still worse than you
 might expect - I never really got a good answer from
 the silicon guys as to why. 

 being hit with the full sdram latency every access maybe?
 
 Is it feasible to try with caches enabled and require the software to
 flush as needed.

Ask around - I don't think that you'll find anybody too keen on having 
the X server execute a cache invalidate a half dozen times a second.

Anyway, you are getting distracted and solving the wrong problem.  You 
should be more concerned about limiting the number of times that the X 
server reads from video memory rather than worrying about how fast the
read is.

If I can rant for a second (and this isn't targeted at Neil
specifically, but just in general), this is another in a list of
more or less hard constraints that the current XO design has.
Throughout the history of the project, it seems to me that developers
have been more biased toward trying to eliminate those constraints
rather than making the software work in spite of them.  The processor is
too slow - everybody immediately wants to overclock.  There is too
little memory - enter a few dozen schemes for compressing it or swapping it.

The XO platform has limitations, most of which were introduced by choice 
for power or cost reasons.  The limitations are clearly documented and 
were known by all, at least when the project started.  The understanding 
was that the software would have to be adjusted to fit the hardware, not 
the other way around.  Over time, we seem to have lost that understanding.

Software engineering is hard - software engineering for
resource-constrained systems is even harder.  In this day and age geeks
like us have been accustomed to always having the latest and greatest
hardware at our fingertips, and so the software that we write is also for
the latest and greatest.  And so, when confronted with a system such as the
XO, our first instinct is to just plop our software on it and watch it
go.  That attitude is further reinforced by the fact that the Geode is
x86 based - just like our desktops.  It should just work, right?  We
know better - or at least, we should know better.

The solution to the performance problems is good old-fashioned elbow
grease.  We have to take our software that is naturally biased toward
the year 2007 and make it work for the year 1995.  That's going to
involve fixing bugs in the drivers, but also re-thinking how the
software works - and finding situations where the software might be
inadvertently doing the wrong thing.  Let me give you an example - as
recently as X 1.5, operations involving an a8 alpha mask worked like this:

* Draw a 1x1 rectangle in video memory containing the source color for 
the operation
* Read the source color from video memory
* Perform the mask operation with the source color
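
The three steps above can be modelled with a toy sketch. This is not the X server's code: FakeVideoMemory, apply_mask and both mask functions are hypothetical names, and the model only counts how many readbacks each strategy issues, on the assumption that every read from uncached video memory is expensive.

```python
class FakeVideoMemory:
    """Hypothetical stand-in for uncached video memory that counts readbacks."""

    def __init__(self):
        self.pixels = {}
        self.reads = 0

    def write(self, xy, color):
        self.pixels[xy] = color

    def read(self, xy):
        self.reads += 1  # each read models one slow, uncached access
        return self.pixels[xy]


def apply_mask(color, mask):
    """Scale a solid color by each a8 mask value (0-255)."""
    return [color * alpha // 255 for alpha in mask]


def mask_via_readback(vram, color, mask):
    """The listed pattern: draw a 1x1 rectangle, then read the color back."""
    vram.write((0, 0), color)
    source = vram.read((0, 0))
    return apply_mask(source, mask)


def mask_without_readback(vram, color, mask):
    """Alternative: the caller already knows the solid color, so never read it."""
    return apply_mask(color, mask)


mask = [0, 128, 255]

vram = FakeVideoMemory()
for _ in range(100):
    slow = mask_via_readback(vram, 200, mask)
reads_with_readback = vram.reads

vram = FakeVideoMemory()
for _ in range(100):
    fast = mask_without_readback(vram, 200, mask)
reads_without_readback = vram.reads

print(reads_with_readback, reads_without_readback)
```

Both functions produce the same pixels; the only difference is the number of trips to video memory, which is the quantity worth minimizing here.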

This isn't smart for any kind of processor or GPU, running at 2 GHz or

Re: performance work

2008-12-31 Thread Michael Stone
On Wed, Dec 31, 2008 at 09:20:27AM -0700, Jordan Crouse wrote:
The solution to the performance problems is good old fashioned elbow
grease [...] These are the sorts of things that we need to find and
squash - and yes, it will be very time consuming and a little boring.

Several anecdotes for your amusement and reflection:

* When was the last time someone posted to devel asking: what is the
   right algorithm or datastructure for task ?

* When was the last time someone publicly analyzed the upper or lower
   bounds on the bandwidth, latency, or quantity of messages necessary to
   accomplish task ?

* When was the last time that you published a performance goal for your
   software? Did you hit it? Did anyone notice?

Michael

P.S. - Charles Leiserson once remarked that performance is like a
currency which programmers trade for (all) other worthwhile things like
schedule targets, scope of features, other resource consumption, various
kinds of security, etc [1]. This suggests that one would do better to
ask for performance or  but not both. Think of Blizzard.

[1] http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-one/


Re: performance work

2008-12-31 Thread Greg Smith
Hi All,

Great thread. I don't know the history but I completely agree with
Jordan. It takes a dedicated team of engineers at least two years of
software work to optimize for the available resources.

The main memory - video memory debate is age old. Until someone builds a 
better programming language and architecture for addressing the DCON 
frame buffer directly we need to optimize the architecture we have.

Moore's law is against us but we have 500,000 units in the field and can 
more than double that in 12 months (Moore be damned :-). Nail this 
problem quickly and we gain an industry-wide edge.

I collected related performance threads in the specification section here:
http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness

Did I miss instructions on how to determine which Cairo benchmarks are 
being called most often by sugar?

Can someone report how often the top 10 offenders below are called by 
using Sugar:
http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness#Test_data_comparison

Ask if it's not clear. First steps may be documented here:
http://wiki.laptop.org/go/Performance_tuning#Other

Our development bottleneck could be X-Windows (and Cairo) people. Can 
someone send an e-mail to the right list and ask for help?

Jordan told us which X functions he thinks will pay off. See 
http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness#X_optimization_suggestions

That's not asking for new functions, just calling well-known ones. I'm
optimistic compositing hooks will be a huge win.

Thanks,

Greg S

Re: performance work

2008-12-31 Thread Wade Brainerd
I agree with Jordan.  You just have to sit down and do the work to optimize
the code, finding the fastest path through the hardware and software
stack.
I've rewritten Bounce twice now for performance just to hold on to 20fps on
the XO.  Colors! has been through many performance iterations as well
(compare v1 and v13 with large brushes).  I've just had my hat handed to me
by Cairo for Typing Turtle as well (with the hand display enabled, you can
type about 1WPM).  So I'm looking forward to rewriting my keyboard rendering
to deal with that.

If you have an issue with the performance of the XO, just spend the time by
yourself to analyze it and fix it; talking about it accomplishes nothing.
If you find a solution that would help others, post it.

-Wade


Re: performance work

2008-12-31 Thread Greg Smith
Hi All,

Answering two e-mails in one pass.

I agree, it's hard work.

Wade,

I believe this thread is about optimizing the XO OS and GUI. That's why 
I call the requirement General_UI_sluggishness.

Optimizing applications is yet another challenge. I'm all for people 
doing that hard work and documenting it so the next person doesn't have 
to re-invent the wheel. Your performance URL is already posted to the 
page in the tools section. Let me know if you have any other links 
(GIT URLs?) or e-mails I should make easily accessible.

Michael,

The performance goal I worked out with Eben is on the page already. It
could be better but it's a start.

Lots of people have noticed.

Neil and Jordan analyzed which Cairo calls are causing the most trouble
and how long they take.  I also broke John's suggestions into general
areas: http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness

Could use more editing (e.g. swap suggestions may belong in memory, file 
read/write caching should be added etc.).

You're just scratching the surface with BW, latency and messages. CPU
cycles, process priority, caching, bottleneck definition, instruction
sets and compilers, word/block/sector size usage, and, if you're really
hard core, rows and columns are all optimizable.

If you have an algorithm improvement to offer, I'm all ears.

When we have a critical mass of time from professional engineers we can 
improve performance. Until then it waits and the users wait too.

Let's build on what we have, we're making progress.

Thanks,

Greg S





Re: performance work

2008-12-31 Thread Wade Brainerd
Hi Greg,
I think there's actually a lot of overlap between activity performance work
and OS performance work.

The bottlenecks I encountered and resolved were in PyGTK, Cairo, the Python
interpreter, librsvg, etc.  These are many of the same libraries which
are believed to cause sluggishness in the core UI.

Unfortunately, most of my performance success was had by replacing the
aforementioned libraries with custom C++ extension modules.  Each time I
remove some Cairo code and replace it with custom C++ code, the activity
gets faster.  I wouldn't advocate the Sugar developers take that approach :)

-Wade



Re: performance work

2008-12-31 Thread Mikus Grinbergs
What I find discouraging is mentally comparing the responsiveness 
in use of F10-based Joyride builds against what I remember (perhaps 
mistakenly) of responsiveness with Ship.2 builds.  [I don't 
currently have a Ship.2 system on hand for direct comparison.]

Examples of my (non-performance) experience with recent Joyrides :

  -  I use an external USB keyboard.  In Terminal, I type in a short 
command and press enter.  It can take more than half a second (or 
even longer) for the cursor to move from the line it was on (where I 
typed) to the start of a new screen line.

  -  I click on an Activity icon in Home View or Journal.  Up to TWO 
seconds later, the XO begins showing (and pulsing) the screen that 
is supposed to indicate Activity being launched.  [The only way I 
can get 'instant' response from the XO is to elicit the drop-down 
palette from the Activity icon, then click in that palette on Start 
(if in Home View) or on Resume (if in Journal).  Having done so, the 
vanishing of that palette is satisfyingly responsive.  (Of course, 
if I were to miss the palette entirely due to my shaky hands when 
I click, the palette vanishes without the Activity being launched.)]

mikus



Re: performance work

2008-12-30 Thread Jordan Crouse
Neil Graham wrote:
 On Mon, 2008-12-22 at 15:36 -0700, Jordan Crouse wrote:
 
 You might want to re-acquire the numbers with wireless turned off and 
 the system in a very quiet state.  If you want to be extra careful, you 
 can run the benchmarks in an empty X server (no sugar) and save the 
 results to a ramfs backed directory to avoid NAND. 
 
 
 The XO Numbers were recorded from a fairly inactive state.  Wireless was
 active but there shouldn't have been any traffic.  I did launch X with
 just an xterm, so sugar shouldn't be in play at all.  I didn't think of
 the speed of nand writes however.
 
 
  2) The accel path requires reading from video memory (which is 
 very slow)
 
 I'm curious as to why reads from video memory are so slow,  On standard
 video cards it's slow because there is quite a division between the CPU
 and the video memory,  but on the geode isn't the video memory shared in
 the same SDRAM as Main memory. 

It is, in that they share the same physical RAM chips, but they are 
controlled by different entities - one is managed by the system memory 
controller and the other is handled by the GPU.   At start up time, the 
memory is carved up by the firmware, and after the top of system RAM is 
established, video and system memory behave for all intents and purposes 
like separate components.  Put simply, there is no way to directly 
address video memory from the system memory.  Access to the video memory 
has to happen via PCI cycles, and for obvious reasons the active video 
region has the cache disabled, accounting for relatively slow readback.

That said, the read from memory performance is still worse than you
might expect - I never really got a good answer from the silicon guys as 
to why.  If Tom Sylla is still reading this list, he might know more.

 There's a separate 2 meg for DCON memory, but I was under the impression
 that was just to remember the last frame.
 
 Do I have that all wrong?   

No - that's right, there is a completely separate bank of chips just for
the DCON.

Jordan




Re: performance work

2008-12-23 Thread Greg Smith
Jordan and Neil,

That's great work, thanks!

Eben, Neil and Sugar people,

Can you tell from the test descriptions below which of these operations 
we are most likely to encounter in the XO GUI?

I think we can use the Cairo trace utility S found: 
http://wiki.laptop.org/go/Performance_tuning#Other

Turn that on with logging, then use the XO as normal (or as a kid would)
and generate the log file to see which are the most common Cairo calls.
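
Once such a log exists, tallying it is the easy part. A minimal sketch follows; it assumes a hypothetical log in which each non-empty line starts with the name of one call. The real cairo-trace output format is not described in this thread, so the parsing line would need adapting, and the sample call names are made up for illustration.

```python
from collections import Counter


def most_common_calls(lines, top=10):
    """Count call names, assuming each non-empty line starts with the name."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts.most_common(top)


# Hypothetical sample; a real run would iterate over the lines of the log file.
sample = ["paint", "fill", "paint", "", "mask", "paint", "fill"]
top_calls = most_common_calls(sample, top=2)
print(top_calls)
```

The resulting ranking is what would tell us which of the benchmarked operations matter for everyday XO use.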

I know Jordan gets way over scheduled. Let's help him prove that fixing 
a driver bug or two would benefit the UI performance, before he has to 
move on...

Thanks,

Greg S

Jordan Crouse wrote:
 Greg Smith wrote:
 Hi Jordan,

 Looks like we made a little more progress on graphics benchmarking. 
 See Neil's results below.

 I updated the feature page with the test results so far:
 http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness

 What's next?

 Do we know enough now to target a particular section of the code for 
 optimization?

 
 I ran the raw data through a script, and came up with a nice little 
 summary of where we stand.  My first general observation is that the 
 numbers are skewed due to system activity - recall that X runs in user 
 space, so it is subject to be preempted by the kernel.  I think that the 
 obviously high numbers in many of the results are due to NAND or 
 wireless interrupts (example):
 
 6: 2261923 (5.25 ms)
 7: 16690761 (38.73 ms)
 8: 2306919 (5.35 ms)
 
 You might want to re-acquire the numbers with wireless turned off and 
 the system in a very quiet state.  If you want to be extra careful, you 
 can run the benchmarks in an empty X server (no sugar) and save the 
 results to a ramfs backed directory to avoid NAND.  You probably don't 
 have to get _that_ extreme, but I don't want you to spend much time 
 trying to investigate a path only to find out that the numbers are wrong 
 due to a few writes().  In the results below, I tried to mitigate the 
 damage somewhat by removing the highest and lowest value.
 
 The list below is sorted by delta between accel and un-accel, with the
 worst tests on top (i.e. the ones where accel is actually hurting
 you) - these are good candidates to be looked at.  There are three
 reasons why unaccel would be faster than accel - 1) a bug in the accel
 code, 2) the accel path requires reading from video memory (which is
 very slow), and 3) the accel path doesn't punt to unaccel early enough.
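
The ranking described in this paragraph can be sketched in a few lines. The three rows are taken from the results in this message; the function name rank_by_delta and the tuple layout are assumptions for illustration, not the script that was actually used.

```python
results = [
    # (test name, accel ms, noaccel ms)
    ("solidtext-xlib-solidtext", 11.70, 9.51),
    ("textpath-xlib-textpath", 1562.60, 1345.12),
    ("texturedtext-xlib-texturedtext", 315.61, 140.54),
]


def rank_by_delta(rows):
    """Attach delta = accel - noaccel and sort with the worst regression first."""
    with_delta = [
        (name, accel, noaccel, round(accel - noaccel, 2))
        for name, accel, noaccel in rows
    ]
    return sorted(with_delta, key=lambda row: row[3], reverse=True)


ranked = rank_by_delta(results)
for name, accel, noaccel, delta in ranked:
    print(f"{name:<32} {accel:>8.2f} {noaccel:>8.2f} {delta:>7.2f}")
```

A positive delta means the accelerated path was slower, so the top of the list is where acceleration is hurting the most.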
 
 The first two on the list (textpath-xlib and texturedtext-xlib) toss up 
 a huge red flag - I am guessing we are probably seeing a bug in the driver.
 
 All of the upsample and downsample entries are interesting, because the 
 driver should be kicking back to the unaccelerated path - I'm guessing
 that 3) might be in effect here - though 73 ms is a long time.
 
 Most of the operations between 1ms and -1ms are probably going down the 
 unaccelerated path.  Most everything in there probably should be 
 unaccelerated, with the possible exception of the 'over' operations - 
 those are the easiest for the GPU to accelerate and the most heavily 
 used, so you probably want to take a look at those.
 
 As before, I encourage you to investigate which operations are heavily
 used - if you don't use textured text very much, then optimizing it
 would be heavy on the geek points, but not very useful in the long haul.
 
 Jordan
 Test (times in ms)                          Accel   Noaccel    Delta
 --------------------------------------------------------------------
 textpath-xlib-textpath                    1562.60   1345.12   217.48
 texturedtext-xlib-texturedtext             315.61    140.54   175.07
 downsample-nearest-xlib-512x512-redsquar   106.37     33.25    73.12
 downsample-bilinear-xlib-512x512-redsqua    96.57     35.22    61.35
 downsample-bilinear-xlib-512x512-primros    83.36     34.81    48.56
 downsample-nearest-xlib-512x512-lenna       78.18     29.83    48.35
 downsample-bilinear-xlib-512x512-lenna      83.91     36.32    47.59
 downsample-nearest-xlib-512x512-primrose    77.49     30.06    47.43
 upsample-nearest-xlib-48x48-todo            86.23     60.14    26.09
 upsample-bilinear-xlib-48x48-brokenlock    242.52    216.49    26.03
 upsample-bilinear-xlib-48x48-script        237.69    211.70    25.98
 upsample-bilinear-xlib-48x48-mail          234.40    208.43    25.97
 upsample-bilinear-xlib-48x48-todo          239.85    213.94    25.91
 upsample-nearest-xlib-48x48-script          81.67     57.02    24.65
 upsample-nearest-xlib-48x48-mail            78.99     54.42    24.57
 upsample-nearest-xlib-48x48-brokenlock      86.18     61.73    24.45
 upsample-nearest-48x48-script               61.95     57.46     4.49
 downsample-bilinear-512x512-redsquare       11.24      7.77     3.47
 solidtext-xlib-solidtext                    11.70      9.51     2.19
 textpath-textpath                         1081.14   1079.37     1.78
 texturedtext-texturedtext                  112.33    111.79     0.54
 upsample-bilinear-48x48-todo               224.06    223.68     0.37
 

Re: performance work

2008-12-22 Thread Greg Smith
Hi Jordan,

Looks like we made a little more progress on graphics benchmarking. See 
Neil's results below.

I updated the feature page with the test results so far:
http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness

What's next?

Do we know enough now to target a particular section of the code for 
optimization?

Thanks,

Greg S

***

Subject: Re: performance work
To: Wade Brainerd wad...@gmail.com
Cc: OLPC Development devel@lists.laptop.org, g...@laptop.org
Message-ID: 494e16aa.3070...@skierpage.com
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Wade Brainerd wrote:
On Tue, Dec 16, 2008 at 7:08 PM, Neil Graham l...@screamingduck.com

   Is there a build of cairo that can produce a log of what calls 
are used
   in typical XO use?

http://www.cairographics.org/FAQ/#performance_concerns says
Cairo provides a cairo-trace utility (currently only available from the
git development tree, but is planned for inclusion with Cairo 1.10)
(I think Joyride builds include Cairo 1.8.0, latest released Cairo is 1.8.6)

   Some good ways to find out are located here:
  
   http://wiki.laptop.org/go/Performance_tuning

I mentioned this.

--
=S

**
Neil said:
  I recommend running the Cairo benchmarks on the XO again with
  acceleration turned off in the X driver. This will give you a good
  indication of which operations are being accelerated and which are not.

Done.

http://screamingduck.com/Cruft/cairo_benchmark_XO_NoAccel.txt


At a cursory glance it looks like an overall improvement without
acceleration except for lines-xlib, add-xlib and over-xlib

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: performance work

2008-12-22 Thread Jordan Crouse
Greg Smith wrote:
 Hi Jordan,
 
 Looks like we made a little more progress on graphics benchmarking. See 
 Neil's results below.
 
 I updated the feature page with the test results so far:
 http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness
 
 What's next?
 
 Do we know enough now to target a particular section of the code for 
 optimization?
 

I ran the raw data through a script, and came up with a nice little 
summary of where we stand.  My first general observation is that the 
numbers are skewed due to system activity - recall that X runs in user 
space, so it is subject to being preempted by the kernel.  I think that the 
obviously high numbers in many of the results are due to NAND or 
wireless interrupts (example):

6: 2261923 (5.25 ms)
7: 16690761 (38.73 ms)
8: 2306919 (5.35 ms)

You might want to re-acquire the numbers with wireless turned off and 
the system in a very quiet state.  If you want to be extra careful, you 
can run the benchmarks in an empty X server (no sugar) and save the 
results to a ramfs backed directory to avoid NAND.  You probably don't 
have to get _that_ extreme, but I don't want you to spend much time 
trying to investigate a path only to find out that the numbers are wrong 
due to a few writes().  In the results below, I tried to mitigate the 
damage somewhat by removing the highest and lowest value.
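
The trimming described above can be sketched in a few lines of Python. This is only an illustration, not the script that produced the summary: it assumes each sample line looks like the example above (index, tick count, time in ms), and the names parse_samples and trimmed_mean are made up for this sketch.

```python
import re

# Matches benchmark sample lines such as "7: 16690761 (38.73 ms)".
SAMPLE_RE = re.compile(r"^\s*(\d+):\s+(\d+)\s+\(([\d.]+) ms\)\s*$")


def parse_samples(lines):
    """Return the millisecond value of every line that looks like a sample."""
    times = []
    for line in lines:
        match = SAMPLE_RE.match(line)
        if match:
            times.append(float(match.group(3)))
    return times


def trimmed_mean(times):
    """Average the samples after dropping one highest and one lowest value."""
    if len(times) < 3:
        raise ValueError("need at least three samples to trim both ends")
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    raw = [
        "6: 2261923 (5.25 ms)",
        "7: 16690761 (38.73 ms)",
        "8: 2306919 (5.35 ms)",
    ]
    print(trimmed_mean(parse_samples(raw)))
```

With only the three example lines, the 38.73 ms outlier and the 5.25 ms minimum are both dropped, so the single remaining sample decides the result. A real run has more samples per test, which makes the trimmed average more meaningful.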

The list below is sorted by delta between accel and un-accel, with the 
worst tests on top (i.e. the ones where accel is actually hurting 
you) - these are good candidates to be looked at.  There are three 
reasons why unaccel would be faster than accel - 1) a bug in the accel 
code, 2) the accel path requires reading from video memory (which is 
very slow), and 3) the accel path doesn't punt to unaccel early enough.
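
For anyone wanting to redo the ordering with fresh numbers, here is a minimal Python sketch of the delta sort. It assumes the per-test averages are already available as (test, accel, noaccel) tuples in milliseconds; sort_by_delta is a made-up name and the three sample rows are copied from the list in this mail.

```python
def sort_by_delta(results):
    """Sort (test, accel_ms, noaccel_ms) tuples by accel - noaccel.

    Returns (test, accel_ms, noaccel_ms, delta) rows, largest delta first.
    A positive delta means the accelerated run was slower.
    """
    rows = [
        (name, accel, noaccel, round(accel - noaccel, 2))
        for name, accel, noaccel in results
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    sample = [
        ("over-640x480-empty", 15.68, 15.61),
        ("textpath-xlib-textpath", 1562.60, 1345.12),
        ("texturedtext-xlib-texturedtext", 315.61, 140.54),
    ]
    for name, accel, noaccel, delta in sort_by_delta(sample):
        print(f"{name:<34}{accel:>9.2f}{noaccel:>9.2f}{delta:>8.2f}")
```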

The first two on the list (textpath-xlib and texturedtext-xlib) toss up 
a huge red flag - I am guessing we are probably seeing a bug in the driver.

All of the upsample and downsample entries are interesting, because the 
driver should be kicking back to the unaccelerated path - I'm guessing
that 3) might be in effect here - though 73 ms is a long time.

Most of the operations between 1ms and -1ms are probably going down the 
unaccelerated path.  Most everything in there probably should be 
unaccelerated, with the possible exception of the 'over' operations - 
those are the easiest for the GPU to accelerate and the most heavily 
used, so you probably want to take a look at those.

As before, I encourage you to investigate which operations are heavily 
used - if you don't use textured text very much, then optimizing it 
would be heavy on the geek points, but not very useful in the long haul.

Jordan
Test (times in ms)                           Accel   Noaccel    Delta
---------------------------------------------------------------------
textpath-xlib-textpath                     1562.60   1345.12   217.48
texturedtext-xlib-texturedtext              315.61    140.54   175.07
downsample-nearest-xlib-512x512-redsquar    106.37     33.25    73.12
downsample-bilinear-xlib-512x512-redsqua     96.57     35.22    61.35
downsample-bilinear-xlib-512x512-primros     83.36     34.81    48.56
downsample-nearest-xlib-512x512-lenna        78.18     29.83    48.35
downsample-bilinear-xlib-512x512-lenna       83.91     36.32    47.59
downsample-nearest-xlib-512x512-primrose     77.49     30.06    47.43
upsample-nearest-xlib-48x48-todo             86.23     60.14    26.09
upsample-bilinear-xlib-48x48-brokenlock     242.52    216.49    26.03
upsample-bilinear-xlib-48x48-script         237.69    211.70    25.98
upsample-bilinear-xlib-48x48-mail           234.40    208.43    25.97
upsample-bilinear-xlib-48x48-todo           239.85    213.94    25.91
upsample-nearest-xlib-48x48-script           81.67     57.02    24.65
upsample-nearest-xlib-48x48-mail             78.99     54.42    24.57
upsample-nearest-xlib-48x48-brokenlock       86.18     61.73    24.45
upsample-nearest-48x48-script                61.95     57.46     4.49
downsample-bilinear-512x512-redsquare        11.24      7.77     3.47
solidtext-xlib-solidtext                     11.70      9.51     2.19
textpath-textpath                          1081.14   1079.37     1.78
texturedtext-texturedtext                   112.33    111.79     0.54
upsample-bilinear-48x48-todo                224.06    223.68     0.37
upsample-nearest-48x48-brokenlock            64.46     64.16     0.30
upsample-bilinear-48x48-brokenlock          226.51    226.25     0.26
downsample-nearest-512x512-redsquare          2.43      2.23     0.19
gradients-linear-gradients-linear           107.39    107.30     0.09
over-640x480-empty                           15.68     15.61     0.07
over-640x480-opaque                          20.19     20.12     0.07
add-640x480-opaque                           20.77     20.73     0.04
upsample-nearest-48x48-todo                  60.75     60.71     0.04
add-640x480-transparentshapes                20.79     20.78     0.02
add-640x480-shapes                           20.76     20.74     0.02
multiple-clip-rectangles-multiple clip r      1.23

Re: performance work

2008-12-22 Thread Jordan Crouse
Greg Smith wrote:
 Hi Jordan,
 
 Looks like we made a little more progress on graphics benchmarking. See 
 Neil's results below.
 
 I updated the feature page with the test results so far:
 http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness
 
 What's next?
 
 Do we know enough now to target a particular section of the code for 
 optimization?

My previous email was pretty long, so I thought I would answer this last 
question separately.   I can help guide you with the operations that are 
slower with acceleration.   There may be other optimizations to be had 
within cairo or elsewhere in the X world, but I'll have to leave those 
to  people who understand that code better.

The majority of the operations will probably be composite operations. 
You will want to instrument the three composite hooks in the X driver 
and their sub-functions:  lx_check_composite, lx_prepare_composite, and 
lx_do_composite (in lx_exa.c).

lx_check_composite is the function where EXA checks to see if we are 
willing to do the operation at all - most of the acceleration rejects 
should happen here. lx_prepare_composite is where we store the 
information we need for the ensuing composite operation(s) - we can also 
bail out here, but there is an incremental cost in leading EXA further 
down the primrose path before rejecting it.  lx_do_composite() obviously 
is where the operation happens.  You will want to concentrate on these 
functions - instrument the code to figure out why we accept or reject an 
operation and why we take so long in rejecting certain operations. 
Profiling these functions may also help you figure out where we are 
spending our time.
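
One way to make sense of the instrumentation output is to print one line per accept/reject decision and tally the lines afterwards. The Python sketch below assumes a log format invented purely for this example (hook name, "accept" or "reject", an op number and an optional reason); the driver prints nothing like this unless you add the ErrorF() calls yourself, and tally_decisions is a made-up helper.

```python
from collections import Counter
import re

# Hypothetical format for lines added to the driver with ErrorF(), e.g.
#   "lx_check_composite: reject op=3 reason=unsupported mask"
#   "lx_prepare_composite: accept op=3"
DECISION_RE = re.compile(
    r"(lx_check_composite|lx_prepare_composite):\s+(accept|reject)"
    r"\s+op=(\d+)(?:\s+reason=(.*))?"
)


def tally_decisions(log_lines):
    """Count (hook, decision, op, reason) tuples found in the log lines."""
    counts = Counter()
    for line in log_lines:
        match = DECISION_RE.search(line)
        if match:
            hook, decision, op, reason = match.groups()
            counts[(hook, decision, int(op), (reason or "").strip())] += 1
    return counts


if __name__ == "__main__":
    log = [
        "lx_check_composite: reject op=3 reason=unsupported mask",
        "lx_check_composite: reject op=3 reason=unsupported mask",
        "lx_prepare_composite: accept op=3",
    ]
    for key, count in tally_decisions(log).most_common():
        print(count, key)
```

Sorting by count shows which rejects dominate, and seeing many rejects coming from lx_prepare_composite rather than lx_check_composite would point at the "rejecting too late" case.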

So, in short - become one with the ErrorF() and good luck... :)

Jordan


Re: performance work

2008-12-22 Thread Neil Graham
On Mon, 2008-12-22 at 15:36 -0700, Jordan Crouse wrote:

 You might want to re-acquire the numbers with wireless turned off and 
 the system in a very quiet state.  If you want to be extra careful, you 
 can run the benchmarks in an empty X server (no sugar) and save the 
 results to a ramfs backed directory to avoid NAND. 


The XO numbers were recorded from a fairly inactive state.  Wireless was
active but there shouldn't have been any traffic.  I did launch X with
just an xterm, so sugar shouldn't be in play at all.  I didn't think of
the speed of NAND writes, however.


  2) The accel path requires reading from video memory (which is 
 very slow)

I'm curious as to why reads from video memory are so slow.  On standard
video cards it's slow because there is quite a division between the CPU
and the video memory, but on the Geode isn't the video memory shared in
the same SDRAM as main memory?

There's a separate 2 meg for DCON memory, but I was under the impression
that was just to remember the last frame.

Do I have that all wrong?   





Re: performance work

2008-12-21 Thread S Page
Wade Brainerd wrote:
On Tue, Dec 16, 2008 at 7:08 PM, Neil Graham l...@screamingduck.com

 Is there a build of cairo that can produce a log of what calls are used
 in typical XO use?

http://www.cairographics.org/FAQ/#performance_concerns says
Cairo provides a cairo-trace utility (currently only available from the 
git development tree, but is planned for inclusion with Cairo 1.10)
(I think Joyride builds include Cairo 1.8.0, latest released Cairo is 1.8.6)

 Some good ways to find out are located here:
 
 http://wiki.laptop.org/go/Performance_tuning

I mentioned this.

--
=S


Re: performance work

2008-12-17 Thread Wade Brainerd
On Tue, Dec 16, 2008 at 7:08 PM, Neil Graham l...@screamingduck.com wrote:

 On Tue, 2008-12-16 at 16:23 -0700, Jordan Crouse wrote:
  The first thing you need to do is determine which operations you really
  care about. I would first target the operations that deal with text and
  rounded corners, since those will be the most complex. Straight blits
  and rectangle fills are important, but less interesting, since they
  involve the least work in the path between you and the GPU.
 Fundamentally, you care about the operations that are making it slow.
 Those are the ones that A) are used a lot, B) take notable amounts of time in
 total and C) have room for improvement.

 Is there a build of cairo that can produce a log of what calls are used
 in typical XO use?


Some good ways to find out are located here:

http://wiki.laptop.org/go/Performance_tuning

I personally most often use oprofile, without vmlinux (I don't know where to
get a vmlinux file for the olpc kernel).

-Wade


Re: performance work

2008-12-17 Thread Neil Graham
On Tue, 2008-12-16 at 16:23 -0700, Jordan Crouse wrote:

 I recommend running the Cairo benchmarks on the XO again with 
 acceleration turned off in the X driver. This will give you a good 
 indication of which operations are being accelerated and which are not. 

Done.

http://screamingduck.com/Cruft/cairo_benchmark_XO_NoAccel.txt


At a cursory glance it looks like an overall improvement without
acceleration except for lines-xlib, add-xlib and over-xlib 





Re: performance work

2008-12-16 Thread Greg Smith
Forwarding this to devel.

Any comments or suggestions on how we can start to optimize graphics 
performance are appreciated.

It looks like we have a good test bed in place which should help us 
focus on the right bottlenecks.

Thanks,

Greg S

Greg Smith wrote:
 Hi Neil,
 
 That's great data, thanks!
 
 
 I put these links here for tracking: 
 http://wiki.laptop.org/go/Feature_roadmap/General_UI_sluggishness
 
 John,
 
 Do you have further suggestions on what bottlenecks this points to? 
 What part of the code should be optimized to improve the graphics 
 performance based on these results and what do you think Neil's next 
 steps should be?
 
 Thanks,
 
 Greg S
 
 Neil Graham wrote:
 On Tue, 2008-12-09 at 15:43 -0500, Greg Smith wrote:

 Three ideas on how you can help.

 1 - There is a recent thread on SVG performance. See: 
 http://lists.sugarlabs.org/archive/sugar-devel/2008-December/010200.html

 You may find something there you can contribute to.

 2 - I also get the impression we do need to work on the Cairo front. 
 If you can list a set of bugs, we can flag them as useful for 9.1 and 
 track them.

 Well, to start off with, I compiled the cairo benchmarks and ran them on
 my slowest PC (2GHz) and the XO (from a basic startx).

 http://screamingduck.com/Cruft/cairo_benchmark_XO.txt
 http://screamingduck.com/Cruft/cairo_benchmark_2GHz_E2180.txt


 At least this gives me some base data to work with.  Some of the tests
 on the XO have some eyebrow raising results, such as...

 downsample-nearest
 Testing 512x512-lenna...
 0: 851892 (1.98 ms)
 1: 855671 (1.99 ms)
 2: 905907 (2.10 ms)
 3: 862388 (2.00 ms)
 4: 852743 (1.98 ms)

 downsample-nearest-xlib
 Testing 512x512-lenna...
 0: 10102252 (23.44 ms)
 1: 33629542 (78.02 ms)
 2: 33715350 (78.22 ms)
 3: 34031523 (78.96 ms)



 


Re: performance work

2008-12-16 Thread Jordan Crouse
Greg Smith wrote:
 Forwarding this to devel.
 
 Any comments or suggestions on how we can start to optimize graphics 
 performance are appreciated.

That is a rather open-ended question.  I'll try to point you at some 
interesting places to start, with the understanding that no one thing
is going to solve all your problems - the total processing time is 
almost definitely a cumulative effect of all of the different stages of 
the rendering pipeline.

I would start by establishing a 1:1 baseline - it is great to compare 
against a 2GHz Intel box, but the differences between the two 
platforms are just too extreme.  No matter how good the graphics gets, 
we are still constrained by the Geode clock speed, FPU performance, and 
GPU feature set (what it can, and most importantly _cannot_ do).

The first thing you need to do is determine which operations you really 
care about. I would first target the operations that deal with text and 
rounded corners, since those will be the most complex. Straight blits 
and rectangle fills are important, but less interesting, since they 
involve the least work in the path between you and the GPU.

I recommend running the Cairo benchmarks on the XO again with 
acceleration turned off in the X driver. This will give you a good 
indication of which operations are being accelerated and which are not. 
  If you have another Geode platform handy (which you should if you are 
at 1CC), then you might also want to run the same benchmarks again 
against the vesa driver (which will be completely unaccelerated).  The 
difference in the three sets of data will give you a good idea of which 
operations are unaccelerated, and which operations are being further 
delayed by the Geode X driver.
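
A rough way to read the three data sets side by side is sketched below in Python. The labels and the 1 ms tolerance are assumptions made for this illustration only, and classify and classify_all are made-up helpers, not part of any benchmark tool.

```python
def classify(accel_ms, noaccel_ms, vesa_ms, tolerance_ms=1.0):
    """Label one test from its timings under three configurations.

    accel_ms   - Geode driver with acceleration on
    noaccel_ms - Geode driver with acceleration off
    vesa_ms    - vesa driver (completely unaccelerated)
    tolerance_ms is an arbitrary cut-off for "about the same".
    """
    if accel_ms < noaccel_ms - tolerance_ms:
        return "accelerated"
    if accel_ms > noaccel_ms + tolerance_ms:
        return "slower with acceleration"
    if noaccel_ms > vesa_ms + tolerance_ms:
        return "unaccelerated, and slower than vesa"
    return "unaccelerated"


def classify_all(results):
    """results maps test name -> (accel_ms, noaccel_ms, vesa_ms)."""
    return {name: classify(*timings) for name, timings in results.items()}
```

Tests that come out as "unaccelerated, and slower than vesa" are the ones where the Geode driver itself seems to be adding delay; the other unaccelerated ones are candidates for asking why the GPU is not being used.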

The low hanging fruit here are the operations that are not being 
accelerated; you will need to determine why.  Sometimes it's because the 
GPU cannot handle the operation (for example, operations on a8 
destinations), or it might be because the operation was never implemented 
in the code, or it could be that the code is just downright buggy.
This is where it is important to know which operations you care most 
about.  You could probably find a good number of bugs in the two pass 
operations (PictOpXor and PictOpAtop) but both are rarely used and not a 
good use of your time.  I have no problems at all with biasing the 
driver toward very common operations.  If there is something that can be 
done to the driver to improve text rendering at the cost of say, 
rotation, then I'm all for it.

Outside of the driver, you are pretty much limited to evaluating 
algorithms, either in the software render code (pixman) or in the cairo 
code.  For those situations, I have less knowledge, but I do advise you 
to remember the two hardware constraints which I mentioned above - CPU 
clock speed and FPU performance.  Remember that a lot of this code was 
written recently, when nobody in their right mind has < 1GHz on their 
desktop - no matter how hard they try, this will end up biasing the code 
slightly.  FPU performance is more serious. The Geode does not do well 
with heavy FPU use - to mitigate the damage, try to use single precision 
only, and try not to use a lot of FPU operations in a row because the 
Geode pipeline stalls horribly if two FPU operations are scheduled one 
after another.

Finally, I will remind you that no amount of hacking is going 
to magically make the Geode + Geode GPU all of a sudden look like a 
modern desktop Radeon.  There are many modern GPU concepts that desktop 
toolkits are becoming increasingly dependent on that the Geode just 
cannot grok.  Fading icons and anti-aliasing and animations may look 
really neat on your 2Ghz Intel, but they are a major strain on CPU 
resources on the Geode.  I'm not saying that there isn't room for 
improvement, but I am saying that at some point you will have to make 
compromises between what the UI does, and what the hardware can do. 
Until you are willing to bite that bullet, any optimizations you make 
under the hood will be a treatment but never a cure.

Jordan


Re: performance work

2008-12-16 Thread Neil Graham

On Tue, 2008-12-16 at 16:23 -0700, Jordan Crouse wrote:

 I would start by establishing a 1:1 baseline - it is great to compare 
 against a 2GHz Intel box, but the differences between the two 
 platforms are just too extreme.  No matter how good the graphics gets, 
 we are still constrained by the Geode clock speed, FPU performance, and 
 GPU feature set (what it can, and most importantly _cannot_ do).
I'm not even sure there _is_ a decent 1:1 baseline (and if there were,
wouldn't it produce the exact same results?).  I did the 2GHz machine because
it's my slowest running box and more data can't hurt.  I suspect it
would be more valuable to compare the ratios of speeds between different
tests on the same machine rather than across machines.   At the very
least it can impress upon people the speed difference between the
machines.
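
Something like the following Python sketch is what I mean by comparing ratios within one machine. The helper name relative_to is made up for this example, and the two XO timings are samples taken from the benchmark output quoted above (512x512-lenna).

```python
def relative_to(reference, timings_ms):
    """Scale every timing by the reference test measured on the same machine."""
    base = timings_ms[reference]
    if base <= 0:
        raise ValueError("reference timing must be positive")
    return {name: ms / base for name, ms in timings_ms.items()}


if __name__ == "__main__":
    # Samples from the XO run quoted above (512x512-lenna).
    xo = {
        "downsample-nearest": 1.98,
        "downsample-nearest-xlib": 78.02,
    }
    for name, ratio in relative_to("downsample-nearest", xo).items():
        print(f"{name}: {ratio:.1f}x")
```

Running the same function over the results from another machine gives ratios that can be compared directly, even though the absolute times cannot.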


 The first thing you need to do is determine which operations you really 
 care about. I would first target the operations that deal with text and 
 rounded corners, since those will be the most complex. Straight blits 
 and rectangle fills are important, but less interesting, since they 
 involve the least work in the path between you and the GPU.
Fundamentally, you care about the operations that are making it slow.
Those are the ones that A) are used a lot, B) take notable amounts of time in
total and C) have room for improvement.

Is there a build of cairo that can produce a log of what calls are used
in typical XO use?

 
 I recommend running the Cairo benchmarks on the XO again with 
 acceleration turned off in the X driver.

That's just a xorg.conf change?  I can do that and rerun the benchmark.


 Outside of the driver, you are pretty much limited to evaluating 
 algorithms, either in the software render code (pixman) or in the cairo 
 code.  For those situations, I have less knowledge, but I do advise you 
 to remember the two hardware constraints which I mentioned above - CPU 
 clock speed and FPU performance.  Remember that a lot of this code was 
 written recently, when nobody in their right mind has < 1GHz on their 
 desktop - no matter how hard they try, this will end up biasing the code 
 slightly.  FPU performance is more serious. The Geode does not do well 
 with heavy FPU use - to mitigate the damage, try to use single precision 
 only, and try not to use a lot of FPU operations in a row because the 
 Geode pipeline stalls horribly if two FPU operations are scheduled one 
 after another.

 
 Finally, I will remind you that no amount of hacking is going 
 to magically make the Geode + Geode GPU all of a sudden look like a 
 modern desktop Radeon.  

Agreed, but it should at least do something in the range of my old
166MHz system with an S3 card, which it currently doesn't at a user
experience level.  How much of that is inefficiency and how much is
trying to do too much remains to be seen.


