Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-27 Thread Bengt Richter

On 03/26/2013 10:51 PM Matt Turner wrote:

On Tue, Mar 26, 2013 at 2:44 PM, Bengt Richter b...@oz.net wrote:

uint32_t
component_delta2(uint32_t next, uint32_t prev)
{
    return ((((next&0xff00ff)-(prev&0xff00ff)+0x100)&0xff00ff)+
            (((next&0xff00)-(prev&0xff00))&0xff00));
}


Does removing all the spaces make it faster? ;)

LOL .. actually I didn't put them in in the first place ;-)
But inlining might make the calling loop faster.

Hm, easy to try now ... inlining cut the time almost in half again.
I assigned to a volatile so the loop wouldn't get optimized away.
I just have a loop in the test kludge like

else if (strcmp(argv[1],"v2")==0){
    for (i=0;i<256*256*256;++i){
        antiO2 = component_delta2(i, i^0xff);
    }
}

For the above with inlined component_delta2 I get
55ms vs 95ms not inlined, vs orig not inlined 167ms, FWIW.
My optimization reflex just got triggered, I didn't look at the
full post context to see if it might really be useful or not.
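
A minimal sketch for checking that the masked variant agrees with the per-channel version, in case anyone wants to try it. The per-channel function is restated from the component_delta() quoted elsewhere in this thread; count_mismatches() and its second pairing are hypothetical, made up for illustration, and not part of the test kludge above.

```c
#include <stdint.h>

/* Per-channel reference: each 8-bit difference wraps modulo 256. */
static uint32_t component_delta(uint32_t next, uint32_t prev)
{
    unsigned char dr, dg, db;

    dr = (next >> 16) - (prev >> 16);
    dg = (next >> 8) - (prev >> 8);
    db = (next >> 0) - (prev >> 0);

    return ((uint32_t)dr << 16) | ((uint32_t)dg << 8) | (uint32_t)db;
}

/* Masked variant: red and blue share one subtraction, green gets its own.
 * The +0x100 absorbs a borrow out of blue so it never reaches red. */
static inline uint32_t component_delta2(uint32_t next, uint32_t prev)
{
    return ((((next & 0xff00ff) - (prev & 0xff00ff) + 0x100) & 0xff00ff) +
            (((next & 0xff00) - (prev & 0xff00)) & 0xff00));
}

/* Hypothetical helper: count disagreements over every stride-th value of i. */
static unsigned long count_mismatches(uint32_t stride)
{
    unsigned long bad = 0;
    uint32_t i;

    for (i = 0; i < 256u * 256u * 256u; i += stride) {
        uint32_t a = i ^ 0xff;          /* same pairing as the timing loop */
        uint32_t b = ~i & 0xffffff;     /* assumed extra pairing, all channels differ */

        if (component_delta(i, a) != component_delta2(i, a))
            bad++;
        if (component_delta(i, b) != component_delta2(i, b))
            bad++;
    }
    return bad;
}
```

Calling count_mismatches(1) walks all 2^24 values of i; a result of 0 only says the two versions agree on the sampled pairs, not on every possible (next, prev) combination.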

Regards,
Bengt Richter

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-26 Thread Pekka Paalanen
On Tue, 26 Mar 2013 03:30:58 +0100
Rune Kjær Svendsen runesv...@gmail.com wrote:

 Marek, do you have an idea on where the current bottleneck is?
 
 I just did a profiling with sysprof, zooming in on the desktop in Weston
 and moving the mouse wildly around, so that the buffer is completely
 changed for every frame. I got around 5 fps, which isn't *that* much, but
 still an order of magnitude better than without your patches.
 
 sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS
 recording, it's not in a single function, but spread out over several
 functions:
 
 35% weston_recorder_frame_notify
 11% __memcpy_ssse3
 4.5% clear_page_c
 4.3% output_run
 
 Although I'm not completely sure I'm reading the sysprof output right.
 weston_recorder_frame_notify, for example, has 35% CPU usage, but none of
 its child functions has any significant CPU usage. I presume the CPU usage
 in that function is from calling glReadPixels, although that's not apparent
 from sysprof:
 
 weston_recorder_frame_notify   39.15%  39.15%
  - - kernel - -                 0.00%   0.01%
   ret_from_intr                 0.00%   0.01%
    __irqentry_text_start        0.00%

Well, if you look at weston_recorder_frame_notify function, it has a
naive loop over each single pixel it processes. component_delta() may
get inlined, and output_run() you saw in the profile.

I think it's possible it actually is weston_recorder_frame_notify
eating the CPU. Can you get more precise profiling, like hits per source
line or instruction?


Thanks,
pq


Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-26 Thread Matt Turner
On Tue, Mar 26, 2013 at 2:44 PM, Bengt Richter b...@oz.net wrote:
 uint32_t
 component_delta2(uint32_t next, uint32_t prev)
 {
     return ((((next&0xff00ff)-(prev&0xff00ff)+0x100)&0xff00ff)+
             (((next&0xff00)-(prev&0xff00))&0xff00));
 }

Does removing all the spaces make it faster? ;)


Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-26 Thread Marek Olšák
I recommend using OpenMP for this kind of pixel processing,
specifically the parallel for. It's pretty easy to use and you could
do wonders with it, i.e. taking advantage of all CPU cores on *any*
system. You could also offload the whole thing to another thread and
continue there.
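
A rough sketch of what the parallel for could look like, assuming the per-pixel delta pass is split out from the run-length encoding (which is sequential as it stands). frame_delta() and its parameters are hypothetical names for illustration; component_delta() is restated from the code quoted below.

```c
#include <stdint.h>

/* Per-channel delta, as in the component_delta() quoted in this thread. */
static inline uint32_t component_delta(uint32_t next, uint32_t prev)
{
    unsigned char dr, dg, db;

    dr = (next >> 16) - (prev >> 16);
    dg = (next >> 8) - (prev >> 8);
    db = (next >> 0) - (prev >> 0);

    return ((uint32_t)dr << 16) | ((uint32_t)dg << 8) | (uint32_t)db;
}

/* Hypothetical helper: fill delta[] for a whole frame and update prev[]
 * in place. Every iteration touches only index i, so the iterations are
 * independent and can be divided among threads. */
static void frame_delta(const uint32_t *next, uint32_t *prev,
                        uint32_t *delta, long count)
{
    long i;

#pragma omp parallel for
    for (i = 0; i < count; i++) {
        delta[i] = component_delta(next[i], prev[i]);
        prev[i] = next[i];
    }
}
```

With GCC this needs -fopenmp to actually run in parallel; without the flag the pragma is ignored and the loop runs serially with the same result. Whether it helps in practice depends on memory bandwidth as much as on core count.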

Marek

On Tue, Mar 26, 2013 at 6:26 PM, Rune Kjær Svendsen runesv...@gmail.com wrote:
 (I'm re-sending this message because the attachment was too large for
 mesa-dev, and because I want to add wayland-devel CC. The valgrind output
 can be found here: http://runeks.dk/files/callgrind.out.11362).

 Seems like you are right Pekka.

 I just ran weston through valgrind, and got some interesting results. I ran
 it like so:

 valgrind --tool=callgrind --dump-instr=yes --trace-jump=yes weston

 Which allows me to get the time spent on a per-instruction level. Now, this
 is running inside a virtual machine, so it won't be the same as running it
 natively, but it agrees with the other benchmarks in the sense that it
 suggests it is the simple calculations inside screenshooter.c that take up
 most of the CPU time, not calls to outside functions (like it was before
 with the slow glReadPixels() path).

 The callgrind output can be found at the following URL:
 http://runeks.dk/files/callgrind.out.11362
 Open it with KCachegrind, select the function
 weston_recorder_frame_notify(), and go to the Machine Code tab in the
 lower right corner to see the interesting stuff.

 According to callgrind, a total of 54.39% CPU time is used in the four lines
 251, 252, 253 and 255 in screenshooter.c. That's the function
 component_delta():

 dr = (next >> 16) - (prev >> 16);
 dg = (next >>  8) - (prev >>  8);
 db = (next >>  0) - (prev >>  0);

 return (dr << 16) | (dg << 8) | (db << 0);

 Additionally, the lines 358, 359, 361, 362, 363 take up 25.9% CPU time. That
 is the innermost for-loop inside weston_recorder_frame_notify(). It's the
 following lines with the call to component_delta() not included (line 360)
 since we've already included the CPU usage of that:

 for (k = 0; k < width; k++) {
     next = *s++;
     delta = component_delta(next, *d);
     *d++ = next;
     if (run == 0 || delta == prev) {
         run++;

 And then finally the call to output_run() on line 365 takes up 10.6% CPU
 time:

 p = output_run(p, prev, run);

 Inside output_run(), the lines that take up the most CPU time are: 232, 233,
 234, 235, 239, 240. They take up 9.94% CPU time. So it's basically the whole
 while loop in output_run() except the line with the call to __builtin_clz()
 (line 238):

 while (run > 0) {
     if (run <= 0xe0) {
         *p++ = delta | ((run - 1) << 24);
         break;
     }

     i = 24 - __builtin_clz(run);
     *p++ = delta | ((i + 0xe0) << 24);
     run -= 1 << (7 + i);
 }

 All of that adds up to 90.89% CPU time spent inside screenshooter.c.
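
 To make the shape of that hot path concrete, here is a hedged sketch that assembles the quoted fragments into something compilable. encode_row() is a hypothetical wrapper; the else branch, the per-row flush of the last run and the output buffer sizing are assumptions for illustration, not quotes from screenshooter.c. __builtin_clz() is a GCC/Clang builtin.

```c
#include <stdint.h>

static uint32_t component_delta(uint32_t next, uint32_t prev)
{
    unsigned char dr, dg, db;

    dr = (next >> 16) - (prev >> 16);
    dg = (next >> 8) - (prev >> 8);
    db = (next >> 0) - (prev >> 0);

    return ((uint32_t)dr << 16) | ((uint32_t)dg << 8) | (uint32_t)db;
}

/* Emit one run: the low 24 bits carry the delta, the top byte the length.
 * Codes 0x00..0xdf mean run lengths 1..224; code 0xe0+i means 1 << (7 + i). */
static uint32_t *output_run(uint32_t *p, uint32_t delta, int run)
{
    int i;

    while (run > 0) {
        if (run <= 0xe0) {
            *p++ = delta | ((uint32_t)(run - 1) << 24);
            break;
        }

        i = 24 - __builtin_clz(run);
        *p++ = delta | ((uint32_t)(i + 0xe0) << 24);
        run -= 1 << (7 + i);
    }

    return p;
}

/* Hypothetical wrapper: delta-encode one row of pixels s against the
 * previous frame d (updated in place) and write the runs to p.
 * Assumes p has room for at least width words. */
static uint32_t *encode_row(uint32_t *p, const uint32_t *s, uint32_t *d,
                            int width)
{
    uint32_t next, delta, prev = 0;
    int k, run = 0;

    for (k = 0; k < width; k++) {
        next = *s++;
        delta = component_delta(next, *d);
        *d++ = next;
        if (run == 0 || delta == prev) {
            run++;
        } else {
            p = output_run(p, prev, run);  /* delta changed: flush old run */
            run = 1;
        }
        prev = delta;
    }

    /* Assumed: flush whatever run is still open at the end of the row. */
    return output_run(p, prev, run);
}
```

 Every pixel goes through one component_delta() call and one compare, which fits the profile: when the whole screen changes each frame, runs are short and output_run() is called for nearly every pixel.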

 /Rune

 On Tue, Mar 26, 2013 at 8:58 AM, Pekka Paalanen ppaala...@gmail.com wrote:
 On Tue, 26 Mar 2013 03:30:58 +0100
 Rune Kjær Svendsen runesv...@gmail.com wrote:

 Marek, do you have an idea on where the current bottleneck is?

 I just did a profiling with sysprof, zooming in on the desktop in Weston
 and moving the mouse wildly around, so that the buffer is completely
 changed for every frame. I got around 5 fps, which isn't *that* much, but
 still an order of magnitude better than without your patches.

 sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS
 recording, it's not in a single function, but spread out over several
 functions:

 35% weston_recorder_frame_notify
 11% __memcpy_ssse3
 4.5% clear_page_c
 4.3% output_run

 Although I'm not completely sure I'm reading the sysprof output right.
 weston_recorder_frame_notify, for example, has 35% CPU usage, but none of
 its child functions has any significant CPU usage. I presume the CPU usage
 in that function is from calling glReadPixels, although that's not
 apparent
 from sysprof:

 weston_recorder_frame_notify   39.15%  39.15%
  - - kernel - -                 0.00%   0.01%
   ret_from_intr                 0.00%   0.01%
    __irqentry_text_start        0.00%

 Well, if you look at weston_recorder_frame_notify function, it has a
 naive loop over each single pixel it processes. component_delta() may
 get inlined, and output_run() you saw in the profile.

 I think it's possible it actually is weston_recorder_frame_notify
 eating the CPU. Can you get more precise profiling, like hits per source
 line or instruction?


 Thanks,
 pq


 Med venlig hilsen,

 Rune Kjær Svendsen
 Østerbrogade 111, 3. - 302
 2100 København Ø
 Tlf.: 2835 0726



Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-25 Thread Rune Kjær Svendsen
Marek, do you have an idea on where the current bottleneck is?

I just did a profiling with sysprof, zooming in on the desktop in Weston
and moving the mouse wildly around, so that the buffer is completely
changed for every frame. I got around 5 fps, which isn't *that* much, but
still an order of magnitude better than without your patches.

sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS
recording, it's not in a single function, but spread out over several
functions:

35% weston_recorder_frame_notify
11% __memcpy_ssse3
4.5% clear_page_c
4.3% output_run

Although I'm not completely sure I'm reading the sysprof output right.
weston_recorder_frame_notify, for example, has 35% CPU usage, but none of
its child functions has any significant CPU usage. I presume the CPU usage
in that function is from calling glReadPixels, although that's not apparent
from sysprof:

weston_recorder_frame_notify                  39.15%  39.15%
 - - kernel - -                                0.00%   0.01%
  ret_from_intr                                0.00%   0.01%
   __irqentry_text_start                       0.00%   0.01%
    irq_exit                                   0.00%   0.01%
     do_softirq                                0.00%   0.01%
      call_softirq                             0.00%   0.01%
       __do_softirq                            0.00%   0.01%
        blk_done_softirq                       0.00%   0.01%
         scsi_softirq_done                     0.00%   0.01%
          scsi_finish_command                  0.00%   0.01%
           scsi_io_completion                  0.00%   0.01%
            blk_end_request                    0.00%   0.01%
             blk_end_bidi_request              0.00%   0.01%
              blk_update_bidi_request          0.00%   0.01%
               blk_update_request              0.00%   0.01%
                req_bio_endio.isra.46          0.00%   0.01%
                 bio_endio                     0.00%   0.01%
                  end_swap_bio_write           0.00%   0.01%
                   end_page_writeback          0.00%   0.01%
                    rotate_reclaimable_page    0.01%   0.01%

Another possible bottleneck is simply disk access, although it doesn't seem
to be relevant on my system (since I have 100% CPU usage). The 36-second
recording I made was 1.3 GB in size, so that's around 36 MB/s.

Med venlig hilsen,

Rune Kjær Svendsen
Østerbrogade 111, 3. - 302
2100 København Ø
Tlf.: 2835 0726


On Mon, Mar 18, 2013 at 1:20 AM, Marek Olšák mar...@gmail.com wrote:

 Slowness is not usually a bug.

 I guess it can be optimized even more. It depends on where the
 bottleneck is now.

 Marek

 On Sun, Mar 17, 2013 at 10:14 PM, Rune Kjær Svendsen
 runesv...@gmail.com wrote:
  Thank you very much! This is much better. It's gone from 0.5-ish FPS when
  zooming in to around 10 FPS, depending on screen content.
 
  So I figure this isn't a bug? I assumed it was a bug, but is the case
 simply
  that an efficient glReadPixels path for radeon/gallium doesn't exist?
 
  The patch set sure helps in that regard, although it'd be really nice to
 get
  30 FPS consistently, if at all possible.
 
  Thanks again.
 
  /Rune
 
 
  On Sun, Mar 17, 2013 at 6:46 PM, Andreas Boll 
 andreas.boll@gmail.com
  wrote:
 
  2013/3/17 Rune Kjær Svendsen runesv...@gmail.com:
   Hello list
  
   I'm having problems recording the desktop content using the Weston
   compositor's built-in recording function. When I start a recording and
   do
   something that changes a lot of screen content (like zooming in on the
   desktop, for example), I get around 0.5 FPS. Using sysprof, I can see
   that
   ~98% of CPU is used in the function unpack_XRGB(). krh has told me
   this
   is caused by glReadPixels going through a slowpath. I have a Radeon HD
   5770
   GPU and I'm using mesa git (I've tried the mesa version in the Ubuntu
   12.10
   repos, and the xorg-edgers PPA, same result).
  
   Does anyone know what the issue could be, or how to debug the problem
   further?
  
 
  This patch series [1] should help. You might want to try it.
 
  [1]
 http://lists.freedesktop.org/archives/mesa-dev/2013-March/036214.html
 
   Doing some debugging, it seems the call to ctx->Driver.ReadPixels() in
   _mesa_ReadnPixelsARB leads to _mesa_readpixels() being called in
   readpix.c.
  
   I'm attaching some output of gdb that will hopefully be useful.
  
   I'm also attaching the debug terminal output of running Weston with
 the
   DRM
   backend.
  
   Let me know if I can provide other useful information.
  
   

[Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-17 Thread Rune Kjær Svendsen
Hello list

I'm having problems recording the desktop content using the Weston
compositor's built-in recording function. When I start a recording and do
something that changes a lot of screen content (like zooming in on the
desktop, for example), I get around 0.5 FPS. Using sysprof, I can see that
~98% of CPU is used in the function unpack_XRGB(). krh has told me this
is caused by glReadPixels going through a slowpath. I have a Radeon HD 5770
GPU and I'm using mesa git (I've tried the mesa version in the Ubuntu 12.10
repos, and the xorg-edgers PPA, same result).

Does anyone know what the issue could be, or how to debug the problem
further?

Doing some debugging, it seems the call to ctx->Driver.ReadPixels()
in _mesa_ReadnPixelsARB leads to _mesa_readpixels() being called in
readpix.c.

I'm attaching some output of gdb that will hopefully be useful.

I'm also attaching the debug terminal output of running Weston with the DRM
backend.

Let me know if I can provide other useful information.


weston-mesa.log
Description: Binary data


gdb-mesa.log
Description: Binary data


Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-17 Thread Andreas Boll
2013/3/17 Rune Kjær Svendsen runesv...@gmail.com:
 Hello list

 I'm having problems recording the desktop content using the Weston
 compositor's built-in recording function. When I start a recording and do
 something that changes a lot of screen content (like zooming in on the
 desktop, for example), I get around 0.5 FPS. Using sysprof, I can see that
 ~98% of CPU is used in the function unpack_XRGB(). krh has told me this
 is caused by glReadPixels going through a slowpath. I have a Radeon HD 5770
 GPU and I'm using mesa git (I've tried the mesa version in the Ubuntu 12.10
 repos, and the xorg-edgers PPA, same result).

 Does anyone know what the issue could be, or how to debug the problem
 further?


This patch series [1] should help. You might want to try it.

[1] http://lists.freedesktop.org/archives/mesa-dev/2013-March/036214.html

 Doing some debugging, it seems the call to ctx->Driver.ReadPixels() in
 _mesa_ReadnPixelsARB leads to _mesa_readpixels() being called in readpix.c.

 I'm attaching some output of gdb that will hopefully be useful.

 I'm also attaching the debug terminal output of running Weston with the DRM
 backend.

 Let me know if I can provide other useful information.



Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-17 Thread Rune Kjær Svendsen
Thank you very much! This is much better. It's gone from 0.5-ish FPS when
zooming in to around 10 FPS, depending on screen content.

So I figure this isn't a bug? I assumed it was a bug, but is the case
simply that an efficient glReadPixels path for radeon/gallium doesn't exist?

The patch set sure helps in that regard, although it'd be really nice to
get 30 FPS consistently, if at all possible.

Thanks again.

/Rune


On Sun, Mar 17, 2013 at 6:46 PM, Andreas Boll andreas.boll@gmail.com wrote:

 2013/3/17 Rune Kjær Svendsen runesv...@gmail.com:
  Hello list
 
  I'm having problems recording the desktop content using the Weston
  compositor's built-in recording function. When I start a recording and do
  something that changes a lot of screen content (like zooming in on the
  desktop, for example), I get around 0.5 FPS. Using sysprof, I can see
 that
  ~98% of CPU is used in the function unpack_XRGB(). krh has told me
 this
  is caused by glReadPixels going through a slowpath. I have a Radeon HD
 5770
  GPU and I'm using mesa git (I've tried the mesa version in the Ubuntu
 12.10
  repos, and the xorg-edgers PPA, same result).
 
  Does anyone know what the issue could be, or how to debug the problem
  further?
 

 This patch series [1] should help. You might want to try it.

 [1] http://lists.freedesktop.org/archives/mesa-dev/2013-March/036214.html

  Doing some debugging, it seems the call to ctx->Driver.ReadPixels() in
  _mesa_ReadnPixelsARB leads to _mesa_readpixels() being called in
 readpix.c.
 
  I'm attaching some output of gdb that will hopefully be useful.
 
  I'm also attaching the debug terminal output of running Weston with the
 DRM
  backend.
 
  Let me know if I can provide other useful information.
 


Re: [Mesa-dev] Very low framerate when recording desktop content in Weston using mesa git on Radeon 5770 (glReadPixels slow path)

2013-03-17 Thread Marek Olšák
Slowness is not usually a bug.

I guess it can be optimized even more. It depends on where the
bottleneck is now.

Marek

On Sun, Mar 17, 2013 at 10:14 PM, Rune Kjær Svendsen
runesv...@gmail.com wrote:
 Thank you very much! This is much better. It's gone from 0.5-ish FPS when
 zooming in to around 10 FPS, depending on screen content.

 So I figure this isn't a bug? I assumed it was a bug, but is the case simply
 that an efficient glReadPixels path for radeon/gallium doesn't exist?

 The patch set sure helps in that regard, although it'd be really nice to get
 30 FPS consistently, if at all possible.

 Thanks again.

 /Rune


 On Sun, Mar 17, 2013 at 6:46 PM, Andreas Boll andreas.boll@gmail.com
 wrote:

 2013/3/17 Rune Kjær Svendsen runesv...@gmail.com:
  Hello list
 
  I'm having problems recording the desktop content using the Weston
  compositor's built-in recording function. When I start a recording and
  do
  something that changes a lot of screen content (like zooming in on the
  desktop, for example), I get around 0.5 FPS. Using sysprof, I can see
  that
  ~98% of CPU is used in the function unpack_XRGB(). krh has told me
  this
  is caused by glReadPixels going through a slowpath. I have a Radeon HD
  5770
  GPU and I'm using mesa git (I've tried the mesa version in the Ubuntu
  12.10
  repos, and the xorg-edgers PPA, same result).
 
  Does anyone know what the issue could be, or how to debug the problem
  further?
 

 This patch series [1] should help. You might want to try it.

 [1] http://lists.freedesktop.org/archives/mesa-dev/2013-March/036214.html

  Doing some debugging, it seems the call to ctx->Driver.ReadPixels() in
  _mesa_ReadnPixelsARB leads to _mesa_readpixels() being called in
  readpix.c.
 
  I'm attaching some output of gdb that will hopefully be useful.
 
  I'm also attaching the debug terminal output of running Weston with the
  DRM
  backend.
 
  Let me know if I can provide other useful information.
 