Re: [Mesa-dev] [PATCH 3/3] i965/fs: Combine tex/fb_write operations (opt)

2015-04-12 Thread Pohjolainen, Topi
On Fri, Apr 10, 2015 at 12:52:04PM -0700, Ben Widawsky wrote:
 Certain platforms support the ability to sample from a texture, and write it 
 out
 to the file RT - thus saving a costly send instructions (note that this is a
 potnential win if one wanted to backport to a tag that didn't have the patch
 from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2),
 
 v2: Modify the algorithm. Instead of iterating in reverse through blocks and
 insts, since the last block/inst is the only thing which can benefit. Rebased
 on top of Ken's patching modifying is_last_send
 
 v3: Rebased over almost 2 months, and Incorporated feedback from Matt:
 Some comment typo fixes and rewordings.
 Whitespace
 Move the optimization pass outside of the optimize loop
 
 v4: Some cosmetic changes requested from Ken. These changes ensured that the
 optimization function always returned true when an optimization occurred, and
 false when one did not. This behavior did not exist with the original patch. 
 As
 a result, having the separate helper function which Matt did not like no 
 longer
 made sense, and so now I believe everyone should be happy.
 
 Braswell data:
 Benchmark (n=20)   %diff
 *OglBatch5 -1.4
 *OglBatch7 -1.79
 OglFillTexMulti5.57
 OglFillTexSingle   1.16
 OglShMapPcf0.05
 OglTexFilterAniso  3.01
 OglTexFilterTri1.94
 
 SKL data:
 NONE COLLECTED
 
 No piglit regressions:
 (http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/)
 
 [*] I believe my measurements are incorrect for Batch5-7. If I add this new
 optimization, but never emit the new instruction I see similar results.

I'm seeing ~7% (with 95% confidence) decrease in OglBatch6/7 when I'm
launching resolve clears with the light-weight mechanism provided by blorp.
This may be totally unrelated but lets see if I get any smarter.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 3/3] i965/fs: Combine tex/fb_write operations (opt)

2015-04-12 Thread Pohjolainen, Topi
On Sun, Apr 12, 2015 at 10:02:03AM +0300, Pohjolainen, Topi wrote:
 On Fri, Apr 10, 2015 at 12:52:04PM -0700, Ben Widawsky wrote:
  Certain platforms support the ability to sample from a texture, and write 
  it out
  to the file RT - thus saving a costly send instructions (note that this is a
  potnential win if one wanted to backport to a tag that didn't have the patch
  from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2),
  
  v2: Modify the algorithm. Instead of iterating in reverse through blocks and
  insts, since the last block/inst is the only thing which can benefit. 
  Rebased
  on top of Ken's patching modifying is_last_send
  
  v3: Rebased over almost 2 months, and Incorporated feedback from Matt:
  Some comment typo fixes and rewordings.
  Whitespace
  Move the optimization pass outside of the optimize loop
  
  v4: Some cosmetic changes requested from Ken. These changes ensured that the
  optimization function always returned true when an optimization occurred, 
  and
  false when one did not. This behavior did not exist with the original 
  patch. As
  a result, having the separate helper function which Matt did not like no 
  longer
  made sense, and so now I believe everyone should be happy.
  
  Braswell data:
  Benchmark (n=20)   %diff
  *OglBatch5 -1.4
  *OglBatch7 -1.79
  OglFillTexMulti5.57
  OglFillTexSingle   1.16
  OglShMapPcf0.05
  OglTexFilterAniso  3.01
  OglTexFilterTri1.94
  
  SKL data:
  NONE COLLECTED
  
  No piglit regressions:
  (http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/)
  
  [*] I believe my measurements are incorrect for Batch5-7. If I add this new
  optimization, but never emit the new instruction I see similar results.
 
 I'm seeing ~7% (with 95% confidence) decrease in OglBatch6/7 when I'm
 launching resolve clears with the light-weight mechanism provided by blorp.
 This may be totally unrelated but lets see if I get any smarter.

I let OglBatch6 run for some time (160 rounds each), and I get:

x /mnt/before
+ /mnt/after
+--+
|   +   x  |
|   +   x  |
|   +   x  x   |
|   +   +x  x   x  x   |
|   +   +  x x  x   x  x   |
|   +   +   ++ x xx x   xx x   |
|   + *++ * +* x*xx+xx     |
| +  +  + **+ *x+*+x**x+x** x  |
| +  ++ ++*** *x+*+***x+x** x  |
| +  +***+**+***x** xx |
|+   +  +++ ++**++**x**xxx x++ x   |
|+  +   ++   ** *+***+*+*+**+**x*x***x+*** * xx*+ x|
||__|AM___A__|_|   |
+--+
N   Min   MaxMedian   AvgStddev
x 160   102.365   122.348   113.472 113.21107 3.6714446
+ 160   93.4825   121.597   110.289 110.03581 4.3771895
Difference at 95.0% confidence
-3.17526 +/- 0.885251
-2.80473% +/- 0.781947%
(Student's t, pooled s = 4.03976)


I'm not sure if one can really conclude much from this, I would almost claim
that my changes just introduce more fluctuation in the fps numbers but nothing
else.

I examined what callgrind tells me. Both master and meta-blorp got the same
amount of frames rendered while the latter does a little less work with
cpu to achieve this. The latter also submits slightly less work for the GPU
since clears are executed without the vertex shader stage. Hence I can't
really explain why it should be any slower.

So if I were you I probably wouldn't worry too much about your results.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH 3/3] i965/fs: Combine tex/fb_write operations (opt)

2015-04-10 Thread Ben Widawsky
Certain platforms support the ability to sample from a texture, and write it out
to the file RT - thus saving a costly send instructions (note that this is a
potnential win if one wanted to backport to a tag that didn't have the patch
from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2),

v2: Modify the algorithm. Instead of iterating in reverse through blocks and
insts, since the last block/inst is the only thing which can benefit. Rebased
on top of Ken's patching modifying is_last_send

v3: Rebased over almost 2 months, and Incorporated feedback from Matt:
Some comment typo fixes and rewordings.
Whitespace
Move the optimization pass outside of the optimize loop

v4: Some cosmetic changes requested from Ken. These changes ensured that the
optimization function always returned true when an optimization occurred, and
false when one did not. This behavior did not exist with the original patch. As
a result, having the separate helper function which Matt did not like no longer
made sense, and so now I believe everyone should be happy.

Braswell data:
Benchmark (n=20)   %diff
*OglBatch5 -1.4
*OglBatch7 -1.79
OglFillTexMulti5.57
OglFillTexSingle   1.16
OglShMapPcf0.05
OglTexFilterAniso  3.01
OglTexFilterTri1.94

SKL data:
NONE COLLECTED

No piglit regressions:
(http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/)

[*] I believe my measurements are incorrect for Batch5-7. If I add this new
optimization, but never emit the new instruction I see similar results.

Cc: Matt Turner matts...@gmail.com
Cc: Kenneth Graunke kenn...@whitecape.org
Cc: Jason Ekstrand ja...@jlekstrand.net
Signed-off-by: Ben Widawsky b...@bwidawsk.net
---
 src/mesa/drivers/dri/i965/brw_fs.cpp   | 92 ++
 src/mesa/drivers/dri/i965/brw_fs.h |  3 +
 src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 12 
 3 files changed, 107 insertions(+)

diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
b/src/mesa/drivers/dri/i965/brw_fs.cpp
index 72000cf..d56d053 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -2557,6 +2557,96 @@ fs_visitor::opt_algebraic()
return progress;
 }
 
+/**
+ * Optimize sample messages which are followed by the final RT write.
+ *
+ * CHV, and GEN9+ can mark a texturing SEND instruction with EOT to have its
+ * results sent directly to the framebuffer, bypassing the EU.  Recognize the
+ * final texturing results copied to the framebuffer write payload and modify
+ * them to write to the framebuffer directly.
+ */
+bool
+fs_visitor::opt_sampler_eot()
+{
+   brw_wm_prog_key *key = (brw_wm_prog_key*) this-key;
+
+   if (brw-gen  9  !brw-is_cherryview)
+  return false;
+
+   /* FINISHME: It should be possible to implement this optimization when there
+* are multiple drawbuffers.
+*/
+   if (key-nr_color_regions != 1)
+  return false;
+
+   /* Look for a texturing instruction immediately before the final FB_WRITE. 
*/
+   fs_inst *fb_write = (fs_inst *) cfg-blocks[cfg-num_blocks - 1]-end();
+   assert(fb_write-eot);
+   assert(fb_write-opcode == FS_OPCODE_FB_WRITE);
+
+   if (unlikely(fb_write-is_head_sentinel()))
+   return false;
+
+   fs_inst *tex_inst = (fs_inst *) fb_write-prev;
+
+   /* There wasn't one; nothing to do. */
+   if (unlikely(tex_inst-is_head_sentinel()) || !tex_inst-is_tex())
+  return false;
+
+   /* If there's no header present, we need to munge the LOAD_PAYLOAD as well.
+* It's very likely to be the previous instruction.
+*/
+   fs_inst *load_payload = (fs_inst *) tex_inst-prev;
+   if (load_payload-is_head_sentinel() ||
+   load_payload-opcode != SHADER_OPCODE_LOAD_PAYLOAD)
+  return false;
+
+   assert(!tex_inst-eot); /* We can't get here twice */
+   assert((tex_inst-offset  (0xff  24)) == 0);
+
+   tex_inst-offset |= fb_write-target  24;
+   tex_inst-eot = true;
+   fb_write-remove(cfg-blocks[cfg-num_blocks - 1]);
+
+   /* If a header is present, marking the eot is sufficient. Otherwise, we need
+* to create a new LOAD_PAYLOAD command with the same sources and a space
+* saved for the header. Using a new destination register not only makes 
sure
+* we have enough space, but it will make sure the dead code eliminator 
kills
+* the instruction that this will replace.
+*/
+   if (tex_inst-header_present)
+  return true;
+
+   fs_reg send_header = vgrf(load_payload-sources + 1);
+   fs_reg *new_sources =
+  ralloc_array(mem_ctx, fs_reg, load_payload-sources + 1);
+
+   new_sources[0] = fs_reg();
+   for (int i = 0; i  load_payload-sources; i++)
+  new_sources[i+1] = load_payload-src[i];
+
+   /* The LOAD_PAYLOAD helper seems like the obvious choice here. However, it
+* requires a lot of information about the sources to appropriately figure
+* out the number of registers needed to be used. Given this stage in our
+* optimization, we may not have the appropriate GRFs 

Re: [Mesa-dev] [PATCH 3/3] i965/fs: Combine tex/fb_write operations (opt)

2015-04-10 Thread Matt Turner
On Fri, Apr 10, 2015 at 12:52 PM, Ben Widawsky
benjamin.widaw...@intel.com wrote:
 Certain platforms support the ability to sample from a texture, and write it 
 out
 to the file RT - thus saving a costly send instructions (note that this is a
 potnential win if one wanted to backport to a tag that didn't have the patch
 from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2),

 v2: Modify the algorithm. Instead of iterating in reverse through blocks and
 insts, since the last block/inst is the only thing which can benefit. Rebased
 on top of Ken's patching modifying is_last_send

 v3: Rebased over almost 2 months, and Incorporated feedback from Matt:
 Some comment typo fixes and rewordings.
 Whitespace
 Move the optimization pass outside of the optimize loop

 v4: Some cosmetic changes requested from Ken. These changes ensured that the
 optimization function always returned true when an optimization occurred, and
 false when one did not. This behavior did not exist with the original patch. 
 As
 a result, having the separate helper function which Matt did not like no 
 longer
 made sense, and so now I believe everyone should be happy.

 Braswell data:
 Benchmark (n=20)   %diff
 *OglBatch5 -1.4
 *OglBatch7 -1.79
 OglFillTexMulti5.57
 OglFillTexSingle   1.16
 OglShMapPcf0.05
 OglTexFilterAniso  3.01
 OglTexFilterTri1.94

 SKL data:
 NONE COLLECTED

 No piglit regressions:
 (http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/)

 [*] I believe my measurements are incorrect for Batch5-7. If I add this new
 optimization, but never emit the new instruction I see similar results.

 Cc: Matt Turner matts...@gmail.com
 Cc: Kenneth Graunke kenn...@whitecape.org
 Cc: Jason Ekstrand ja...@jlekstrand.net
 Signed-off-by: Ben Widawsky b...@bwidawsk.net
 ---
  src/mesa/drivers/dri/i965/brw_fs.cpp   | 92 
 ++
  src/mesa/drivers/dri/i965/brw_fs.h |  3 +
  src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 12 
  3 files changed, 107 insertions(+)

 diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
 b/src/mesa/drivers/dri/i965/brw_fs.cpp
 index 72000cf..d56d053 100644
 --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
 +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
 @@ -2557,6 +2557,96 @@ fs_visitor::opt_algebraic()
 return progress;
  }

 +/**
 + * Optimize sample messages which are followed by the final RT write.
 + *
 + * CHV, and GEN9+ can mark a texturing SEND instruction with EOT to have its
 + * results sent directly to the framebuffer, bypassing the EU.  Recognize the
 + * final texturing results copied to the framebuffer write payload and modify
 + * them to write to the framebuffer directly.
 + */
 +bool
 +fs_visitor::opt_sampler_eot()
 +{
 +   brw_wm_prog_key *key = (brw_wm_prog_key*) this-key;
 +
 +   if (brw-gen  9  !brw-is_cherryview)
 +  return false;
 +
 +   /* FINISHME: It should be possible to implement this optimization when 
 there
 +* are multiple drawbuffers.
 +*/
 +   if (key-nr_color_regions != 1)
 +  return false;
 +
 +   /* Look for a texturing instruction immediately before the final 
 FB_WRITE. */
 +   fs_inst *fb_write = (fs_inst *) cfg-blocks[cfg-num_blocks - 1]-end();
 +   assert(fb_write-eot);
 +   assert(fb_write-opcode == FS_OPCODE_FB_WRITE);
 +
 +   if (unlikely(fb_write-is_head_sentinel()))
 +   return false;

This is impossible. The assertions above are already assuming (rightly
so) that it's an actual instruction. Remove these two lines.

 +
 +   fs_inst *tex_inst = (fs_inst *) fb_write-prev;
 +
 +   /* There wasn't one; nothing to do. */
 +   if (unlikely(tex_inst-is_head_sentinel()) || !tex_inst-is_tex())
 +  return false;

This looks good.

 +
 +   /* If there's no header present, we need to munge the LOAD_PAYLOAD as 
 well.
 +* It's very likely to be the previous instruction.
 +*/
 +   fs_inst *load_payload = (fs_inst *) tex_inst-prev;
 +   if (load_payload-is_head_sentinel() ||
 +   load_payload-opcode != SHADER_OPCODE_LOAD_PAYLOAD)
 +  return false;
 +
 +   assert(!tex_inst-eot); /* We can't get here twice */
 +   assert((tex_inst-offset  (0xff  24)) == 0);
 +
 +   tex_inst-offset |= fb_write-target  24;
 +   tex_inst-eot = true;
 +   fb_write-remove(cfg-blocks[cfg-num_blocks - 1]);
 +
 +   /* If a header is present, marking the eot is sufficient. Otherwise, we 
 need
 +* to create a new LOAD_PAYLOAD command with the same sources and a space
 +* saved for the header. Using a new destination register not only makes 
 sure
 +* we have enough space, but it will make sure the dead code eliminator 
 kills
 +* the instruction that this will replace.
 +*/
 +   if (tex_inst-header_present)
 +  return true;
 +
 +   fs_reg send_header = vgrf(load_payload-sources + 1);
 +   fs_reg *new_sources =
 +  ralloc_array(mem_ctx, fs_reg, load_payload-sources + 1);
 +
 +   new_sources[0] = fs_reg();
 +   for (int i = 0; i