Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-07 Thread James Xiong
On Sat, 5 May 2018 09:11:28 +0100
Chris Wilson  wrote:

> Quoting James Xiong (2018-05-05 01:56:01)
> > This series aligns the buffer size up to a page instead of a bucket
> > size to improve memory allocation efficiency.
> 
> It doesn't though. It still retrieves up to the bucket size, so with a
> little cache poisoning (or a series of unfortunate events) it will be
> no better than before.
> 
> Perhaps open with the problem statement. What is it you are trying to
> fix? Would adding metrics to the buffer cache be a good start to
> demonstrating what needs improving?
> -Chris
In the worst case, it is the same as before; in the best case, however,
it reduces the allocated size by about 25% of the requested size. In
the real-world results, the best case was the gl_5_off test, where the
patch saved 102749K (100+MB) out of 1143593K; the worst case saved only
64K out of 12730K.

The current implementation allocates 0% to 25% more memory than the
requested size, with or without reuse enabled. I am trying to reduce
that memory penalty.


Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-07 Thread Eero Tamminen

Hi,

On 05.05.2018 03:56, James Xiong wrote:

From: "Xiong, James" 

With the current implementation, brw_bufmgr may round up a request
size to the next bucket size, resulting in up to 25% more memory
allocated in the worst scenario. For example:

Request size     Actual size
32KB + 1 byte    40KB
...
8MB + 1 byte     10MB
...
96MB + 1 byte    112MB

This series aligns the buffer size up to a page instead of a bucket
size to improve memory allocation efficiency.
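For illustration, a minimal C sketch of the two rounding strategies
being compared; the helper names are made up and the bucket progression
is inferred from the 32KB+1 / 8MB+1 / 96MB+1 examples above, not taken
from the actual brw_bufmgr code:

#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical helpers, not the real brw_bufmgr functions.
 *
 * Buckets grow as 1, 2, 3, 4 pages, then in steps of 1/4 of the
 * preceding power of two (20K, 24K, 28K, 32K, 40K, 48K, ..., 8M, 10M,
 * 12M, ...), so the worst case over-allocates about 25% (e.g. a
 * 32KB+1 byte request lands in the 40KB bucket). */
static uint64_t
bucket_size_for(uint64_t size)
{
   if (size <= 4 * PAGE_SIZE)
      return (size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;

   uint64_t base = PAGE_SIZE;
   while (base * 2 < size)
      base *= 2;
   /* base < size <= 2 * base: pick the next of base * {1.25, 1.5, 1.75, 2}. */
   for (unsigned step = 1; step <= 4; step++) {
      if (size <= base + base * step / 4)
         return base + base * step / 4;
   }
   return size; /* not reached */
}

/* What the series proposes instead: round the allocation itself up to
 * a whole page, so at most PAGE_SIZE - 1 bytes are wasted per buffer. */
static uint64_t
page_aligned_size(uint64_t size)
{
   return (size + PAGE_SIZE - 1) & ~(uint64_t)(PAGE_SIZE - 1);
}

For a request of 8MB + 1 byte, bucket_size_for() returns 10MB while
page_aligned_size() returns 8MB + 4KB, which is where the "about 25%"
best-case saving comes from.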

Performance and memory usage were measured on a gen9 platform using
Basemark ES3, GfxBench 4 and 5; each test case was run 6 times.

Basemark ES3
             score                             peak memory size(KB)
   before      after       diff               before  after   diff
   max  avg    max  avg    max     avg
   22   21     23   21     2.83%   1.21%      409928  395573  -14355
   20   20     20   20     0.53%   0.41%


Thanks for the new data!

As the values below seem similar to what you sent earlier, I assume
the tests are listed here in the same order, i.e.:


GfxBench 4.0
              >               score                          peak memory size(KB)
              > before      after       diff         before  after   diff
              > max   avg   max   avg   max     avg
gl_4          >  584   577   586   583   0.45%   1.02%  566489  539699  -26791
manhattan     > 1604  1144  1650  1202   2.81%   4.86%  439220  411596  -27624
gl_trex       > 2711        2718  2152   0.25%  -3.25%  126065  121398   -4667
gl_alu2       > 1218  1213  1212  1154  -0.53%  -5.10%   54153   53868    -285
driver2       >  106   104   106   103   0.85%  -1.66%   12730   12666     -64
gl_4_off      >  728   727   727   726  -0.03%  -0.16%  614730  586794  -27936
manhattan_off > 1732  1709  1740  1728   0.49%   1.11%  475716  447726  -27990
gl_trex_off   > 3051  2969  3066  3047   0.50%   2.55%  154169  148962   -5207
gl_alu2_off   > 2626  2607  2626  2625   0.00%   0.70%   84119   83150    -969
driver2_off   >  211   208   208   205  -1.26%  -1.21%   39924   39667    -257


GfxBench 5.0
              >              score                        peak memory size(KB)
              > before     after      diff          before   after     diff
              > max  avg   max  avg   max     avg
gl_5          > 260  258   259  256  -0.39%  -0.85%  1111037  1013520   -97517
gl_5_off      > 298  295   298  297   0.00%   0.45%  1143593  1040844  -102749


As expected, max gives more stable results than average.

There could be a performance improvement in Manhattan v3.0. At least it
had the largest peak memory usage saving in GfxBench v4, both
absolutely and relatively (6%).

The gl_alu2 onscreen average drop also seems suspiciously large, but as
it's not visible in the max value, in alu2 offscreen, or in your
previous test, I think it's just random variation.

In light of what I know of these tests' variance on TDP-limited
devices, I think the rest of your GfxBench v4 & v5 performance changes
also fall within random variance.



- Eero



Xiong, James (4):
   i965/drm: Reorganize code for the next patch
   i965/drm: Round down buffer size and calculate the bucket index
   i965/drm: Searching for a cached buffer for reuse
   i965/drm: Purge the bucket when its cached buffer is evicted

  src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
  src/util/list.h|   5 ++
  2 files changed, 79 insertions(+), 65 deletions(-)



Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-05 Thread Chris Wilson
Quoting James Xiong (2018-05-05 01:56:01)
> This series aligns the buffer size up to a page instead of a bucket
> size to improve memory allocation efficiency.

It doesn't though. It still retrieves up to the bucket size, so with a
little cache poisoning (or a series of unfortunate events) it will be no
better than before.

Perhaps open with the problem statement. What is it you are trying to
fix? Would adding metrics to the buffer cache be a good start to
demonstrating what needs improving?
-Chris
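As a concrete example of the kind of metrics suggested above, here is a
minimal, hypothetical sketch; none of these fields or helpers exist in
brw_bufmgr today, the names are invented:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping, embedded in the bufmgr and dumped at
 * context destruction or on demand. */
struct bo_cache_stats {
   uint64_t alloc_requests;   /* total buffer allocation requests      */
   uint64_t cache_hits;       /* requests served from a cache bucket   */
   uint64_t cache_misses;     /* requests that went to the kernel      */
   uint64_t requested_bytes;  /* sum of caller-requested sizes         */
   uint64_t allocated_bytes;  /* sum of sizes actually handed out      */
};

/* Call wherever a buffer request is resolved; 'from_cache' says whether
 * an existing cached BO was reused. */
static void
bo_cache_stats_note(struct bo_cache_stats *s, uint64_t requested,
                    uint64_t allocated, bool from_cache)
{
   s->alloc_requests++;
   if (from_cache)
      s->cache_hits++;
   else
      s->cache_misses++;
   s->requested_bytes += requested;
   s->allocated_bytes += allocated;   /* allocated - requested = waste */
}

Dumping hits/misses and the requested vs. allocated totals would
quantify both the reuse rate and the rounding waste the series targets.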


[Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-04 Thread James Xiong
From: "Xiong, James" 

With the current implementation, brw_bufmgr may round up a request
size to the next bucket size, resulting in up to 25% more memory
allocated in the worst scenario. For example:

Request size     Actual size
32KB + 1 byte    40KB
...
8MB + 1 byte     10MB
...
96MB + 1 byte    112MB

This series aligns the buffer size up to a page instead of a bucket
size to improve memory allocation efficiency.

Performance and memory usage were measured on a gen9 platform using
Basemark ES3, GfxBench 4 and 5; each test case was run 6 times.

Basemark ES3
             score                             peak memory size(KB)
   before      after       diff               before  after   diff
   max  avg    max  avg    max     avg
   22   21     23   21     2.83%   1.21%      409928  395573  -14355
   20   20     20   20     0.53%   0.41%

GfxBench 4.0
              score                           peak memory size(KB)
before      after       diff          before  after   diff
max   avg   max   avg   max     avg
 584   577   586   583   0.45%   1.02%  566489  539699  -26791
 728   727   727   726  -0.03%  -0.16%  614730  586794  -27936
1604  1144  1650  1202   2.81%   4.86%  439220  411596  -27624
2711        2718  2152   0.25%  -3.25%  126065  121398   -4667
1218  1213  1212  1154  -0.53%  -5.10%   54153   53868    -285
 106   104   106   103   0.85%  -1.66%   12730   12666     -64
1732  1709  1740  1728   0.49%   1.11%  475716  447726  -27990
3051  2969  3066  3047   0.50%   2.55%  154169  148962   -5207
2626  2607  2626  2625   0.00%   0.70%   84119   83150    -969
 211   208   208   205  -1.26%  -1.21%   39924   39667    -257


GfxBench 5.0
             score                        peak memory size(KB)
before     after      diff          before   after     diff
max  avg   max  avg   max     avg
260  258   259  256  -0.39%  -0.85%  1111037  1013520   -97517
298  295   298  297   0.00%   0.45%  1143593  1040844  -102749

Xiong, James (4):
  i965/drm: Reorganize code for the next patch
  i965/drm: Round down buffer size and calculate the bucket index
  i965/drm: Searching for a cached buffer for reuse
  i965/drm: Purge the bucket when its cached buffer is evicted

 src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
 src/util/list.h|   5 ++
 2 files changed, 79 insertions(+), 65 deletions(-)

-- 
2.7.4



Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-03 Thread Eero Tamminen

Hi,

On 02.05.2018 21:19, James Xiong wrote:

On Wed, 2 May 2018 14:18:21 +0300
Eero Tamminen  wrote:

[...]

You're missing information on:
* On which platform you did the testing (affects variance)
* how many test rounds you ran, and
* what your variance is


I ran these tests on a gen9 platform with Ubuntu 17.10.


If it's TDP-limited in 3D tests (like all NUC and e.g. Broxton devices
seem to be in long-running tests), it has clearly higher variance than
non-TDP (or temperature) limited desktop platforms.



Most of the tests are consistent, especially the memory usage. The only
exception is GfxBench 4.0 gl_manhattan; I had to run it 3 times and
pick the highest result. I will apply this method to all tests and
re-send with updated results.


(comments below are about FPS results, not memory usage.)

Performance of many GPU-bound tests doesn't follow a normal Gaussian
distribution, but has two (tight) peaks.  On our BXT machines these
peaks are currently, e.g. in the GfxBench Manhattan tests, *3%* apart
from each other.

While you can get results from both performance peaks, whether your
results fall onto either of these peaks is more likely to change
between boots (I think due to alignment changes in kernel memory
allocations) than between successive runs.
-> Your results may have less chance of being misleading if you
   don't reboot when switching between the Mesa version with your
   patch and the one without.

Especially if you're running tests only on one machine (i.e. you don't
have extra data from other machines against which you can correlate
results), I think you need more than 3 runs, both with and without
your patch.

While max() can provide a better comparison than avg() for this kind of
bimodal result distribution, you should still calculate and provide the
variance for your data with your patches.
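A minimal sketch of the statistics being asked for here; the function
name is illustrative and at least two runs (n >= 2) are assumed:

#include <stdio.h>

static void
report_runs(const double *score, int n)
{
   double max = score[0], mean = 0.0, var = 0.0;

   for (int i = 0; i < n; i++) {
      if (score[i] > max)
         max = score[i];
      mean += score[i];
   }
   mean /= n;

   for (int i = 0; i < n; i++)
      var += (score[i] - mean) * (score[i] - mean);
   var /= n - 1;   /* sample variance; report alongside max and mean */

   printf("max %.1f  mean %.1f  variance %.2f\n", max, mean, var);
}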


- Eero


Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-02 Thread James Xiong
On Wed, 2 May 2018 14:18:21 +0300
Eero Tamminen  wrote:

> Hi,
> 
> On 02.05.2018 02:25, James Xiong wrote:
> > From: "Xiong, James" 
> > 
> > With the current implementation, brw_bufmgr may round up a request
> > size to the next bucket size, resulting in up to 25% more memory
> > allocated in the worst scenario. For example:
> >
> > Request size     Actual size
> > 32KB + 1 byte    40KB
> > ...
> > 8MB + 1 byte     10MB
> > ...
> > 96MB + 1 byte    112MB
> >
> > This series aligns the buffer size up to a page instead of a bucket
> > size to improve memory allocation efficiency. Performance is almost
> > the same with Basemark ES3, GfxBench 4 and 5:
> > 
> > Basemark ES3
> >    score                           peak memory allocation
> >    before     after      diff      before     after      diff
> > 21.537462  21.888784  1.61%    419766272  408809472  -10956800
> > 19.566198  19.763429  1.00%
> 
> What memory are you measuring:
> 
> * VmSize (not that relevant unless you're running out of address
> space)?
> 
> * PrivateDirty (listed in /proc/PID/smaps and e.g. by "smem" tool
> [1])?
> 
> * total of allocation sizes used by Mesa?
> 
> Or something else?
> 
> In general, unused memory isn't much of a problem, only dirty
> (written) memory.  The kernel maps all unused memory to a single zero
> page, so unused memory takes only a few bytes of RAM for the page
> table entries (required for tracking the allocation pages).
I did the measurements in brw_bufmgr from user space: I kept track of
the allocated size for each brw_bufmgr context and printed out the peak
allocated size when the test completed and the context was destroyed.
Basically, I increased/decreased the size when I915_GEM_CREATE or
GEM_CLOSE was called, so cached, imported, or user_ptr buffers were
excluded.

The brw_bufmgr context is created when the test starts and destroyed
after it completes, and the size is reported per test case, in bytes.
This method measures the exact size allocated for a given test case, so
the result is precise.
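An illustrative sketch of the measurement described above, with
invented names (not the instrumentation that was actually used):

#include <assert.h>
#include <stdint.h>

/* Hypothetical per-bufmgr counters. */
struct alloc_tracker {
   uint64_t current;   /* bytes currently backed by I915_GEM_CREATE   */
   uint64_t peak;      /* high-water mark, printed at bufmgr teardown */
};

/* Bump next to every successful DRM_IOCTL_I915_GEM_CREATE... */
static void
track_gem_create(struct alloc_tracker *t, uint64_t size)
{
   t->current += size;
   if (t->current > t->peak)
      t->peak = t->current;
}

/* ...and drop next to DRM_IOCTL_GEM_CLOSE.  Reusing a cached BO never
 * calls GEM_CREATE again, and imported/userptr buffers don't go through
 * it at all, so they stay out of the count. */
static void
track_gem_close(struct alloc_tracker *t, uint64_t size)
{
   assert(t->current >= size);
   t->current -= size;
}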
> 
> 
> > GfxBench 4.0
> >                          score                            peak memory
> >                   before           after            diff    before     after      diff
> > gl_4              564.6052246094   565.2348632813   0.11%   578490368  550199296  -28291072
> > gl_4_off          727.0440063477   703.5833129883  -3.33%   629501952  598216704  -31285248
> > gl_manhattan     1053.4223632813  1057.3690185547   0.37%   449568768  421134336  -28434432
> > gl_trex          2708.0656738281  2699.2646484375  -0.33%   130076672  125042688   -5033984
> > gl_alu2          1207.1490478516  1212.2220458984   0.42%    55496704   55029760    -466944
> > gl_driver2        103.0383071899   103.5478439331   0.49%    13107200   12980224    -126976
> > gl_manhattan_off 1703.4780273438  1736.9074707031   1.92%   490016768  456548352  -33468416
> > gl_trex_off      2951.6809082031  3058.5422363281   3.49%   157511680  152260608   -5251072
> > gl_alu2_off      2604.0903320313  2626.2524414063   0.84%    86130688   85483520    -647168
> > gl_driver2_off    204.0173187256   207.0510101318   1.47%    40869888   40615936    -253952
> 
> You're missing information on:
> * On which platform you did the testing (affects variance)
> * how many test rounds you ran, and
> * what your variance is
I ran these tests on a gen9 platform with Ubuntu 17.10. Most of the
tests are consistent, especially the memory usage. The only exception
is GfxBench 4.0 gl_manhattan; I had to run it 3 times and pick the
highest result. I will apply this method to all tests and re-send with
updated results.
> 
> -> I don't know whether your numbers are just random noise.  
> 
> 
> Memory is allocated in pages from the kernel, so there's no point in
> showing its usage in bytes.  Please use KB; that's more readable.
> 
> (Because of randomness, e.g. interactions with the windowing system,
> there can be some variance also in process memory usage, which may
> also be useful to report.)
> 
> Because of variance, you don't need that many decimals for the
> scores.  Removing the extra ones makes that data a bit more readable
> too.
> 
> 
>   - Eero
> 
> [1] "smem" is a Python-based tool available at least in Debian.
> If you want something simpler, e.g. a shell script working with
> minimal shells like Busybox, you can use this:
> https://github.com/maemo-tools-old/sp-memusage/blob/master/scripts/mem-smaps-private
> 
> 
> > GfxBench 5.0
> >            score          peak memory
> >           before  after   before      after       diff
> > gl_5      259     259     1137549312  1038286848  -99262464
> > gl_5_off  297     297     1170853888  1071357952  -99495936
> > 
> > Xiong, James (4):
> >i965/drm: Reorganize code for the next patch
> >i965/drm: Round down buffer size and calculate the bucket index
> >i965/drm: Searching for a cached buffer for reuse
> >i965/drm: Purge the bucket when its cached buffer is evicted
> > 
> >   

Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-02 Thread Eero Tamminen

Hi,

On 02.05.2018 02:25, James Xiong wrote:

From: "Xiong, James" 

With the current implementation, brw_bufmgr may round up a request
size to the next bucket size, resulting in up to 25% more memory
allocated in the worst scenario. For example:

Request size     Actual size
32KB + 1 byte    40KB
...
8MB + 1 byte     10MB
...
96MB + 1 byte    112MB

This series aligns the buffer size up to a page instead of a bucket
size to improve memory allocation efficiency. Performance is almost
the same with Basemark ES3, GfxBench 4 and 5:

Basemark ES3
   score                           peak memory allocation
   before     after      diff      before     after      diff
21.537462  21.888784  1.61%    419766272  408809472  -10956800
19.566198  19.763429  1.00%


What memory are you measuring:

* VmSize (not that relevant unless you're running out of address space)?

* PrivateDirty (listed in /proc/PID/smaps and e.g. by "smem" tool [1])?

* total of allocation sizes used by Mesa?

Or something else?

In general, unused memory isn't much of a problem, only dirty (written)
memory.  The kernel maps all unused memory to a single zero page, so
unused memory takes only a few bytes of RAM for the page table entries
(required for tracking the allocation pages).
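A small demo of the point above, assuming Linux: watch VmRSS in
/proc/self/status before and after the memset; the untouched mapping
costs almost nothing until it is written.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
   /* Assumes Linux; MAP_ANONYMOUS memory is backed by the zero page
    * until written. */
   const size_t size = 256 * 1024 * 1024;
   char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   if (p == MAP_FAILED)
      return 1;

   /* VmSize has grown by 256MB, VmRSS has barely moved. */
   puts("mapped, press enter");
   getchar();

   memset(p, 0xaa, size);   /* pages become dirty and now consume RAM */
   puts("written, press enter");
   getchar();

   munmap(p, size);
   return 0;
}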




GfxBench 4.0
                      score                            peak memory
                  before           after            diff    before     after      diff
gl_4              564.6052246094   565.2348632813   0.11%   578490368  550199296  -28291072
gl_4_off          727.0440063477   703.5833129883  -3.33%   629501952  598216704  -31285248
gl_manhattan     1053.4223632813  1057.3690185547   0.37%   449568768  421134336  -28434432
gl_trex          2708.0656738281  2699.2646484375  -0.33%   130076672  125042688   -5033984
gl_alu2          1207.1490478516  1212.2220458984   0.42%    55496704   55029760    -466944
gl_driver2        103.0383071899   103.5478439331   0.49%    13107200   12980224    -126976
gl_manhattan_off 1703.4780273438  1736.9074707031   1.92%   490016768  456548352  -33468416
gl_trex_off      2951.6809082031  3058.5422363281   3.49%   157511680  152260608   -5251072
gl_alu2_off      2604.0903320313  2626.2524414063   0.84%    86130688   85483520    -647168
gl_driver2_off    204.0173187256   207.0510101318   1.47%    40869888   40615936    -253952


You're missing information on:
* On which platform you did the testing (affects variance)
* how many test rounds you ran, and
* what your variance is

-> I don't know whether your numbers are just random noise.


Memory is allocated in pages from the kernel, so there's no point in
showing its usage in bytes.  Please use KB; that's more readable.


(Because of randomness, e.g. interactions with the windowing system,
there can be some variance also in process memory usage, which may
also be useful to report.)

Because of variance, you don't need that many decimals for the scores.
Removing the extra ones makes that data a bit more readable too.



- Eero

[1] "smem" is a Python-based tool available at least in Debian.
If you want something simpler, e.g. a shell script working with
minimal shells like Busybox, you can use this:
https://github.com/maemo-tools-old/sp-memusage/blob/master/scripts/mem-smaps-private



GfxBench 5.0
           score          peak memory
          before  after   before      after       diff
gl_5      259     259     1137549312  1038286848  -99262464
gl_5_off  297     297     1170853888  1071357952  -99495936

Xiong, James (4):
   i965/drm: Reorganize code for the next patch
   i965/drm: Round down buffer size and calculate the bucket index
   i965/drm: Searching for a cached buffer for reuse
   i965/drm: Purge the bucket when its cached buffer is evicted

  src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
  src/util/list.h|   5 ++
  2 files changed, 79 insertions(+), 65 deletions(-)





[Mesa-dev] [PATCH 0/4] improve buffer cache and reuse

2018-05-01 Thread James Xiong
From: "Xiong, James" 

With the current implementation, brw_bufmgr may round up a request
size to the next bucket size, resulting in up to 25% more memory
allocated in the worst scenario. For example:

Request size     Actual size
32KB + 1 byte    40KB
...
8MB + 1 byte     10MB
...
96MB + 1 byte    112MB

This series aligns the buffer size up to a page instead of a bucket
size to improve memory allocation efficiency. Performance is almost
the same with Basemark ES3, GfxBench 4 and 5:

Basemark ES3
   score                           peak memory allocation
   before     after      diff      before     after      diff
21.537462  21.888784  1.61%    419766272  408809472  -10956800
19.566198  19.763429  1.00%

GfxBench 4.0
                      score                            peak memory
                  before           after            diff    before     after      diff
gl_4              564.6052246094   565.2348632813   0.11%   578490368  550199296  -28291072
gl_4_off          727.0440063477   703.5833129883  -3.33%   629501952  598216704  -31285248
gl_manhattan     1053.4223632813  1057.3690185547   0.37%   449568768  421134336  -28434432
gl_trex          2708.0656738281  2699.2646484375  -0.33%   130076672  125042688   -5033984
gl_alu2          1207.1490478516  1212.2220458984   0.42%    55496704   55029760    -466944
gl_driver2        103.0383071899   103.5478439331   0.49%    13107200   12980224    -126976
gl_manhattan_off 1703.4780273438  1736.9074707031   1.92%   490016768  456548352  -33468416
gl_trex_off      2951.6809082031  3058.5422363281   3.49%   157511680  152260608   -5251072
gl_alu2_off      2604.0903320313  2626.2524414063   0.84%    86130688   85483520    -647168
gl_driver2_off    204.0173187256   207.0510101318   1.47%    40869888   40615936    -253952

GfxBench 5.0
           score          peak memory
          before  after   before      after       diff
gl_5      259     259     1137549312  1038286848  -99262464
gl_5_off  297     297     1170853888  1071357952  -99495936

Xiong, James (4):
  i965/drm: Reorganize code for the next patch
  i965/drm: Round down buffer size and calculate the bucket index
  i965/drm: Searching for a cached buffer for reuse
  i965/drm: Purge the bucket when its cached buffer is evicted

 src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
 src/util/list.h|   5 ++
 2 files changed, 79 insertions(+), 65 deletions(-)

-- 
2.7.4
