Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
On Sat, 5 May 2018 09:11:28 +0100 Chris Wilson wrote:
> Quoting James Xiong (2018-05-05 01:56:01)
> > This series aligns the buffer size up to a page instead of a bucket
> > size to improve memory allocation efficiency.
>
> It doesn't though. It still retrieves up to the bucket size, so with a
> little cache poisoning (or a series of unfortunate events) it will be
> no better than before.
>
> Perhaps open with the problem statement. What is it you are trying to
> fix? Would adding metrics to the buffer cache be a good start to
> demonstrating what needs improving?
> -Chris

In the worst case it is the same as before; in the best case, however, the patch reduces the allocated size by about 25% of the requested size. In the real world, the best case was the gl_5_off test, where the patch saved 102749 KB (100+ MB) out of 1143593 KB; the worst case saved only 64 KB out of 12730 KB. The current implementation allocates 0% to 25% more memory than the requested size, with or without reuse enabled. I am trying to reduce that memory penalty.

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
Hi,

On 05.05.2018 03:56, James Xiong wrote:
> From: "Xiong, James"
>
> With the current implementation, brw_bufmgr may round up a request
> size to the next bucket size, resulting in up to 25% more memory
> allocated in the worst scenario. For example:
>   Request size    Actual size
>   32KB+1Byte      40KB
>   .
>   8MB+1Byte       10MB
>   .
>   96MB+1Byte      112MB
> This series aligns the buffer size up to a page instead of a bucket
> size to improve memory allocation efficiency. Performance and memory
> usage were measured on a gen9 platform using Basemark ES3, GfxBench 4
> and 5; each test case ran 6 times.
>
> Basemark ES3
>             score                        peak memory size(KB)
>         before    after     diff         before  after   diff
>         max  avg  max  avg  max   avg
>         22   21   23   21   2.83% 1.21%  409928  395573  -14355
>         20   20   20   20   0.53% 0.41%

Thanks for the new data!

As the values below seem similar to what you earlier sent, I assume
the tests are listed here in the same order, i.e:

> GfxBench 4.0
>             score                          peak memory size(KB)
>         before      after       diff           before  after   diff
>         max   avg   max   avg   max    avg
gl_4
> 584   577   586   583   0.45%  1.02%  566489  539699  -26791
manhattan
> 1604  1144  1650  1202  2.81%  4.86%  439220  411596  -27624
gl_trex
> 2711  2718  2152        0.25%  -3.25% 126065  121398  -4667
gl_alu2
> 1218  1213  1212  1154  -0.53% -5.10% 54153   53868   -285
driver2
> 106   104   106   103   0.85%  -1.66% 12730   12666   -64
gl_4_off
> 728   727   727   726   -0.03% -0.16% 614730  586794  -27936
manhattan_off
> 1732  1709  1740  1728  0.49%  1.11%  475716  447726  -27990
gl_trex_off
> 3051  2969  3066  3047  0.50%  2.55%  154169  148962  -5207
gl_alu2_off
> 2626  2607  2626  2625  0.00%  0.70%  84119   83150   -969
driver2_off
> 211   208   208   205   -1.26% -1.21% 39924   39667   -257

> GfxBench 5.0
>             score                          peak memory size(KB)
>         before      after       diff           before   after    diff
>         max   avg   max   avg   max    avg
gl_5
> 260   258   259   256   -0.39% -0.85% 037      1013520  -97517
gl_5_off
> 298   295   298   297   0.00%  0.45%  1143593  1040844  -102749

As expected, max gives more stable results than average.

There could be a performance improvement in Manhattan v3.0.
At least it had the largest peak memory usage saving in GfxBench v4, both absolutely and relatively (6%).

The gl_alu2 onscreen average drop seems also suspiciously large, but as it's not visible in the max value, or in alu2 offscreen, or in your previous test, I think it's just random variation.

In light of what I know of these tests' variance on TDP limited devices, I think the rest of your GfxBench v4 & v5 performance changes also fall within random variance.

	- Eero

> Xiong, James (4):
>   i965/drm: Reorganize code for the next patch
>   i965/drm: Round down buffer size and calculate the bucket index
>   i965/drm: Searching for a cached buffer for reuse
>   i965/drm: Purge the bucket when its cached buffer is evicted
>
>  src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
>  src/util/list.h                        |   5 ++
>  2 files changed, 79 insertions(+), 65 deletions(-)
Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
Quoting James Xiong (2018-05-05 01:56:01)
> This series aligns the buffer size up to a page instead of a bucket
> size to improve memory allocation efficiency.

It doesn't though. It still retrieves up to the bucket size, so with a
little cache poisoning (or a series of unfortunate events) it will be
no better than before.

Perhaps open with the problem statement. What is it you are trying to
fix? Would adding metrics to the buffer cache be a good start to
demonstrating what needs improving?
-Chris
[Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
From: "Xiong, James"

With the current implementation, brw_bufmgr may round up a request
size to the next bucket size, resulting in up to 25% more memory
allocated in the worst scenario. For example:
  Request size    Actual size
  32KB+1Byte      40KB
  .
  8MB+1Byte       10MB
  .
  96MB+1Byte      112MB
This series aligns the buffer size up to a page instead of a bucket
size to improve memory allocation efficiency. Performance and memory
usage were measured on a gen9 platform using Basemark ES3, GfxBench 4
and 5; each test case ran 6 times.

Basemark ES3
            score                        peak memory size(KB)
        before    after     diff         before  after   diff
        max  avg  max  avg  max   avg
        22   21   23   21   2.83% 1.21%  409928  395573  -14355
        20   20   20   20   0.53% 0.41%

GfxBench 4.0
            score                          peak memory size(KB)
        before      after       diff           before  after   diff
        max   avg   max   avg   max    avg
        584   577   586   583   0.45%  1.02%  566489  539699  -26791
        728   727   727   726   -0.03% -0.16% 614730  586794  -27936
        1604  1144  1650  1202  2.81%  4.86%  439220  411596  -27624
        2711  2718  2152        0.25%  -3.25% 126065  121398  -4667
        1218  1213  1212  1154  -0.53% -5.10% 54153   53868   -285
        106   104   106   103   0.85%  -1.66% 12730   12666   -64
        1732  1709  1740  1728  0.49%  1.11%  475716  447726  -27990
        3051  2969  3066  3047  0.50%  2.55%  154169  148962  -5207
        2626  2607  2626  2625  0.00%  0.70%  84119   83150   -969
        211   208   208   205   -1.26% -1.21% 39924   39667   -257

GfxBench 5.0
            score                          peak memory size(KB)
        before      after       diff           before   after    diff
        max   avg   max   avg   max    avg
        260   258   259   256   -0.39% -0.85% 037      1013520  -97517
        298   295   298   297   0.00%  0.45%  1143593  1040844  -102749

Xiong, James (4):
  i965/drm: Reorganize code for the next patch
  i965/drm: Round down buffer size and calculate the bucket index
  i965/drm: Searching for a cached buffer for reuse
  i965/drm: Purge the bucket when its cached buffer is evicted

 src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
 src/util/list.h                        |   5 ++
 2 files changed, 79 insertions(+), 65 deletions(-)

--
2.7.4
Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
Hi,

On 02.05.2018 21:19, James Xiong wrote:
> On Wed, 2 May 2018 14:18:21 +0300 Eero Tamminen wrote:
[...]
>> You're missing information on:
>> * On which platform you did the testing (affects variance)
>> * how many test rounds you ran, and
>> * what is your variance
>
> I ran these tests on a gen9 platform/ubuntu 17.10 LTS.

If it's TDP limited in 3D tests (like all NUC and e.g. Broxton devices
seem to be in long running tests), it has clearly higher variance than
non-TDP (or temperature) limited desktop platforms.

> Most of the tests are consistent, especially the memory usage. The
> only exception is GfxBench 4.0 gl_manhattan; I had to run it 3 times
> and pick the highest one. I will apply this method to all tests and
> re-send with updated results.

(Comments below are about FPS results, not memory usage.)

Performance of many GPU bound tests doesn't have a normal Gaussian
distribution, but two (tight) peaks. On our BXT machines these peaks
are currently e.g. in GfxBench Manhattan tests *3%* apart from each
other.

While you can get results from both performance peaks, whether your
results fall onto either of these performance peaks is more likely to
change between boots (I think due to alignment changes in kernel
memory allocations) than between successive runs.

-> Your results may have less chance of being misleading if you don't
reboot when switching between the Mesa version with your patch and one
without.

Especially if you're running tests only on one machine (i.e. don't
have extra data from other machines against which you can correlate
results), I think you need more than 3 runs, both with and without
your patch.

While max() can provide a better comparison for this kind of bimodal
result distribution than avg(), you should still calculate and provide
variance for your data with your patches.

	- Eero
Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
On Wed, 2 May 2018 14:18:21 +0300 Eero Tamminen wrote:
> Hi,
>
> On 02.05.2018 02:25, James Xiong wrote:
> > From: "Xiong, James"
> >
> > With the current implementation, brw_bufmgr may round up a request
> > size to the next bucket size, resulting in up to 25% more memory
> > allocated in the worst scenario. For example:
> >   Request size    Actual size
> >   32KB+1Byte      40KB
> >   .
> >   8MB+1Byte       10MB
> >   .
> >   96MB+1Byte      112MB
> > This series aligns the buffer size up to a page instead of a bucket
> > size to improve memory allocation efficiency. Performances are
> > almost the same with Basemark ES3, GfxBench 4 and 5:
> >
> > Basemark ES3
> >         score                       peak memory allocation
> >     before     after      diff   before     after      diff
> >     21.537462  21.888784  1.61%  419766272  408809472  -10956800
> >     19.566198  19.763429  1.00%
>
> What memory you're measuring:
>
> * VmSize (not that relevant unless you're running out of address
>   space)?
>
> * PrivateDirty (listed in /proc/PID/smaps and e.g. by "smem" tool
>   [1])?
>
> * total of allocation sizes used by Mesa?
>
> Or something else?
>
> In general, unused memory isn't much of a problem, only dirty
> (written) memory. The kernel maps all unused memory to a single zero
> page, so unused memory takes only a few bytes of RAM for the page
> table entries (required for tracking the allocation pages).

I did the measurements in brw_bufmgr from user space. I kept track of
the allocated size for each brw_bufmgr context, and printed out the
peak allocated size when the test completed and the context was
destroyed. Basically, I increased/decreased the size when
I915_GEM_CREATE or GEM_CLOSE were called, so the cached buffers,
imported or user_ptr buffers were excluded. The brw_bufmgr context is
created when the test starts and destroyed after it completes; the
size is for the test case, in bytes. This method measures the exact
size allocated for a given test case, and the result is precise.

> > GfxBench 4.0
> >                   score                            peak memory
> >                   before          after           diff    before     after      diff
> > gl_4              564.6052246094  565.2348632813  0.11%   578490368  550199296  -28291072
> > gl_4_off          727.0440063477  703.5833129883  -3.33%  629501952  598216704  -31285248
> > gl_manhattan      1053.4223632813 1057.3690185547 0.37%   449568768  421134336  -28434432
> > gl_trex           2708.0656738281 2699.2646484375 -0.33%  130076672  125042688  -5033984
> > gl_alu2           1207.1490478516 1212.2220458984 0.42%   55496704   55029760   -466944
> > gl_driver2        103.0383071899  103.5478439331  0.49%   13107200   12980224   -126976
> > gl_manhattan_off  1703.4780273438 1736.9074707031 1.92%   490016768  456548352  -33468416
> > gl_trex_off       2951.6809082031 3058.5422363281 3.49%   157511680  152260608  -5251072
> > gl_alu2_off       2604.0903320313 2626.2524414063 0.84%   86130688   85483520   -647168
> > gl_driver2_off    204.0173187256  207.0510101318  1.47%   40869888   40615936   -253952
>
> You're missing information on:
> * On which platform you did the testing (affects variance)
> * how many test rounds you ran, and
> * what is your variance

I ran these tests on a gen9 platform/ubuntu 17.10 LTS. Most of the
tests are consistent, especially the memory usage. The only exception
is GfxBench 4.0 gl_manhattan; I had to run it 3 times and pick the
highest one. I will apply this method to all tests and re-send with
updated results.

> -> I don't know whether your numbers are just random noise.
>
> Memory is allocated in pages from the kernel, so there's no point in
> showing its usage as bytes. Please use KBs, that's more readable.
>
> (Because of randomness e.g. interactions with the windowing system,
> there can be some variance also in process memory usage, which may
> also be useful to report.)
>
> Because of variance, you don't need that many decimals for the
> scores. Removing the extra ones makes the data a bit more readable
> too.
>
> 	- Eero
>
> [1] "smem" is a python based tool available at least in Debian.
> If you want something simpler, e.g. a shell script working with
> minimal shells like Busybox, you can use this:
> https://github.com/maemo-tools-old/sp-memusage/blob/master/scripts/mem-smaps-private
>
> > GfxBench 5.0
> >           score            peak memory
> >           before  after    before      after       diff
> > gl_5      259     259      1137549312  1038286848  -99262464
> > gl_5_off  297     297      1170853888  1071357952  -99495936
> >
> > Xiong, James (4):
> >   i965/drm: Reorganize code for the next patch
> >   i965/drm: Round down buffer size and calculate the bucket index
> >   i965/drm: Searching for a cached buffer for reuse
> >   i965/drm: Purge the bucket when its cached buffer is evicted
Re: [Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
Hi,

On 02.05.2018 02:25, James Xiong wrote:
> From: "Xiong, James"
>
> With the current implementation, brw_bufmgr may round up a request
> size to the next bucket size, resulting in up to 25% more memory
> allocated in the worst scenario. For example:
>   Request size    Actual size
>   32KB+1Byte      40KB
>   .
>   8MB+1Byte       10MB
>   .
>   96MB+1Byte      112MB
> This series aligns the buffer size up to a page instead of a bucket
> size to improve memory allocation efficiency. Performances are
> almost the same with Basemark ES3, GfxBench 4 and 5:
>
> Basemark ES3
>         score                       peak memory allocation
>     before     after      diff   before     after      diff
>     21.537462  21.888784  1.61%  419766272  408809472  -10956800
>     19.566198  19.763429  1.00%

What memory you're measuring:

* VmSize (not that relevant unless you're running out of address
  space)?

* PrivateDirty (listed in /proc/PID/smaps and e.g. by "smem" tool
  [1])?

* total of allocation sizes used by Mesa?

Or something else?

In general, unused memory isn't much of a problem, only dirty
(written) memory. The kernel maps all unused memory to a single zero
page, so unused memory takes only a few bytes of RAM for the page
table entries (required for tracking the allocation pages).
> GfxBench 4.0
>                   score                            peak memory
>                   before          after           diff    before     after      diff
> gl_4              564.6052246094  565.2348632813  0.11%   578490368  550199296  -28291072
> gl_4_off          727.0440063477  703.5833129883  -3.33%  629501952  598216704  -31285248
> gl_manhattan      1053.4223632813 1057.3690185547 0.37%   449568768  421134336  -28434432
> gl_trex           2708.0656738281 2699.2646484375 -0.33%  130076672  125042688  -5033984
> gl_alu2           1207.1490478516 1212.2220458984 0.42%   55496704   55029760   -466944
> gl_driver2        103.0383071899  103.5478439331  0.49%   13107200   12980224   -126976
> gl_manhattan_off  1703.4780273438 1736.9074707031 1.92%   490016768  456548352  -33468416
> gl_trex_off       2951.6809082031 3058.5422363281 3.49%   157511680  152260608  -5251072
> gl_alu2_off       2604.0903320313 2626.2524414063 0.84%   86130688   85483520   -647168
> gl_driver2_off    204.0173187256  207.0510101318  1.47%   40869888   40615936   -253952

You're missing information on:
* On which platform you did the testing (affects variance)
* how many test rounds you ran, and
* what is your variance

-> I don't know whether your numbers are just random noise.

Memory is allocated in pages from the kernel, so there's no point in
showing its usage as bytes. Please use KBs, that's more readable.

(Because of randomness e.g. interactions with the windowing system,
there can be some variance also in process memory usage, which may
also be useful to report.)

Because of variance, you don't need that many decimals for the scores.
Removing the extra ones makes the data a bit more readable too.

	- Eero

[1] "smem" is a python based tool available at least in Debian.
If you want something simpler, e.g. a shell script working with
minimal shells like Busybox, you can use this:
https://github.com/maemo-tools-old/sp-memusage/blob/master/scripts/mem-smaps-private

> GfxBench 5.0
>           score            peak memory
>           before  after    before      after       diff
> gl_5      259     259      1137549312  1038286848  -99262464
> gl_5_off  297     297      1170853888  1071357952  -99495936
>
> Xiong, James (4):
>   i965/drm: Reorganize code for the next patch
>   i965/drm: Round down buffer size and calculate the bucket index
>   i965/drm: Searching for a cached buffer for reuse
>   i965/drm: Purge the bucket when its cached buffer is evicted
>
>  src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
>  src/util/list.h                        |   5 ++
>  2 files changed, 79 insertions(+), 65 deletions(-)
[Mesa-dev] [PATCH 0/4] improve buffer cache and reuse
From: "Xiong, James"

With the current implementation, brw_bufmgr may round up a request
size to the next bucket size, resulting in up to 25% more memory
allocated in the worst scenario. For example:
  Request size    Actual size
  32KB+1Byte      40KB
  .
  8MB+1Byte       10MB
  .
  96MB+1Byte      112MB
This series aligns the buffer size up to a page instead of a bucket
size to improve memory allocation efficiency. Performances are almost
the same with Basemark ES3, GfxBench 4 and 5:

Basemark ES3
        score                       peak memory allocation
    before     after      diff   before     after      diff
    21.537462  21.888784  1.61%  419766272  408809472  -10956800
    19.566198  19.763429  1.00%

GfxBench 4.0
                  score                            peak memory
                  before          after           diff    before     after      diff
gl_4              564.6052246094  565.2348632813  0.11%   578490368  550199296  -28291072
gl_4_off          727.0440063477  703.5833129883  -3.33%  629501952  598216704  -31285248
gl_manhattan      1053.4223632813 1057.3690185547 0.37%   449568768  421134336  -28434432
gl_trex           2708.0656738281 2699.2646484375 -0.33%  130076672  125042688  -5033984
gl_alu2           1207.1490478516 1212.2220458984 0.42%   55496704   55029760   -466944
gl_driver2        103.0383071899  103.5478439331  0.49%   13107200   12980224   -126976
gl_manhattan_off  1703.4780273438 1736.9074707031 1.92%   490016768  456548352  -33468416
gl_trex_off       2951.6809082031 3058.5422363281 3.49%   157511680  152260608  -5251072
gl_alu2_off       2604.0903320313 2626.2524414063 0.84%   86130688   85483520   -647168
gl_driver2_off    204.0173187256  207.0510101318  1.47%   40869888   40615936   -253952

GfxBench 5.0
          score            peak memory
          before  after    before      after       diff
gl_5      259     259      1137549312  1038286848  -99262464
gl_5_off  297     297      1170853888  1071357952  -99495936

Xiong, James (4):
  i965/drm: Reorganize code for the next patch
  i965/drm: Round down buffer size and calculate the bucket index
  i965/drm: Searching for a cached buffer for reuse
  i965/drm: Purge the bucket when its cached buffer is evicted

 src/mesa/drivers/dri/i965/brw_bufmgr.c | 139 ++---
 src/util/list.h                        |   5 ++
 2 files changed, 79 insertions(+), 65 deletions(-)

--
2.7.4