Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/12/5, Wanpeng Li <liw...@linux.vnet.ibm.com>:
> Hi Namjae,
>
> How about setting bdi->dirty_background_bytes according to bdi_thresh? I
> found an issue in the background flush process while reviewing the code:
> when the background flush threshold is exceeded, wb_check_background_flush
> kicks a work to the current per-bdi flusher, but it may be heavy dirtiers
> on other bdis that dirtied the pages, not the current bdi. The worst case
> is that the current bdi holds many frequently used pages and the flush
> pushes them out of the cache. How about adding a check in
> wb_check_background_flush: if it is not the current bdi that contributes a
> large number of dirty pages to the background flush threshold (over
> bdi->dirty_background_bytes), then don't bother it.

Hi Wanpeng.

First, thanks for your suggestion! Yes, I think it looks reasonable.
I will start checking it.

Thanks.

> Regards,
> Wanpeng Li
>
> On Tue, Nov 20, 2012 at 08:18:59AM +0900, Namjae Jeon wrote:
>> 2012/10/22, Dave Chinner <da...@fromorbit.com>:
>>> On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
>>>> Hi Dave.
>>>>
>>>> Test Procedure:
>>>>
>>>> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
>>>>
>>>> 2) Run WRITE test (create 1 GB file) on NFS Client with default
>>>> writeback settings on NFS Server. By default
>>>> bdi->dirty_background_bytes = 0, which means no change in default
>>>> writeback behaviour
>>>>
>>>> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal
>>>> to local USB disk write speed on NFS Server)
>>>> *** only on NFS Server - not on NFS Client ***
>>>
>>> Ok, so the results look good, but it's not really addressing what I
>>> was asking, though. A typical desktop PC has a disk that can do
>>> 100MB/s and GbE, so I was expecting a test that showed throughput
>>> close to GbE maximums at least (ie. around that 100MB/s). I have 3
>>> year old, low end, low power hardware (atom) that handles twice the
>>> throughput you are testing here, and most current consumer NAS
>>> devices are more powerful than this. IOWs, I think the rates you are
>>> testing at are probably too low even for the consumer NAS market to
>>> consider relevant...
>>>
>>>> --
>>>> Multiple NFS Client test:
>>>> ---
>>>> Sorry - We could not arrange multiple PCs to verify this.
>>>> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
>>>> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File
>>>
>>> But this really doesn't tell us anything - it's still only 100Mb/s,
>>> which we'd expect is already getting very close to line rate even
>>> with low powered client hardware.
>>>
>>> What I'm concerned about is the NFS server "sweet spot" - a $10k
>>> server that exports 20TB of storage and can sustain close to a GB/s
>>> of NFS traffic over a single 10GbE link with tens to hundreds of
>>> clients. 100MB/s and 10 clients is about the minimum needed to be
>>> able to extrapolate a little and make an informed guess of how it
>>> will scale up.
>>>
>>>>> 1. what's the comparison in performance to typical NFS server
>>>>> writeback parameter tuning? i.e. dirty_background_ratio=5,
>>>>> dirty_ratio=10, dirty_expire_centisecs=1000,
>>>>> dirty_writeback_centisecs=1? i.e. does this change give any
>>>>> benefit over the current common practice for configuring NFS
>>>>> servers?
>>>>
>>>> Agreed, the above improvement in write speed can be achieved by
>>>> tuning the above write-back parameters. But if we change these
>>>> settings, it will change write-back behaviour system wide. On the
>>>> other hand, if we change the proposed per-bdi setting,
>>>> bdi->dirty_background_bytes, it will change write-back behaviour
>>>> only for the block device exported on the NFS server.
>>>
>>> I already know what the difference between global vs per-bdi tuning
>>> means. What I want to know is how your results compare
>>> *numerically* to just having a tweaked global setting on a vanilla
>>> kernel. i.e. is there really any performance benefit to per-bdi
>>> configuration that cannot be gained by existing methods?
>>>
>>>>> 2. what happens when you have 10 clients all writing to the server
>>>>> at once? Or a 100? NFS servers rarely have a single writer to a
>>>>> single file at a time, so what impact does this change have on
>>>>> multiple concurrent file write performance from multiple clients?
>>>>
>>>> Sorry, we could not arrange more than 2 PCs for verifying this.
>>>
>>> Really? Well, perhaps there's some tools that might be useful for
>>> you here:
>>>
>>> http://oss.sgi.com/projects/nfs/testtools/
>>>
>>> "Weber
>>>
>>> Test load generator for NFS. Uses multiple threads, multiple
>>> sockets and multiple IP addresses to simulate loads from many
>>> machines, thus enabling testing of NFS server setups with larger
>>> client counts than can be tested with physical infrastructure (or
>>> Virtual Machine clients). Has been useful in automated NFS testing
>>> and as a pinpoint NFS load generator tool for performance
>>> development."
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/10/22, Dave Chinner <da...@fromorbit.com>:
> On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
>> Hi Dave.
>>
>> Test Procedure:
>>
>> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
>>
>> 2) Run WRITE test (create 1 GB file) on NFS Client with default
>> writeback settings on NFS Server. By default
>> bdi->dirty_background_bytes = 0, which means no change in default
>> writeback behaviour
>>
>> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
>> local USB disk write speed on NFS Server)
>> *** only on NFS Server - not on NFS Client ***
>
> Ok, so the results look good, but it's not really addressing what I
> was asking, though. A typical desktop PC has a disk that can do
> 100MB/s and GbE, so I was expecting a test that showed throughput
> close to GbE maximums at least (ie. around that 100MB/s). I have 3
> year old, low end, low power hardware (atom) that handles twice the
> throughput you are testing here, and most current consumer NAS
> devices are more powerful than this. IOWs, I think the rates you are
> testing at are probably too low even for the consumer NAS market to
> consider relevant...
>
>> --
>> Multiple NFS Client test:
>> ---
>> Sorry - We could not arrange multiple PCs to verify this.
>> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
>> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File
>
> But this really doesn't tell us anything - it's still only 100Mb/s,
> which we'd expect is already getting very close to line rate even
> with low powered client hardware.
>
> What I'm concerned about is the NFS server "sweet spot" - a $10k
> server that exports 20TB of storage and can sustain close to a GB/s of
> NFS traffic over a single 10GbE link with tens to hundreds of clients.
> 100MB/s and 10 clients is about the minimum needed to be able to
> extrapolate a little and make an informed guess of how it will scale
> up.
>
>>> 1. what's the comparison in performance to typical NFS server
>>> writeback parameter tuning? i.e. dirty_background_ratio=5,
>>> dirty_ratio=10, dirty_expire_centisecs=1000,
>>> dirty_writeback_centisecs=1? i.e. does this change give any benefit
>>> over the current common practice for configuring NFS servers?
>>
>> Agreed, the above improvement in write speed can be achieved by
>> tuning the above write-back parameters. But if we change these
>> settings, it will change write-back behaviour system wide. On the
>> other hand, if we change the proposed per-bdi setting,
>> bdi->dirty_background_bytes, it will change write-back behaviour only
>> for the block device exported on the NFS server.
>
> I already know what the difference between global vs per-bdi tuning
> means. What I want to know is how your results compare *numerically*
> to just having a tweaked global setting on a vanilla kernel. i.e. is
> there really any performance benefit to per-bdi configuration that
> cannot be gained by existing methods?
>
>>> 2. what happens when you have 10 clients all writing to the server
>>> at once? Or a 100? NFS servers rarely have a single writer to a
>>> single file at a time, so what impact does this change have on
>>> multiple concurrent file write performance from multiple clients?
>>
>> Sorry, we could not arrange more than 2 PCs for verifying this.
>
> Really? Well, perhaps there's some tools that might be useful for you
> here:
>
> http://oss.sgi.com/projects/nfs/testtools/
>
> "Weber
>
> Test load generator for NFS. Uses multiple threads, multiple sockets
> and multiple IP addresses to simulate loads from many machines, thus
> enabling testing of NFS server setups with larger client counts than
> can be tested with physical infrastructure (or Virtual Machine
> clients). Has been useful in automated NFS testing and as a pinpoint
> NFS load generator tool for performance development."

Hi Dave,

We ran the "weber" test on the below setup:
1) SATA HDD - Local WRITE speed ~120 MB/s, NFS WRITE speed ~90 MB/s
2) Used a 10GbE network interface to mount NFS

We ran the "weber" test with NFS clients ranging from 1 to 100; below is
the % GAIN in NFS WRITE speed with bdi->dirty_background_bytes = 100 MB
at the NFS server:

| Number of NFS Clients | % GAIN in WRITE Speed |
|-----------------------|-----------------------|
|                     1 |               19.83 % |
|                     2 |                2.97 % |
|                     3 |                2.01 % |
|                    10 |                0.25 % |
|                    20 |                0.23 % |
|                    30 |                0.13 % |
|                   100 |               -0.60 % |
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
> Hi Dave.
>
> Test Procedure:
>
> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
>
> 2) Run WRITE test (create 1 GB file) on NFS Client with default
> writeback settings on NFS Server. By default
> bdi->dirty_background_bytes = 0, which means no change in default
> writeback behaviour
>
> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
> local USB disk write speed on NFS Server)
> *** only on NFS Server - not on NFS Client ***

Ok, so the results look good, but it's not really addressing what I
was asking, though. A typical desktop PC has a disk that can do
100MB/s and GbE, so I was expecting a test that showed throughput
close to GbE maximums at least (ie. around that 100MB/s). I have 3
year old, low end, low power hardware (atom) that handles twice the
throughput you are testing here, and most current consumer NAS
devices are more powerful than this. IOWs, I think the rates you are
testing at are probably too low even for the consumer NAS market to
consider relevant...

> --
> Multiple NFS Client test:
> ---
> Sorry - We could not arrange multiple PCs to verify this.
> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File

But this really doesn't tell us anything - it's still only 100Mb/s,
which we'd expect is already getting very close to line rate even
with low powered client hardware.

What I'm concerned about is the NFS server "sweet spot" - a $10k
server that exports 20TB of storage and can sustain close to a GB/s of
NFS traffic over a single 10GbE link with tens to hundreds of clients.
100MB/s and 10 clients is about the minimum needed to be able to
extrapolate a little and make an informed guess of how it will scale
up.

>> 1. what's the comparison in performance to typical NFS server
>> writeback parameter tuning? i.e. dirty_background_ratio=5,
>> dirty_ratio=10, dirty_expire_centisecs=1000,
>> dirty_writeback_centisecs=1? i.e. does this change give any benefit
>> over the current common practice for configuring NFS servers?
>
> Agreed, the above improvement in write speed can be achieved by tuning
> the above write-back parameters. But if we change these settings, it
> will change write-back behaviour system wide. On the other hand, if we
> change the proposed per-bdi setting, bdi->dirty_background_bytes, it
> will change write-back behaviour only for the block device exported on
> the NFS server.

I already know what the difference between global vs per-bdi tuning
means. What I want to know is how your results compare *numerically*
to just having a tweaked global setting on a vanilla kernel. i.e. is
there really any performance benefit to per-bdi configuration that
cannot be gained by existing methods?

>> 2. what happens when you have 10 clients all writing to the server
>> at once? Or a 100? NFS servers rarely have a single writer to a
>> single file at a time, so what impact does this change have on
>> multiple concurrent file write performance from multiple clients?
>
> Sorry, we could not arrange more than 2 PCs for verifying this.

Really? Well, perhaps there's some tools that might be useful for
you here:

http://oss.sgi.com/projects/nfs/testtools/

"Weber

Test load generator for NFS. Uses multiple threads, multiple sockets
and multiple IP addresses to simulate loads from many machines, thus
enabling testing of NFS server setups with larger client counts than
can be tested with physical infrastructure (or Virtual Machine
clients). Has been useful in automated NFS testing and as a pinpoint
NFS load generator tool for performance development."

>> 3. Following on from the multiple client test, what difference does
>> it make to file fragmentation rates? Writing more frequently means
>> smaller allocations and writes, and that tends to lead to higher
>> fragmentation rates, especially when multiple files are being
>> written concurrently. Higher fragmentation also means lower
>> performance over time as fragmentation accelerates filesystem aging
>> effects on performance. IOWs, it may be faster when new, but it
>> will be slower 3 months down the track and that's a bad tradeoff to
>> make.
>
> We agree that there could be a bit more fragmentation. But as you
> know, we are not changing writeback settings at NFS clients. So,
> write-back behaviour on the NFS client will not change - IO requests
> will be buffered at the NFS client as per existing write-back
> behaviour.

I think you misunderstand - writeback settings on the server greatly
impact the way the server writes data and therefore the way files are
fragmented. It has nothing to do with client side tuning.

Effectively, what you are presenting is best case numbers - empty
filesystem, single client, streaming write, no fragmentation, no
allocation contention, no competing IO
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
Hi Dave.

Test Procedure:

1) Local USB disk WRITE speed on NFS server is ~25 MB/s

2) Run WRITE test (create 1 GB file) on NFS Client with default
writeback settings on NFS Server. By default
bdi->dirty_background_bytes = 0, which means no change in default
writeback behaviour

3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
local USB disk write speed on NFS Server)
*** only on NFS Server - not on NFS Client ***

[NFS Server]
# echo $((25*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes
# cat /sys/block/sdb/bdi/dirty_background_bytes
26214400

4) Run WRITE test again on NFS client to see the change in WRITE speed
at the NFS client

Test setup details:
Test result on PC - FC16 - RAM 3 GB - ethernet - 1000 Mbits/s, create
1 GB file

Table 1: XFS over NFS - WRITE SPEED on NFS Client

            default writeback     bdi->dirty_background_bytes = 25 MB
RecSize     write speed (MB/s)    write speed (MB/s)    % Change
10485760    27.39                 28.53                  4%
1048576     27.9                  28.59                  2%
524288      27.55                 28.94                  5%
262144      25.4                  28.58                 13%
131072      25.73                 27.55                  7%
65536       25.85                 28.45                 10%
32768       26.13                 28.64                 10%
16384       26.17                 27.93                  7%
8192        25.64                 28.07                  9%
4096        26.28                 28.19                  7%

Table 2: EXT4 over NFS - WRITE SPEED on NFS Client

            default writeback     bdi->dirty_background_bytes = 25 MB
RecSize     write speed (MB/s)    write speed (MB/s)    % Change
10485760    23.87                 28.3                  19%
1048576     24.81                 27.79                 12%
524288      24.53                 28.14                 15%
262144      24.21                 27.99                 16%
131072      24.11                 28.33                 18%
65536       23.73                 28.21                 19%
32768       25.66                 27.52                  7%
16384       24.3                  27.67                 14%
8192        23.6                  27.08                 15%
4096        23.35                 27.24                 17%

As shown in Tables 1 & 2, there is a performance improvement on the NFS
client over gigabit Ethernet on both EXT4 and XFS over NFS. We did not
observe any degradation in write speed. However, the performance gain
varies across file systems, i.e. it differs between XFS and EXT4 over
NFS. We also tried this change on BTRFS over NFS, but we did not see
any significant change in WRITE speed.

--
Multiple NFS Client test:
---
Sorry - We could not arrange multiple PCs to verify this.
So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File

Table 3: bdi->dirty_background_bytes = 0 MB - default writeback
behaviour

RecSize     Write Speed     Write Speed     Combined
            on Client 1     on Client 2     write speed
            (MB/s)          (MB/s)          (MB/s)
10485760    5.45            5.36            10.81
1048576     5.44            5.34            10.78
524288      5.48            5.51            10.99
262144      6.24            4.83            11.07
131072      5.58            5.53            11.11
65536       5.51            5.48            10.99
32768       5.42            5.46            10.88
16384       5.62            5.58            11.2
8192        5.59            5.49            11.08
4096        5.57            6.38
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
Hi Dave. Test Procedure: 1) Local USB disk WRITE speed on NFS server is ~25 MB/s 2) Run WRITE test(create 1 GB file) on NFS Client with default writeback settings on NFS Server. By default bdi-dirty_background_bytes = 0, that means no change in default writeback behaviour 3) Next we change bdi-dirty_background_bytes = 25 MB (almost equal to local USB disk write speed on NFS Server) *** only on NFS Server - not on NFS Client *** [NFS Server] # echo $((25*1024*1024)) /sys/block/sdb/bdi/dirty_background_bytes # cat /sys/block/sdb/bdi/dirty_background_bytes 26214400 4) Run WRITE test again on NFS client to see change in WRITE speed at NFS client Test setup details: Test result on PC - FC16 - RAM 3 GB - ethernet - 1000 Mbits/s, Create 1 GB File Table 1: XFS over NFS - WRITE SPEED on NFS Client default writebackbdi-dirty_background_bytes setting = 25 MB RecSize write speed(MB/s) write speed(MB/s) % Change 10485760 27.39 28.53 4% 1048576 27.928.59 2% 524288 27.55 28.94 5% 262144 25.428.58 13% 131072 25.73 27.55 7% 65536 25.85 28.4510% 32768 26.13 28.6410% 16384 26.17 27.93 7% 8192 25.6428.07 9% 4096 26.2828.19 7% -- Table 2: EXT4 over NFS - WRITE SPEED on NFS Client -- default writebackbdi-dirty_background_bytes setting = 25 MB RecSize write speed(MB/s) write speed(MB/s) % Change 10485760 23.87 28.319% 1048576 24.8127.79 12% 52428824.53 28.14 15% 26214424.21 27.99 16% 13107224.11 28.33 18% 65536 23.73 28.21 19% 32768 25.66 27.527% 16384 24.3 27.67 14% 819223.6 27.08 15% 409623.35 27.2417% As mentioned in the above Table 1 2, there is performance improvement on NFS client on gigabit Ethernet on both EXT4/XFS over NFS. We did not observe any degradation in write speed. However, performance gain varies on different file systems i.e. different on XFS EXT4 over NFS. We also tried this change on BTRFS over NFS, but we did not see any significant change in WRITE speed. -- Multiple NFS Client test: --- Sorry - We could not arrange multiple PCs to verify this. 
So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB file

Table 3: bdi->dirty_background_bytes = 0 MB - default writeback behaviour
                Write Speed    Write Speed    Combined
RecSize         on Client 1    on Client 2    write speed
                (MB/s)         (MB/s)         (MB/s)
10485760        5.45           5.36           10.81
1048576         5.44           5.34           10.78
524288          5.48           5.51           10.99
262144          6.24           4.83           11.07
131072          5.58           5.53           11.11
65536           5.51           5.48           10.99
32768           5.42           5.46           10.88
16384           5.62           5.58           11.2
8192            5.59           5.49           11.08
4096            5.57           6.38
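For reference, the knob-setting step in the procedure above can be scripted. This is a minimal sketch under the same assumptions as the test (device name `sdb` and the 25 MB/s measured speed come from the procedure; the write is guarded so the script is a no-op on kernels without the proposed sysfs file):

```shell
#!/bin/sh
# Sketch of step 3: set the per-bdi background threshold to the measured
# local write speed of the exported disk (values from the test above).
DEV=sdb          # exported block device on the NFS server
SPEED_MB=25      # measured local USB disk write speed, MB/s

# convert MB/s to the byte value the sysfs knob expects
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

BYTES=$(mb_to_bytes "$SPEED_MB")   # 25 MB -> 26214400 bytes
KNOB="/sys/block/$DEV/bdi/dirty_background_bytes"

# only write if the patched kernel actually exposes the knob
if [ -w "$KNOB" ]; then
    echo "$BYTES" > "$KNOB"
    cat "$KNOB"
fi
```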
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/9/27, Jan Kara : > On Thu 27-09-12 15:00:18, Namjae Jeon wrote: >> 2012/9/27, Jan Kara : >> > On Thu 27-09-12 00:56:02, Wu Fengguang wrote: >> >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: >> >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: >> >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >> >> > > > From: Namjae Jeon >> >> > > > >> >> > > > This patch is based on suggestion by Wu Fengguang: >> >> > > > https://lkml.org/lkml/2011/8/19/19 >> >> > > > >> >> > > > kernel has mechanism to do writeback as per dirty_ratio and >> >> > > > dirty_background >> >> > > > ratio. It also maintains per task dirty rate limit to keep >> >> > > > balance >> >> > > > of >> >> > > > dirty pages at any given instance by doing bdi bandwidth >> >> > > > estimation. >> >> > > > >> >> > > > Kernel also has max_ratio/min_ratio tunables to specify >> >> > > > percentage >> >> > > > of >> >> > > > writecache to control per bdi dirty limits and task throttling. >> >> > > > >> >> > > > However, there might be a usecase where user wants a per bdi >> >> > > > writeback tuning >> >> > > > parameter to flush dirty data once per bdi dirty data reach a >> >> > > > threshold >> >> > > > especially at NFS server. >> >> > > > >> >> > > > dirty_background_centisecs provides an interface where user can >> >> > > > tune >> >> > > > background writeback start threshold using >> >> > > > /sys/block/sda/bdi/dirty_background_centisecs >> >> > > > >> >> > > > dirty_background_centisecs is used alongwith average bdi write >> >> > > > bandwidth >> >> > > > estimation to start background writeback. >> >> > The functionality you describe, i.e. start flushing bdi when >> >> > there's >> >> > reasonable amount of dirty data on it, looks sensible and useful. >> >> > However >> >> > I'm not so sure whether the interface you propose is the right one. 
>> >> > Traditionally, we allow user to set amount of dirty data (either in >> >> > bytes >> >> > or percentage of memory) when background writeback should start. You >> >> > propose setting the amount of data in centisecs-to-write. Why that >> >> > difference? Also this interface ties our throughput estimation code >> >> > (which >> >> > is an implementation detail of current dirty throttling) with the >> >> > userspace >> >> > API. So we'd have to maintain the estimation code forever, possibly >> >> > also >> >> > face problems when we change the estimation code (and thus estimates >> >> > in >> >> > some cases) and users will complain that the values they set >> >> > originally >> >> > no >> >> > longer work as they used to. >> >> >> >> Yes, that bandwidth estimation is not all that (and in theory cannot >> >> be made) reliable which may be a surprise to the user. Which make the >> >> interface flaky. >> >> >> >> > Also, as with each knob, there's a problem how to properly set its >> >> > value? >> >> > Most admins won't know about the knob and so won't touch it. Others >> >> > might >> >> > know about the knob but will have hard time figuring out what value >> >> > should >> >> > they set. So if there's a new knob, it should have a sensible >> >> > initial >> >> > value. And since this feature looks like a useful one, it shouldn't >> >> > be >> >> > zero. >> >> >> >> Agreed in principle. There seems be no reasonable defaults for the >> >> centisecs-to-write interface, mainly due to its inaccurate nature, >> >> especially the initial value may be wildly wrong on fresh system >> >> bootup. This is also true for your proposed interfaces, see below. >> >> >> >> > So my personal preference would be to have >> >> > bdi->dirty_background_ratio >> >> > and >> >> > bdi->dirty_background_bytes and start background writeback whenever >> >> > one of global background limit and per-bdi background limit is >> >> > exceeded. 
>> >> > I >> >> > think this interface will do the job as well and it's easier to >> >> > maintain >> >> > in >> >> > future. >> >> >> >> bdi->dirty_background_ratio, if I understand its semantics right, is >> >> unfortunately flaky in the same principle as centisecs-to-write, >> >> because it relies on the (implicitly estimation of) writeout >> >> proportions. The writeout proportions for each bdi starts with 0, >> >> which is even worse than the 100MB/s initial value for >> >> bdi->write_bandwidth and will trigger background writeback on the >> >> first write. >> > Well, I meant bdi->dirty_backround_ratio wouldn't use writeout >> > proportion >> > estimates at all. Limit would be >> > dirtiable_memory * bdi->dirty_backround_ratio. >> > >> > After all we want to start writeout to bdi when we have enough pages to >> > reasonably load the device for a while which has nothing to do with how >> > much is written to this device as compared to other devices. >> > >> > OTOH I'm not particularly attached to this interface. Especially since >> > on a >> > lot of today's machines, 1% is rather big so people might often end up >> > using dirty_background_bytes
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
On Thu 27-09-12 15:00:18, Namjae Jeon wrote: > 2012/9/27, Jan Kara : > > On Thu 27-09-12 00:56:02, Wu Fengguang wrote: > >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: > >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: > >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > >> > > > From: Namjae Jeon > >> > > > > >> > > > This patch is based on suggestion by Wu Fengguang: > >> > > > https://lkml.org/lkml/2011/8/19/19 > >> > > > > >> > > > kernel has mechanism to do writeback as per dirty_ratio and > >> > > > dirty_background > >> > > > ratio. It also maintains per task dirty rate limit to keep balance > >> > > > of > >> > > > dirty pages at any given instance by doing bdi bandwidth > >> > > > estimation. > >> > > > > >> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage > >> > > > of > >> > > > writecache to control per bdi dirty limits and task throttling. > >> > > > > >> > > > However, there might be a usecase where user wants a per bdi > >> > > > writeback tuning > >> > > > parameter to flush dirty data once per bdi dirty data reach a > >> > > > threshold > >> > > > especially at NFS server. > >> > > > > >> > > > dirty_background_centisecs provides an interface where user can > >> > > > tune > >> > > > background writeback start threshold using > >> > > > /sys/block/sda/bdi/dirty_background_centisecs > >> > > > > >> > > > dirty_background_centisecs is used alongwith average bdi write > >> > > > bandwidth > >> > > > estimation to start background writeback. > >> > The functionality you describe, i.e. start flushing bdi when there's > >> > reasonable amount of dirty data on it, looks sensible and useful. > >> > However > >> > I'm not so sure whether the interface you propose is the right one. > >> > Traditionally, we allow user to set amount of dirty data (either in > >> > bytes > >> > or percentage of memory) when background writeback should start. 
You > >> > propose setting the amount of data in centisecs-to-write. Why that > >> > difference? Also this interface ties our throughput estimation code > >> > (which > >> > is an implementation detail of current dirty throttling) with the > >> > userspace > >> > API. So we'd have to maintain the estimation code forever, possibly > >> > also > >> > face problems when we change the estimation code (and thus estimates in > >> > some cases) and users will complain that the values they set originally > >> > no > >> > longer work as they used to. > >> > >> Yes, that bandwidth estimation is not all that (and in theory cannot > >> be made) reliable which may be a surprise to the user. Which make the > >> interface flaky. > >> > >> > Also, as with each knob, there's a problem how to properly set its > >> > value? > >> > Most admins won't know about the knob and so won't touch it. Others > >> > might > >> > know about the knob but will have hard time figuring out what value > >> > should > >> > they set. So if there's a new knob, it should have a sensible initial > >> > value. And since this feature looks like a useful one, it shouldn't be > >> > zero. > >> > >> Agreed in principle. There seems be no reasonable defaults for the > >> centisecs-to-write interface, mainly due to its inaccurate nature, > >> especially the initial value may be wildly wrong on fresh system > >> bootup. This is also true for your proposed interfaces, see below. > >> > >> > So my personal preference would be to have bdi->dirty_background_ratio > >> > and > >> > bdi->dirty_background_bytes and start background writeback whenever > >> > one of global background limit and per-bdi background limit is exceeded. > >> > I > >> > think this interface will do the job as well and it's easier to maintain > >> > in > >> > future. 
> >> > >> bdi->dirty_background_ratio, if I understand its semantics right, is > >> unfortunately flaky in the same principle as centisecs-to-write, > >> because it relies on the (implicitly estimation of) writeout > >> proportions. The writeout proportions for each bdi starts with 0, > >> which is even worse than the 100MB/s initial value for > >> bdi->write_bandwidth and will trigger background writeback on the > >> first write. > > Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion > > estimates at all. Limit would be > > dirtiable_memory * bdi->dirty_backround_ratio. > > > > After all we want to start writeout to bdi when we have enough pages to > > reasonably load the device for a while which has nothing to do with how > > much is written to this device as compared to other devices. > > > > OTOH I'm not particularly attached to this interface. Especially since on a > > lot of today's machines, 1% is rather big so people might often end up > > using dirty_background_bytes anyway. > > > >> bdi->dirty_background_bytes is, however, reliable, and gives users > >> total control. If we export this interface alone, I'd imagine users > >> who want to control centisecs-to-write could run a simple
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/9/27, Jan Kara : > On Thu 27-09-12 00:56:02, Wu Fengguang wrote: >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >> > > > From: Namjae Jeon >> > > > >> > > > This patch is based on suggestion by Wu Fengguang: >> > > > https://lkml.org/lkml/2011/8/19/19 >> > > > >> > > > kernel has mechanism to do writeback as per dirty_ratio and >> > > > dirty_background >> > > > ratio. It also maintains per task dirty rate limit to keep balance >> > > > of >> > > > dirty pages at any given instance by doing bdi bandwidth >> > > > estimation. >> > > > >> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage >> > > > of >> > > > writecache to control per bdi dirty limits and task throttling. >> > > > >> > > > However, there might be a usecase where user wants a per bdi >> > > > writeback tuning >> > > > parameter to flush dirty data once per bdi dirty data reach a >> > > > threshold >> > > > especially at NFS server. >> > > > >> > > > dirty_background_centisecs provides an interface where user can >> > > > tune >> > > > background writeback start threshold using >> > > > /sys/block/sda/bdi/dirty_background_centisecs >> > > > >> > > > dirty_background_centisecs is used alongwith average bdi write >> > > > bandwidth >> > > > estimation to start background writeback. >> > The functionality you describe, i.e. start flushing bdi when there's >> > reasonable amount of dirty data on it, looks sensible and useful. >> > However >> > I'm not so sure whether the interface you propose is the right one. >> > Traditionally, we allow user to set amount of dirty data (either in >> > bytes >> > or percentage of memory) when background writeback should start. You >> > propose setting the amount of data in centisecs-to-write. Why that >> > difference? 
Also this interface ties our throughput estimation code >> > (which >> > is an implementation detail of current dirty throttling) with the >> > userspace >> > API. So we'd have to maintain the estimation code forever, possibly >> > also >> > face problems when we change the estimation code (and thus estimates in >> > some cases) and users will complain that the values they set originally >> > no >> > longer work as they used to. >> >> Yes, that bandwidth estimation is not all that (and in theory cannot >> be made) reliable which may be a surprise to the user. Which make the >> interface flaky. >> >> > Also, as with each knob, there's a problem how to properly set its >> > value? >> > Most admins won't know about the knob and so won't touch it. Others >> > might >> > know about the knob but will have hard time figuring out what value >> > should >> > they set. So if there's a new knob, it should have a sensible initial >> > value. And since this feature looks like a useful one, it shouldn't be >> > zero. >> >> Agreed in principle. There seems be no reasonable defaults for the >> centisecs-to-write interface, mainly due to its inaccurate nature, >> especially the initial value may be wildly wrong on fresh system >> bootup. This is also true for your proposed interfaces, see below. >> >> > So my personal preference would be to have bdi->dirty_background_ratio >> > and >> > bdi->dirty_background_bytes and start background writeback whenever >> > one of global background limit and per-bdi background limit is exceeded. >> > I >> > think this interface will do the job as well and it's easier to maintain >> > in >> > future. >> >> bdi->dirty_background_ratio, if I understand its semantics right, is >> unfortunately flaky in the same principle as centisecs-to-write, >> because it relies on the (implicitly estimation of) writeout >> proportions. 
The writeout proportions for each bdi starts with 0, >> which is even worse than the 100MB/s initial value for >> bdi->write_bandwidth and will trigger background writeback on the >> first write. > Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion > estimates at all. Limit would be > dirtiable_memory * bdi->dirty_backround_ratio. > > After all we want to start writeout to bdi when we have enough pages to > reasonably load the device for a while which has nothing to do with how > much is written to this device as compared to other devices. > > OTOH I'm not particularly attached to this interface. Especially since on a > lot of today's machines, 1% is rather big so people might often end up > using dirty_background_bytes anyway. > >> bdi->dirty_background_bytes is, however, reliable, and gives users >> total control. If we export this interface alone, I'd imagine users >> who want to control centisecs-to-write could run a simple script to >> periodically get the write bandwith value out of the existing bdi >> interface and echo it into bdi->dirty_background_bytes. Which makes >> simple yet good enough centisecs-to-write controlling. >> >> So what do you think about exporting a
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
On Thu 27-09-12 00:56:02, Wu Fengguang wrote: > On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote: > > On Thu 20-09-12 16:44:22, Wu Fengguang wrote: > > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > > > > From: Namjae Jeon > > > > > > > > This patch is based on suggestion by Wu Fengguang: > > > > https://lkml.org/lkml/2011/8/19/19 > > > > > > > > kernel has mechanism to do writeback as per dirty_ratio and > > > > dirty_background > > > > ratio. It also maintains per task dirty rate limit to keep balance of > > > > dirty pages at any given instance by doing bdi bandwidth estimation. > > > > > > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > > > > writecache to control per bdi dirty limits and task throttling. > > > > > > > > However, there might be a usecase where user wants a per bdi writeback > > > > tuning > > > > parameter to flush dirty data once per bdi dirty data reach a threshold > > > > especially at NFS server. > > > > > > > > dirty_background_centisecs provides an interface where user can tune > > > > background writeback start threshold using > > > > /sys/block/sda/bdi/dirty_background_centisecs > > > > > > > > dirty_background_centisecs is used alongwith average bdi write bandwidth > > > > estimation to start background writeback. > > The functionality you describe, i.e. start flushing bdi when there's > > reasonable amount of dirty data on it, looks sensible and useful. However > > I'm not so sure whether the interface you propose is the right one. > > Traditionally, we allow user to set amount of dirty data (either in bytes > > or percentage of memory) when background writeback should start. You > > propose setting the amount of data in centisecs-to-write. Why that > > difference? Also this interface ties our throughput estimation code (which > > is an implementation detail of current dirty throttling) with the userspace > > API. 
So we'd have to maintain the estimation code forever, possibly also > > face problems when we change the estimation code (and thus estimates in > > some cases) and users will complain that the values they set originally no > > longer work as they used to. > > Yes, that bandwidth estimation is not all that (and in theory cannot > be made) reliable which may be a surprise to the user. Which makes the > interface flaky. > > > Also, as with each knob, there's a problem how to properly set its value? > > Most admins won't know about the knob and so won't touch it. Others might > > know about the knob but will have a hard time figuring out what value > > they should set. So if there's a new knob, it should have a sensible initial > > value. And since this feature looks like a useful one, it shouldn't be > > zero. > > Agreed in principle. There seem to be no reasonable defaults for the > centisecs-to-write interface, mainly due to its inaccurate nature; > especially the initial value may be wildly wrong on fresh system > bootup. This is also true for your proposed interfaces, see below. > > > So my personal preference would be to have bdi->dirty_background_ratio and > > bdi->dirty_background_bytes and start background writeback whenever > > one of global background limit and per-bdi background limit is exceeded. I > > think this interface will do the job as well and it's easier to maintain in > > future. > > bdi->dirty_background_ratio, if I understand its semantics right, is > unfortunately flaky in the same principle as centisecs-to-write, > because it relies on the (implicit estimation of) writeout > proportions. The writeout proportions for each bdi start with 0, > which is even worse than the 100MB/s initial value for > bdi->write_bandwidth and will trigger background writeback on the > first write. Well, I meant bdi->dirty_background_ratio wouldn't use writeout proportion estimates at all. Limit would be dirtiable_memory * bdi->dirty_background_ratio. 
After all we want to start writeout to bdi when we have enough pages to reasonably load the device for a while which has nothing to do with how much is written to this device as compared to other devices. OTOH I'm not particularly attached to this interface. Especially since on a lot of today's machines, 1% is rather big so people might often end up using dirty_background_bytes anyway. > bdi->dirty_background_bytes is, however, reliable, and gives users > total control. If we export this interface alone, I'd imagine users > who want to control centisecs-to-write could run a simple script to > periodically get the write bandwith value out of the existing bdi > interface and echo it into bdi->dirty_background_bytes. Which makes > simple yet good enough centisecs-to-write controlling. > > So what do you think about exporting a really dumb > bdi->dirty_background_bytes, which will effectively give smart users > the freedom to do smart control over per-bdi background writeback > threshold? The users are offered the freedom to do his own bandwidth >
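The "simple script" Wu describes might look something like the sketch below. It assumes the proposed (not yet merged) per-bdi knob is exported at /sys/block/&lt;dev&gt;/bdi/dirty_background_bytes, and that the kernel's current bandwidth estimate can be read from the BdiWriteBandwidth line of the debugfs stats file (reported in KB/s); the device name, bdi id, and tuning interval are illustrative.

```shell
#!/bin/sh
# Sketch: emulate "centisecs-to-write" on top of the proposed per-bdi
# dirty_background_bytes knob, using the kernel's own bandwidth estimate
# from debugfs ("BdiWriteBandwidth", in KB/s).

# Convert an estimated bandwidth (KB/s) and a target expressed in
# centiseconds of writeback into a byte threshold.
bw_to_bytes() {
    # $1 = bandwidth in KB/s, $2 = target in centisecs
    echo $(( $1 * 1024 * $2 / 100 ))
}

# Re-tune every 10 seconds (guarded so the helper can also be used on
# its own; bdi "8:0" and device "sda" are example names).
if [ "${RUN_TUNER:-0}" = 1 ]; then
    while sleep 10; do
        kbps=$(awk '/BdiWriteBandwidth/ { print $2 }' \
                   /sys/kernel/debug/bdi/8:0/stats)
        bw_to_bytes "$kbps" 100 > /sys/block/sda/bdi/dirty_background_bytes
    done
fi
```

For example, with an estimated 25 MB/s (25600 KB/s) and a 100-centisecond target, this writes 26214400 bytes (25 MB) into the threshold, matching the 25 MB value used in the NFS tests elsewhere in the thread.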
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/9/25, Namjae Jeon : > 2012/9/25, Dave Chinner : >> On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote: >>> [ CC FS and MM lists ] >>> >>> Patch looks good to me, however we need to be careful because it's >>> introducing a new interface. So it's desirable to get some acks from >>> the FS/MM developers. >>> >>> Thanks, >>> Fengguang >>> >>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: >>> > From: Namjae Jeon >>> > >>> > This patch is based on suggestion by Wu Fengguang: >>> > https://lkml.org/lkml/2011/8/19/19 >>> > >>> > kernel has mechanism to do writeback as per dirty_ratio and >>> > dirty_background >>> > ratio. It also maintains per task dirty rate limit to keep balance of >>> > dirty pages at any given instance by doing bdi bandwidth estimation. >>> > >>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of >>> > writecache to control per bdi dirty limits and task throttling. >>> > >>> > However, there might be a usecase where user wants a per bdi writeback >>> > tuning >>> > parameter to flush dirty data once per bdi dirty data reach a >>> > threshold >>> > especially at NFS server. >>> > >>> > dirty_background_centisecs provides an interface where user can tune >>> > background writeback start threshold using >>> > /sys/block/sda/bdi/dirty_background_centisecs >>> > >>> > dirty_background_centisecs is used alongwith average bdi write >>> > bandwidth >>> > estimation to start background writeback. >>> > >>> > One of the use case to demonstrate the patch functionality can be >>> > on NFS setup:- >>> > We have a NFS setup with ethernet line of 100Mbps, while the USB >>> > disk is attached to server, which has a local speed of 25MBps. Server >>> > and client both are arm target boards. >>> > >>> > Now if we perform a write operation over NFS (client to server), as >>> > per the network speed, data can travel at max speed of 100Mbps. 
But >>> > if we check the default write speed of USB hdd over NFS it comes >>> > around to 8MB/sec, far below the speed of network. >>> > >>> > Reason being is as per the NFS logic, during write operation, >>> > initially >>> > pages are dirtied on NFS client side, then after reaching the dirty >>> > threshold/writeback limit (or in case of sync) data is actually sent >>> > to NFS server (so now again pages are dirtied on server side). This >>> > will be done in COMMIT call from client to server i.e if 100MB of data >>> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 >>> > seconds. >>> > >>> > After the data is received, now it will take approx 100/25 ~4 Seconds >>> > to >>> > write the data to USB Hdd on server side. Hence making the overall >>> > time >>> > to write this much of data ~12 seconds, which in practically comes out >>> > to >>> > be near 7 to 8MB/second. After this a COMMIT response will be sent to >>> > NFS >>> > client. >>> > >>> > However we may improve this write performace by making the use of NFS >>> > server idle time i.e while data is being received from the client, >>> > simultaneously initiate the writeback thread on server side. So >>> > instead >>> > of waiting for the complete data to come and then start the writeback, >>> > we can work in parallel while the network is still busy in receiving >>> > the >>> > data. Hence in this way overall performace will be improved. >>> > >>> > If we tune dirty_background_centisecs, we can see there >>> > is increase in the performace and it comes out to be ~ 11MB/seconds. >>> > Results are:- >>> > >>> > Write test(create a 1 GB file) result at 'NFS client' after changing >>> > /sys/block/sda/bdi/dirty_background_centisecs >>> > on *** NFS Server only - not on NFS Client >> > > Hi. Dave. > >> What is the configuration of the client and server? How much RAM, >> what their dirty_* parameters are set to, network speed, server disk >> speed for local sequential IO, etc? 
> these results are on ARM, 512MB RAM and XFS over NFS with default > writeback settings(only our writeback setting - dirty_background_cen > tisecs changed at nfs server only). Network speed is ~100MB/sec and Sorry, there is typo:) ^^100Mb/sec > local disk speed is ~25MB/sec. > >> >>> > - >>> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server | >>> > - >>> > | | default = 0 | 300 centisec| 200 centisec| 100 centisec | >>> > - >>> > |RecSize | WriteSpeed | WriteSpeed | WriteSpeed | WriteSpeed | >>> > - >>> > |10485760 | 8.44MB/sec | 8.60MB/sec | 9.30MB/sec | 10.27MB/sec | >>> > | 1048576 | 8.48MB/sec | 8.87MB/sec | 9.31MB/sec | 10.34MB/sec | >>> > | 524288 | 8.37MB/sec | 8.42MB/sec | 9.84MB/sec | 10.47MB/sec | >>> > | 262144 |
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/9/25, Jan Kara :
> On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > From: Namjae Jeon
>> >
>> > [...]
>> >
>> > dirty_background_centisecs provides an interface where the user can tune the background writeback start threshold using /sys/block/sda/bdi/dirty_background_centisecs. It is used along with the average bdi write bandwidth estimation to start background writeback.
>
> The functionality you describe, i.e. start flushing a bdi when there's a reasonable amount of dirty data on it, looks sensible and useful. However, I'm not so sure whether the interface you propose is the right one. Traditionally, we allow the user to set the amount of dirty data (either in bytes or as a percentage of memory) at which background writeback should start. You propose setting the amount of data in centisecs-to-write. Why that difference? Also, this interface ties our throughput estimation code (which is an implementation detail of the current dirty throttling) to the userspace API. So we'd have to maintain the estimation code forever, possibly also face problems when we change the estimation code (and thus the estimates in some cases), and users will complain that the values they originally set no longer work as they used to.
>
> Also, as with each knob, there's the problem of how to properly set its value. Most admins won't know about the knob and so won't touch it. Others might know about the knob but will have a hard time figuring out what value they should set. So if there's a new knob, it should have a sensible initial value. And since this feature looks like a useful one, it shouldn't be zero.
>
> So my personal preference would be to have bdi->dirty_background_ratio and bdi->dirty_background_bytes, and start background writeback whenever either the global background limit or the per-bdi background limit is exceeded. I think this interface will do the job as well, and it's easier to maintain in the future.

Hi Jan.
Thanks for the review and your opinion.

Hi Wu.
How about adding the per-bdi bdi->dirty_background_ratio and bdi->dirty_background_bytes interfaces as suggested by Jan?

Thanks.

> Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
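Jan's suggestion - kick background writeback whenever either the global background limit or the per-bdi background limit is exceeded, with the per-bdi limit given as a flat bdi->dirty_background_bytes or computed as dirtiable_memory * bdi->dirty_background_ratio - could be sketched roughly as follows. This is illustrative userspace C, not the actual kernel patch; the struct, function, and parameter names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the proposed per-bdi tunables. */
struct bdi_tunables {
    uint64_t dirty_background_bytes;   /* 0 = not set */
    unsigned dirty_background_ratio;   /* percent of dirtiable memory */
};

/* Start background writeback when either the global background limit
 * or the per-bdi background limit is exceeded. */
static bool over_bground_thresh(uint64_t global_dirty, uint64_t global_thresh,
                                uint64_t bdi_dirty, uint64_t dirtiable_mem,
                                const struct bdi_tunables *t)
{
    uint64_t bdi_thresh = t->dirty_background_bytes;

    /* A bytes value, if set, overrides the ratio (mirroring how the
     * existing global bytes/ratio pair behaves). */
    if (!bdi_thresh && t->dirty_background_ratio)
        bdi_thresh = dirtiable_mem * t->dirty_background_ratio / 100;

    if (global_dirty > global_thresh)
        return true;
    return bdi_thresh && bdi_dirty > bdi_thresh;
}
```

Note that a bdi_thresh of 0 (neither tunable set) never triggers on its own, so the global limit keeps working unchanged for untuned bdis.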
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/9/25, Dave Chinner :
> On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
>> [ CC FS and MM lists ]
>>
>> Patch looks good to me, however we need to be careful because it's introducing a new interface. So it's desirable to get some acks from the FS/MM developers.
>>
>> Thanks,
>> Fengguang
>>
>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > From: Namjae Jeon
>> >
>> > [...]
>> >
>> > One use case to demonstrate the patch functionality is an NFS setup: an Ethernet line of 100Mbps, with a USB disk attached to the server that has a local write speed of 25MB/s. Server and client are both ARM target boards.
>> >
>> > Now, if we perform a write operation over NFS (client to server), the data can travel at a max speed of 100Mbps, as per the network speed. But if we check the default write speed of the USB hdd over NFS, it comes out to around 8MB/sec, far below the speed of the network.
>> >
>> > The reason is that, as per the NFS logic, pages are initially dirtied on the NFS client side during a write operation; then, after the dirty threshold/writeback limit is reached (or in case of sync), the data is actually sent to the NFS server (so pages are dirtied again, now on the server side). This is done in the COMMIT call from client to server, i.e. if 100MB of data is dirtied and sent, it will take a minimum of 100MB over the ~100Mbps link, roughly 8-9 seconds.
>> >
>> > After the data is received, it will take approx 100/25 ~ 4 seconds to write it to the USB hdd on the server side, making the overall time to write this much data ~12 seconds, which in practice comes out to near 7 to 8MB/second. After this, a COMMIT response is sent to the NFS client.
>> >
>> > However, we may improve this write performance by making use of the NFS server's idle time, i.e. while data is being received from the client, simultaneously initiate the writeback thread on the server side. Instead of waiting for the complete data to arrive and then starting writeback, we can work in parallel while the network is still busy receiving the data, improving overall performance.
>> >
>> > If we tune dirty_background_centisecs, we can see an increase in performance, to ~11MB/second. Results are:
>> >
>> > Write test (create a 1 GB file) result at 'NFS client' after changing /sys/block/sda/bdi/dirty_background_centisecs on *** NFS Server only - not on NFS Client ***

Hi. Dave.

> What is the configuration of the client and server? How much RAM, what are their dirty_* parameters set to, network speed, server disk speed for local sequential IO, etc?

These results are on ARM, 512MB RAM and XFS over NFS with default writeback settings (only our writeback setting - dirty_background_centisecs - changed, at the NFS server only). Network speed is ~100MB/sec and local disk speed is ~25MB/sec.

>> > ---------------------------------------------------------------------
>> > | WRITE Test with various 'dirty_background_centisecs' at NFS Server |
>> > ---------------------------------------------------------------------
>> > |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
>> > | RecSize  | WriteSpeed  | WriteSpeed  | WriteSpeed  | WriteSpeed   |
>> > ---------------------------------------------------------------------
>> > | 10485760 | 8.44MB/sec  | 8.60MB/sec  | 9.30MB/sec  | 10.27MB/sec  |
>> > |  1048576 | 8.48MB/sec  | 8.87MB/sec  | 9.31MB/sec  | 10.34MB/sec  |
>> > |   524288 | 8.37MB/sec  | 8.42MB/sec  | 9.84MB/sec  | 10.47MB/sec  |
>> > |   262144 | 8.16MB/sec  | 8.51MB/sec  | 9.52MB/sec  | 10.62MB/sec  |
>> > |   131072 | 8.48MB/sec  | 8.81MB/sec  | 9.42MB/sec  | 10.55MB/sec  |
>> > |    65536 | 8.38MB/sec  | 9.09MB/sec  | 9.76MB/sec  | 10.53MB/sec  |
>> > |    32768 | 8.65MB/sec  | 9.00MB/sec  | 9.57MB/sec  | 10.54MB/sec  |
>> > |    16384 | 8.27MB/sec  | 8.80MB/sec  | 9.39MB/sec  | 10.43MB/sec  |
>> > |     8192 | 8.52MB/sec  | 8.70MB/sec  | 9.40MB/sec  | 10.50MB/sec  |
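As a sanity check on the arithmetic in the quoted use case, the serial vs. overlapped write path can be modelled back-of-envelope as below. The figures come straight from the thread; the function names are illustrative, not from the patch.

```c
/* Model of the two write paths: serial (receive everything over the
 * network, then write to disk) vs. overlapped (writeback runs while
 * data is still arriving). */

#define NET_MBPS   12.5   /* ~100 Mb/s Ethernet expressed in MB/s */
#define DISK_MBPS  25.0   /* local USB disk write speed */
#define SIZE_MB    100.0  /* one COMMIT's worth of dirty data */

/* Serial path: network transfer and disk write happen back to back. */
static double serial_mbps(void)
{
    double secs = SIZE_MB / NET_MBPS + SIZE_MB / DISK_MBPS; /* 8s + 4s */
    return SIZE_MB / secs;                                  /* ~8.3 MB/s */
}

/* Overlapped path: the slower stage (the network here) sets the
 * throughput ceiling. */
static double overlapped_mbps(void)
{
    return NET_MBPS < DISK_MBPS ? NET_MBPS : DISK_MBPS;     /* 12.5 MB/s */
}
```

The serial figure (~8.3 MB/s) matches the observed ~8.4 MB/s default; the overlapped 12.5 MB/s is only a ceiling, and the measured ~10.5 MB/s with dirty_background_centisecs=100 sits reasonably below it.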
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/9/25, Jan Kara j...@suse.cz: On Thu 20-09-12 16:44:22, Wu Fengguang wrote: On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: From: Namjae Jeon namjae.j...@samsung.com This patch is based on suggestion by Wu Fengguang: https://lkml.org/lkml/2011/8/19/19 kernel has mechanism to do writeback as per dirty_ratio and dirty_background ratio. It also maintains per task dirty rate limit to keep balance of dirty pages at any given instance by doing bdi bandwidth estimation. Kernel also has max_ratio/min_ratio tunables to specify percentage of writecache to control per bdi dirty limits and task throttling. However, there might be a usecase where user wants a per bdi writeback tuning parameter to flush dirty data once per bdi dirty data reach a threshold especially at NFS server. dirty_background_centisecs provides an interface where user can tune background writeback start threshold using /sys/block/sda/bdi/dirty_background_centisecs dirty_background_centisecs is used alongwith average bdi write bandwidth estimation to start background writeback. The functionality you describe, i.e. start flushing bdi when there's reasonable amount of dirty data on it, looks sensible and useful. However I'm not so sure whether the interface you propose is the right one. Traditionally, we allow user to set amount of dirty data (either in bytes or percentage of memory) when background writeback should start. You propose setting the amount of data in centisecs-to-write. Why that difference? Also this interface ties our throughput estimation code (which is an implementation detail of current dirty throttling) with the userspace API. So we'd have to maintain the estimation code forever, possibly also face problems when we change the estimation code (and thus estimates in some cases) and users will complain that the values they set originally no longer work as they used to. Also, as with each knob, there's a problem how to properly set its value? 
Most admins won't know about the knob and so won't touch it. Others might know about the knob but will have a hard time figuring out what value they should set. So if there's a new knob, it should have a sensible initial value. And since this feature looks like a useful one, it shouldn't be zero. So my personal preference would be to have bdi->dirty_background_ratio and bdi->dirty_background_bytes and start background writeback whenever either the global background limit or the per-bdi background limit is exceeded. I think this interface will do the job as well and it's easier to maintain in future. Hi Jan. Thanks for review and your opinion. Hi. Wu. How about adding per-bdi bdi->dirty_background_ratio and bdi->dirty_background_bytes interfaces as suggested by Jan? Thanks.
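Jan's proposal above — kick background writeback when either the global background limit or the per-bdi limit is exceeded, with bytes taking precedence over ratio as for the existing global knobs — can be sketched roughly as follows. This is illustrative Python, not the kernel code; all names, units (pages), and the bytes-over-ratio precedence are assumptions for the sketch.

```python
PAGE_SIZE = 4096

def bdi_background_thresh(bdi_dirty_background_bytes,
                          bdi_dirty_background_ratio,
                          bdi_max_dirty_pages):
    """Per-bdi background threshold in pages; the bytes knob takes precedence."""
    if bdi_dirty_background_bytes:
        return bdi_dirty_background_bytes // PAGE_SIZE
    if bdi_dirty_background_ratio:
        return bdi_max_dirty_pages * bdi_dirty_background_ratio // 100
    return None  # both knobs unset: only the global limit applies

def over_bground_thresh(global_dirty, global_background_thresh,
                        bdi_dirty, bdi_thresh):
    """Start background writeback if EITHER limit is exceeded."""
    if global_dirty > global_background_thresh:
        return True
    return bdi_thresh is not None and bdi_dirty > bdi_thresh
```

With bdi->dirty_background_bytes = 25MB (the local disk's per-second bandwidth in the tests above), a bdi would start flushing once ~6400 dirty pages accumulate on it, regardless of how far the system is from the global limit.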
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
2012/9/25, Namjae Jeon linkinj...@gmail.com: 2012/9/25, Dave Chinner da...@fromorbit.com: On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote: [ CC FS and MM lists ] Patch looks good to me, however we need to be careful because it's introducing a new interface. So it's desirable to get some acks from the FS/MM developers. Thanks, Fengguang On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: From: Namjae Jeon namjae.j...@samsung.com This patch is based on suggestion by Wu Fengguang: https://lkml.org/lkml/2011/8/19/19 kernel has mechanism to do writeback as per dirty_ratio and dirty_background ratio. It also maintains per task dirty rate limit to keep balance of dirty pages at any given instance by doing bdi bandwidth estimation. Kernel also has max_ratio/min_ratio tunables to specify percentage of writecache to control per bdi dirty limits and task throttling. However, there might be a usecase where user wants a per bdi writeback tuning parameter to flush dirty data once per bdi dirty data reach a threshold especially at NFS server. dirty_background_centisecs provides an interface where user can tune background writeback start threshold using /sys/block/sda/bdi/dirty_background_centisecs dirty_background_centisecs is used alongwith average bdi write bandwidth estimation to start background writeback. One of the use case to demonstrate the patch functionality can be on NFS setup:- We have a NFS setup with ethernet line of 100Mbps, while the USB disk is attached to server, which has a local speed of 25MBps. Server and client both are arm target boards. Now if we perform a write operation over NFS (client to server), as per the network speed, data can travel at max speed of 100Mbps. But if we check the default write speed of USB hdd over NFS it comes around to 8MB/sec, far below the speed of network. 
Reason being is as per the NFS logic: during a write operation, pages are initially dirtied on the NFS client side; then, after reaching the dirty threshold/writeback limit (or in case of sync), data is actually sent to the NFS server (so pages are dirtied again on the server side). This is done in the COMMIT call from client to server, i.e. if 100MB of data is dirtied and sent, it will take a minimum of 100MB/100Mbps ~ 8-9 seconds. After the data is received, it will take approx 100MB/25MBps ~ 4 seconds to write the data to the USB HDD on the server side, making the overall time to write this much data ~12 seconds, which practically comes out to be near 7-8MB/sec. After this, a COMMIT response is sent to the NFS client. However, we may improve this write performance by making use of the NFS server's idle time, i.e. while data is being received from the client, simultaneously initiate the writeback thread on the server side. So instead of waiting for the complete data to arrive before starting writeback, we can work in parallel while the network is still busy receiving the data. In this way overall performance is improved. If we tune dirty_background_centisecs, we see an increase in performance, which comes out to be ~11MB/sec. Results are:- Write test (create a 1GB file) result at 'NFS client' after changing /sys/block/sda/bdi/dirty_background_centisecs on *** NFS Server only - not on NFS Client Hi. Dave. What is the configuration of the client and server? How much RAM, what their dirty_* parameters are set to, network speed, server disk speed for local sequential IO, etc? These results are on ARM, 512MB RAM and XFS over NFS with default writeback settings (only our writeback setting - dirty_background_centisecs - changed at the NFS server only). Network speed is ~100MB/sec and Sorry, there is typo:) ^^100Mb/sec local disk speed is ~25MB/sec.
---------------------------------------------------------------------
|WRITE Test with various 'dirty_background_centisecs' at NFS Server |
---------------------------------------------------------------------
|         | default = 0 | 300 centisec| 200 centisec| 100 centisec |
|RecSize  | WriteSpeed  | WriteSpeed  | WriteSpeed  | WriteSpeed   |
---------------------------------------------------------------------
|10485760 | 8.44MB/sec  | 8.60MB/sec  | 9.30MB/sec  | 10.27MB/sec  |
| 1048576 | 8.48MB/sec  | 8.87MB/sec  | 9.31MB/sec  | 10.34MB/sec  |
|  524288 | 8.37MB/sec  | 8.42MB/sec  | 9.84MB/sec  | 10.47MB/sec  |
|  262144 | 8.16MB/sec  | 8.51MB/sec  | 9.52MB/sec  | 10.62MB/sec  |
|  131072 | 8.48MB/sec  | 8.81MB/sec  | 9.42MB/sec  | 10.55MB/sec  |
|   65536 | 8.38MB/sec  | 9.09MB/sec  | 9.76MB/sec  | 10.53MB/sec  |
|   32768 | 8.65MB/sec  | 9.00MB/sec  | 9.57MB/sec  | 10.54MB/sec  |
|   16384 | 8.27MB/sec  | 8.80MB/sec  | 9.39MB/sec  | 10.43MB/sec  |
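The serial-vs-overlapped timing argument in the message above can be checked with a quick back-of-the-envelope calculation. The numbers (100MB per COMMIT cycle, 100Mbps line, 25MB/s disk) are the ones quoted in the thread; the calculation itself is just a sanity check, not a model of NFS internals.

```python
MB_PER_COMMIT = 100   # data dirtied per COMMIT cycle (MB)
NET_MBPS = 100        # network line rate, megabits/s
DISK_MB_S = 25        # local USB disk write speed, MB/s

net_secs = MB_PER_COMMIT * 8 / NET_MBPS   # ~8 s to receive 100 MB over 100Mbps
disk_secs = MB_PER_COMMIT / DISK_MB_S     # ~4 s to write it to the USB HDD

# receive the whole payload, THEN write it out (default behaviour)
serial_throughput = MB_PER_COMMIT / (net_secs + disk_secs)
# start writeback while the network is still receiving (tuned behaviour)
overlap_throughput = MB_PER_COMMIT / max(net_secs, disk_secs)

print(round(serial_throughput, 1))   # ~8.3 MB/s, matching the observed 7-8 MB/s
print(round(overlap_throughput, 1))  # 12.5 MB/s ceiling; ~10-11 MB/s measured
```

Full overlap gives a 12.5MB/s upper bound, so the measured ~10-11MB/s with dirty_background_centisecs=100 is close to the best this pipeline can do.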
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote: > [ CC FS and MM lists ] > > Patch looks good to me, however we need to be careful because it's > introducing a new interface. So it's desirable to get some acks from > the FS/MM developers. > > Thanks, > Fengguang > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > > From: Namjae Jeon > > > > This patch is based on suggestion by Wu Fengguang: > > https://lkml.org/lkml/2011/8/19/19 > > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background > > ratio. It also maintains per task dirty rate limit to keep balance of > > dirty pages at any given instance by doing bdi bandwidth estimation. > > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > > writecache to control per bdi dirty limits and task throttling. > > > > However, there might be a usecase where user wants a per bdi writeback > > tuning > > parameter to flush dirty data once per bdi dirty data reach a threshold > > especially at NFS server. > > > > dirty_background_centisecs provides an interface where user can tune > > background writeback start threshold using > > /sys/block/sda/bdi/dirty_background_centisecs > > > > dirty_background_centisecs is used alongwith average bdi write bandwidth > > estimation to start background writeback. > > > > One of the use case to demonstrate the patch functionality can be > > on NFS setup:- > > We have a NFS setup with ethernet line of 100Mbps, while the USB > > disk is attached to server, which has a local speed of 25MBps. Server > > and client both are arm target boards. > > > > Now if we perform a write operation over NFS (client to server), as > > per the network speed, data can travel at max speed of 100Mbps. But > > if we check the default write speed of USB hdd over NFS it comes > > around to 8MB/sec, far below the speed of network. 
> > Reason being is as per the NFS logic: during a write operation, pages are
> > initially dirtied on the NFS client side; then, after reaching the dirty
> > threshold/writeback limit (or in case of sync), data is actually sent to
> > the NFS server (so pages are dirtied again on the server side). This is
> > done in the COMMIT call from client to server, i.e. if 100MB of data is
> > dirtied and sent, it will take a minimum of 100MB/100Mbps ~ 8-9 seconds.
> >
> > After the data is received, it will take approx 100MB/25MBps ~ 4 seconds
> > to write the data to the USB HDD on the server side, making the overall
> > time to write this much data ~12 seconds, which practically comes out to
> > be near 7-8MB/sec. After this, a COMMIT response is sent to the NFS
> > client.
> >
> > However, we may improve this write performance by making use of the NFS
> > server's idle time, i.e. while data is being received from the client,
> > simultaneously initiate the writeback thread on the server side. So
> > instead of waiting for the complete data to arrive before starting
> > writeback, we can work in parallel while the network is still busy
> > receiving the data. In this way overall performance is improved.
> >
> > If we tune dirty_background_centisecs, we see an increase in performance,
> > which comes out to be ~11MB/sec. Results are:-
> >
> > Write test (create a 1GB file) result at 'NFS client' after changing
> > /sys/block/sda/bdi/dirty_background_centisecs
> > on *** NFS Server only - not on NFS Client

What is the configuration of the client and server? How much RAM, what their dirty_* parameters are set to, network speed, server disk speed for local sequential IO, etc? 
> > ---------------------------------------------------------------------
> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> > ---------------------------------------------------------------------
> > |         | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> > |RecSize  | WriteSpeed  | WriteSpeed  | WriteSpeed  | WriteSpeed   |
> > ---------------------------------------------------------------------
> > |10485760 | 8.44MB/sec  | 8.60MB/sec  | 9.30MB/sec  | 10.27MB/sec  |
> > | 1048576 | 8.48MB/sec  | 8.87MB/sec  | 9.31MB/sec  | 10.34MB/sec  |
> > |  524288 | 8.37MB/sec  | 8.42MB/sec  | 9.84MB/sec  | 10.47MB/sec  |
> > |  262144 | 8.16MB/sec  | 8.51MB/sec  | 9.52MB/sec  | 10.62MB/sec  |
> > |  131072 | 8.48MB/sec  | 8.81MB/sec  | 9.42MB/sec  | 10.55MB/sec  |
> > |   65536 | 8.38MB/sec  | 9.09MB/sec  | 9.76MB/sec  | 10.53MB/sec  |
> > |   32768 | 8.65MB/sec  | 9.00MB/sec  | 9.57MB/sec  | 10.54MB/sec  |
> > |   16384 | 8.27MB/sec  | 8.80MB/sec  | 9.39MB/sec  | 10.43MB/sec  |
> > |    8192 | 8.52MB/sec  | 8.70MB/sec  | 9.40MB/sec  | 10.50MB/sec  |
> > |    4096 | 8.20MB/sec  | 8.63MB/sec  | 9.80MB/sec  | 10.35MB/sec  |
> > ---------------------------------------------------------------------

While this set of numbers looks
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
On Thu 20-09-12 16:44:22, Wu Fengguang wrote: > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > > From: Namjae Jeon > > > > This patch is based on suggestion by Wu Fengguang: > > https://lkml.org/lkml/2011/8/19/19 > > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background > > ratio. It also maintains per task dirty rate limit to keep balance of > > dirty pages at any given instance by doing bdi bandwidth estimation. > > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > > writecache to control per bdi dirty limits and task throttling. > > > > However, there might be a usecase where user wants a per bdi writeback > > tuning > > parameter to flush dirty data once per bdi dirty data reach a threshold > > especially at NFS server. > > > > dirty_background_centisecs provides an interface where user can tune > > background writeback start threshold using > > /sys/block/sda/bdi/dirty_background_centisecs > > > > dirty_background_centisecs is used alongwith average bdi write bandwidth > > estimation to start background writeback. The functionality you describe, i.e. start flushing bdi when there's reasonable amount of dirty data on it, looks sensible and useful. However I'm not so sure whether the interface you propose is the right one. Traditionally, we allow user to set amount of dirty data (either in bytes or percentage of memory) when background writeback should start. You propose setting the amount of data in centisecs-to-write. Why that difference? Also this interface ties our throughput estimation code (which is an implementation detail of current dirty throttling) with the userspace API. So we'd have to maintain the estimation code forever, possibly also face problems when we change the estimation code (and thus estimates in some cases) and users will complain that the values they set originally no longer work as they used to. Also, as with each knob, there's a problem how to properly set its value? 
Most admins won't know about the knob and so won't touch it. Others might know about the knob but will have hard time figuring out what value should they set. So if there's a new knob, it should have a sensible initial value. And since this feature looks like a useful one, it shouldn't be zero. So my personal preference would be to have bdi->dirty_background_ratio and bdi->dirty_background_bytes and start background writeback whenever one of global background limit and per-bdi background limit is exceeded. I think this interface will do the job as well and it's easier to maintain in future. Honza > > One of the use case to demonstrate the patch functionality can be > > on NFS setup:- > > We have a NFS setup with ethernet line of 100Mbps, while the USB > > disk is attached to server, which has a local speed of 25MBps. Server > > and client both are arm target boards. > > > > Now if we perform a write operation over NFS (client to server), as > > per the network speed, data can travel at max speed of 100Mbps. But > > if we check the default write speed of USB hdd over NFS it comes > > around to 8MB/sec, far below the speed of network. > > > > Reason being is as per the NFS logic, during write operation, initially > > pages are dirtied on NFS client side, then after reaching the dirty > > threshold/writeback limit (or in case of sync) data is actually sent > > to NFS server (so now again pages are dirtied on server side). This > > will be done in COMMIT call from client to server i.e if 100MB of data > > is dirtied and sent then it will take minimum 100MB/100Mbps ~ 8-9 seconds. > > > > After the data is received, now it will take approx 100/25 ~4 Seconds to > > write the data to USB Hdd on server side. Hence making the overall time > > to write this much of data ~12 seconds, which in practically comes out to > > be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS > > client. 
> > > > However we may improve this write performace by making the use of NFS > > server idle time i.e while data is being received from the client, > > simultaneously initiate the writeback thread on server side. So instead > > of waiting for the complete data to come and then start the writeback, > > we can work in parallel while the network is still busy in receiving the > > data. Hence in this way overall performace will be improved. > > > > If we tune dirty_background_centisecs, we can see there > > is increase in the performace and it comes out to be ~ 11MB/seconds. > > Results are:- > > > > Write test(create a 1 GB file) result at 'NFS client' after changing > > /sys/block/sda/bdi/dirty_background_centisecs > > on *** NFS Server only - not on NFS Client > > > > - > > |WRITE Test with various 'dirty_background_centisecs' at NFS Server | > >
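Jan's objection above is that a centisecs knob couples the userspace API to the kernel's internal write-bandwidth estimate. A rough sketch of that mapping (illustrative only; not the actual kernel arithmetic, and the function name is invented here) shows why: the effective byte threshold silently moves whenever the bandwidth estimate changes.

```python
def background_thresh_bytes(dirty_background_centisecs, est_write_bandwidth):
    """Effective background threshold in bytes.

    est_write_bandwidth is the bdi's estimated write bandwidth in bytes/s;
    the knob expresses "flush once this many centiseconds' worth of dirty
    data has accumulated".
    """
    return est_write_bandwidth * dirty_background_centisecs // 100

# 100 centisecs (1 s) at an estimated 25 MB/s: flush once ~25 MB is dirty.
print(background_thresh_bytes(100, 25 * 1024 * 1024))
# Same knob value after the estimate drifts to 10 MB/s: threshold is now 10 MB,
# even though the admin changed nothing.
print(background_thresh_bytes(100, 10 * 1024 * 1024))
```

A bdi->dirty_background_bytes knob avoids this: the threshold the admin sets is the threshold the kernel uses.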
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
[ CC FS and MM lists ] Patch looks good to me, however we need to be careful because it's introducing a new interface. So it's desirable to get some acks from the FS/MM developers. Thanks, Fengguang On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: > From: Namjae Jeon > > This patch is based on suggestion by Wu Fengguang: > https://lkml.org/lkml/2011/8/19/19 > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background > ratio. It also maintains per task dirty rate limit to keep balance of > dirty pages at any given instance by doing bdi bandwidth estimation. > > Kernel also has max_ratio/min_ratio tunables to specify percentage of > writecache to control per bdi dirty limits and task throttling. > > However, there might be a usecase where user wants a per bdi writeback tuning > parameter to flush dirty data once per bdi dirty data reach a threshold > especially at NFS server. > > dirty_background_centisecs provides an interface where user can tune > background writeback start threshold using > /sys/block/sda/bdi/dirty_background_centisecs > > dirty_background_centisecs is used alongwith average bdi write bandwidth > estimation to start background writeback. > > One of the use case to demonstrate the patch functionality can be > on NFS setup:- > We have a NFS setup with ethernet line of 100Mbps, while the USB > disk is attached to server, which has a local speed of 25MBps. Server > and client both are arm target boards. > > Now if we perform a write operation over NFS (client to server), as > per the network speed, data can travel at max speed of 100Mbps. But > if we check the default write speed of USB hdd over NFS it comes > around to 8MB/sec, far below the speed of network. 
> Reason being is as per the NFS logic: during a write operation, pages are
> initially dirtied on the NFS client side; then, after reaching the dirty
> threshold/writeback limit (or in case of sync), data is actually sent to the
> NFS server (so pages are dirtied again on the server side). This is done in
> the COMMIT call from client to server, i.e. if 100MB of data is dirtied and
> sent, it will take a minimum of 100MB/100Mbps ~ 8-9 seconds.
>
> After the data is received, it will take approx 100MB/25MBps ~ 4 seconds to
> write the data to the USB HDD on the server side, making the overall time to
> write this much data ~12 seconds, which practically comes out to be near
> 7-8MB/sec. After this, a COMMIT response is sent to the NFS client.
>
> However, we may improve this write performance by making use of the NFS
> server's idle time, i.e. while data is being received from the client,
> simultaneously initiate the writeback thread on the server side. So instead
> of waiting for the complete data to arrive before starting writeback, we can
> work in parallel while the network is still busy receiving the data. In this
> way overall performance is improved.
>
> If we tune dirty_background_centisecs, we see an increase in performance,
> which comes out to be ~11MB/sec. 
> Results are:-
>
> Write test (create a 1GB file) result at 'NFS client' after changing
> /sys/block/sda/bdi/dirty_background_centisecs
> on *** NFS Server only - not on NFS Client
>
> ---------------------------------------------------------------------
> |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> ---------------------------------------------------------------------
> |         | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> |RecSize  | WriteSpeed  | WriteSpeed  | WriteSpeed  | WriteSpeed   |
> ---------------------------------------------------------------------
> |10485760 | 8.44MB/sec  | 8.60MB/sec  | 9.30MB/sec  | 10.27MB/sec  |
> | 1048576 | 8.48MB/sec  | 8.87MB/sec  | 9.31MB/sec  | 10.34MB/sec  |
> |  524288 | 8.37MB/sec  | 8.42MB/sec  | 9.84MB/sec  | 10.47MB/sec  |
> |  262144 | 8.16MB/sec  | 8.51MB/sec  | 9.52MB/sec  | 10.62MB/sec  |
> |  131072 | 8.48MB/sec  | 8.81MB/sec  | 9.42MB/sec  | 10.55MB/sec  |
> |   65536 | 8.38MB/sec  | 9.09MB/sec  | 9.76MB/sec  | 10.53MB/sec  |
> |   32768 | 8.65MB/sec  | 9.00MB/sec  | 9.57MB/sec  | 10.54MB/sec  |
> |   16384 | 8.27MB/sec  | 8.80MB/sec  | 9.39MB/sec  | 10.43MB/sec  |
> |    8192 | 8.52MB/sec  | 8.70MB/sec  | 9.40MB/sec  | 10.50MB/sec  |
> |    4096 | 8.20MB/sec  | 8.63MB/sec  | 9.80MB/sec  | 10.35MB/sec  |
> ---------------------------------------------------------------------
>
> We can see average write speed is increased to ~10-11MB/sec.
>
> This patch provides the changes per block device, so that we may modify
> dirty_background_centisecs per device; the overall system is not impacted
> by the changes and we get improved performance in certain use cases.
>
> NOTE: dirty_background_centisecs is used along with average bdi write
Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
[ CC FS and MM lists ] Patch looks good to me, however we need to be careful because it's introducing a new interface. So it's desirable to get some acks from the FS/MM developers. Thanks, Fengguang On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote: From: Namjae Jeon namjae.j...@samsung.com This patch is based on suggestion by Wu Fengguang: https://lkml.org/lkml/2011/8/19/19 kernel has mechanism to do writeback as per dirty_ratio and dirty_background ratio. It also maintains per task dirty rate limit to keep balance of dirty pages at any given instance by doing bdi bandwidth estimation. Kernel also has max_ratio/min_ratio tunables to specify percentage of writecache to control per bdi dirty limits and task throttling. However, there might be a usecase where user wants a per bdi writeback tuning parameter to flush dirty data once per bdi dirty data reach a threshold especially at NFS server. dirty_background_centisecs provides an interface where user can tune background writeback start threshold using /sys/block/sda/bdi/dirty_background_centisecs dirty_background_centisecs is used alongwith average bdi write bandwidth estimation to start background writeback. One of the use case to demonstrate the patch functionality can be on NFS setup:- We have a NFS setup with ethernet line of 100Mbps, while the USB disk is attached to server, which has a local speed of 25MBps. Server and client both are arm target boards. Now if we perform a write operation over NFS (client to server), as per the network speed, data can travel at max speed of 100Mbps. But if we check the default write speed of USB hdd over NFS it comes around to 8MB/sec, far below the speed of network. Reason being is as per the NFS logic, during write operation, initially pages are dirtied on NFS client side, then after reaching the dirty threshold/writeback limit (or in case of sync) data is actually sent to NFS server (so now again pages are dirtied on server side). 
This is done in the COMMIT call from client to server, i.e. if 100 MB of data is dirtied and sent, it will take a minimum of 100 MB / 100 Mbps ~ 8-9 seconds. After the data is received, it will take approximately 100/25 ~ 4 seconds to write the data to the USB HDD on the server side. This makes the overall time to write this much data ~12 seconds, which in practice comes out to nearly 7 to 8 MB/s. After this a COMMIT response is sent to the NFS client.

However, we can improve this write performance by making use of the NFS server's idle time, i.e. while data is being received from the client, simultaneously initiate the writeback thread on the server side. So instead of waiting for the complete data to arrive and then starting the writeback, we can work in parallel while the network is still busy receiving the data. In this way the overall performance is improved.

If we tune dirty_background_centisecs, we see an increase in performance: it comes out to ~11 MB/s. Results are:

Write test (create a 1 GB file) results at the NFS client after changing /sys/block/sda/bdi/dirty_background_centisecs on the *** NFS Server only - not on the NFS Client ***

 ----------------------------------------------------------------------
 | WRITE Test with various 'dirty_background_centisecs' at NFS Server |
 ----------------------------------------------------------------------
 |          | default = 0 | 300 centisec | 200 centisec | 100 centisec|
 | RecSize  | WriteSpeed  | WriteSpeed   | WriteSpeed   | WriteSpeed  |
 ----------------------------------------------------------------------
 | 10485760 | 8.44MB/sec  | 8.60MB/sec   | 9.30MB/sec   | 10.27MB/sec |
 |  1048576 | 8.48MB/sec  | 8.87MB/sec   | 9.31MB/sec   | 10.34MB/sec |
 |   524288 | 8.37MB/sec  | 8.42MB/sec   | 9.84MB/sec   | 10.47MB/sec |
 |   262144 | 8.16MB/sec  | 8.51MB/sec   | 9.52MB/sec   | 10.62MB/sec |
 |   131072 | 8.48MB/sec  | 8.81MB/sec   | 9.42MB/sec   | 10.55MB/sec |
 |    65536 | 8.38MB/sec  | 9.09MB/sec   | 9.76MB/sec   | 10.53MB/sec |
 |    32768 | 8.65MB/sec  | 9.00MB/sec   | 9.57MB/sec   | 10.54MB/sec |
 |    16384 | 8.27MB/sec  | 8.80MB/sec   | 9.39MB/sec   | 10.43MB/sec |
 |     8192 | 8.52MB/sec  | 8.70MB/sec   | 9.40MB/sec   | 10.50MB/sec |
 |     4096 | 8.20MB/sec  | 8.63MB/sec   | 9.80MB/sec   | 10.35MB/sec |
 ----------------------------------------------------------------------
We can see that the average write speed increased to ~10-11 MB/s.

This patch provides the changes per block device, so that we may modify dirty_background_centisecs per device; the overall system is not impacted by the changes and we get improved performance in certain use cases.

NOTE: dirty_background_centisecs is used along with the average bdi write bandwidth estimation to start background writeback. But, bdi write