Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-12-05 Thread Namjae Jeon
2012/12/5, Wanpeng Li :
> Hi Namjae,
>
> How about setting bdi->dirty_background_bytes according to bdi_thresh? I
> found an issue in the background flush path while reviewing the code: once
> the background flush threshold is exceeded, wb_check_background_flush will
> kick work on the current per-bdi flusher, but the dirty pages may actually
> have been produced by heavy dirtiers on other bdis rather than on the
> current one. In the worst case the current bdi holds a lot of frequently
> used data and the flush hurts its cache. How about adding a check in
> wb_check_background_flush so that, if the current bdi is not the one
> contributing a large number of dirty pages to the background flush
> threshold (i.e. it is not over bdi->dirty_background_bytes), we don't
> bother it?

Hi Wanpeng.

First, thanks for your suggestion!
Yes, I think it looks reasonable.
I will start checking it.

Thanks.
>
> Regards,
> Wanpeng Li
>
> On Tue, Nov 20, 2012 at 08:18:59AM +0900, Namjae Jeon wrote:
>>2012/10/22, Dave Chinner :
>>> On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
 Hi Dave.

 Test Procedure:

 1) Local USB disk WRITE speed on NFS server is ~25 MB/s

 2) Run WRITE test(create 1 GB file) on NFS Client with default
 writeback settings on NFS Server. By default
 bdi->dirty_background_bytes = 0, that means no change in default
 writeback behaviour

 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
 local USB disk write speed on NFS Server)
 *** only on NFS Server - not on NFS Client ***
>>>
>>> Ok, so the results look good, but it's not really addressing what I
>>> was asking, though.  A typical desktop PC has a disk that can do
>>> 100MB/s and GbE, so I was expecting a test that showed throughput
>>> close to GbE maximums at least (ie. around that 100MB/s). I have 3
>>> year old, low end, low power hardware (atom) that handles twice the
>>> throughput you are testing here, and most current consumer NAS
>>> devices are more powerful than this. IOWs, I think the rates you are
>>> testing at are probably too low even for the consumer NAS market to
>>> consider relevant...
>>>
 --
 Multiple NFS Client test:
 ---
 Sorry - We could not arrange multiple PCs to verify this.
 So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
 ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File
>>>
>>> But this really doesn't tell us anything - it's still only 100Mb/s,
>>> which we'd expect is already getting very close to line rate even
>>> with low powered client hardware.
>>>
>>> What I'm concerned about is the NFS server "sweet spot" - a $10k server
>>> that exports 20TB of storage and can sustain close to a GB/s of NFS
>>> traffic over a single 10GbE link with tens to hundreds of clients.
>>> 100MB/s and 10 clients is about the minimum needed to be able to
>>> extrapolate a little and make an informed guess of how it will scale
>>> up
>>>
 > 1. what's the comparison in performance to typical NFS
 > server writeback parameter tuning? i.e. dirty_background_ratio=5,
 > dirty_ratio=10, dirty_expire_centisecs=1000,
 > dirty_writeback_centisecs=1? i.e. does this change give any
 > benefit over the current common practice for configuring NFS
 > servers?

 Agreed, the above improvement in write speed can also be achieved by
 tuning the existing write-back parameters.
 But changing those settings changes write-back behavior system wide.
 On the other hand, changing the proposed per-bdi setting,
 bdi->dirty_background_bytes, only changes write-back behavior for the
 block device exported by the NFS server.
>>>
>>> I already know what the difference between global vs per-bdi tuning
>>> means.  What I want to know is how your results compare
>>> *numerically* to just having a tweaked global setting on a vanilla
>>> kernel.  i.e. is there really any performance benefit to per-bdi
>>> configuration that cannot be gained by existing methods?
>>>
 > 2. what happens when you have 10 clients all writing to the server
 > at once? Or a 100? NFS servers rarely have a single writer to a
 > single file at a time, so what impact does this change have on
 > multiple concurrent file write performance from multiple clients

 Sorry, we could not arrange more than 2 PCs for verifying this.
>>>
>>> Really? Well, perhaps there's some tools that might be useful for
>>> you here:
>>>
>>> http://oss.sgi.com/projects/nfs/testtools/
>>>
>>> "Weber
>>>
>>> Test load generator for NFS. Uses multiple threads, multiple
>>> sockets and multiple IP addresses to simulate loads from many
>>> machines, thus enabling testing of NFS server setups with larger
>>> client counts than can be tested with physical infrastructure (or
>>> Virtual Machine clients). Has been useful in automated NFS testing
>>> and as a pinpoint NFS load generator tool for performance
>>> development."

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-11-19 Thread Namjae Jeon
2012/10/22, Dave Chinner :
> On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
>> Hi Dave.
>>
>> Test Procedure:
>>
>> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
>>
>> 2) Run WRITE test(create 1 GB file) on NFS Client with default
>> writeback settings on NFS Server. By default
>> bdi->dirty_background_bytes = 0, that means no change in default
>> writeback behaviour
>>
>> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
>> local USB disk write speed on NFS Server)
>> *** only on NFS Server - not on NFS Client ***
>
> Ok, so the results look good, but it's not really addressing what I
> was asking, though.  A typical desktop PC has a disk that can do
> 100MB/s and GbE, so I was expecting a test that showed throughput
> close to GbE maximums at least (ie. around that 100MB/s). I have 3
> year old, low end, low power hardware (atom) that handles twice the
> throughput you are testing here, and most current consumer NAS
> devices are more powerful than this. IOWs, I think the rates you are
> testing at are probably too low even for the consumer NAS market to
> consider relevant...
>
>> --
>> Multiple NFS Client test:
>> ---
>> Sorry - We could not arrange multiple PCs to verify this.
>> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
>> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File
>
> But this really doesn't tell us anything - it's still only 100Mb/s,
> which we'd expect is already getting very close to line rate even
> with low powered client hardware.
>
> What I'm concerned about is the NFS server "sweet spot" - a $10k server
> that exports 20TB of storage and can sustain close to a GB/s of NFS
> traffic over a single 10GbE link with tens to hundreds of clients.
> 100MB/s and 10 clients is about the minimum needed to be able to
> extrapolate a little and make an informed guess of how it will scale
> up
>
>> > 1. what's the comparison in performance to typical NFS
>> > server writeback parameter tuning? i.e. dirty_background_ratio=5,
>> > dirty_ratio=10, dirty_expire_centisecs=1000,
>> > dirty_writeback_centisecs=1? i.e. does this change give any
>> > benefit over the current common practice for configuring NFS
>> > servers?
>>
>> Agreed, the above improvement in write speed can also be achieved by
>> tuning the existing write-back parameters.
>> But changing those settings changes write-back behavior system wide.
>> On the other hand, changing the proposed per-bdi setting,
>> bdi->dirty_background_bytes, only changes write-back behavior for the
>> block device exported by the NFS server.
>
> I already know what the difference between global vs per-bdi tuning
> means.  What I want to know is how your results compare
> *numerically* to just having a tweaked global setting on a vanilla
> kernel.  i.e. is there really any performance benefit to per-bdi
> configuration that cannot be gained by existing methods?
>
>> > 2. what happens when you have 10 clients all writing to the server
>> > at once? Or a 100? NFS servers rarely have a single writer to a
>> > single file at a time, so what impact does this change have on
>> > multiple concurrent file write performance from multiple clients
>>
>> Sorry, we could not arrange more than 2 PCs for verifying this.
>
> Really? Well, perhaps there's some tools that might be useful for
> you here:
>
> http://oss.sgi.com/projects/nfs/testtools/
>
> "Weber
>
> Test load generator for NFS. Uses multiple threads, multiple
> sockets and multiple IP addresses to simulate loads from many
> machines, thus enabling testing of NFS server setups with larger
> client counts than can be tested with physical infrastructure (or
> Virtual Machine clients). Has been useful in automated NFS testing
> and as a pinpoint NFS load generator tool for performance
> development."
>

Hi Dave,
We ran "weber" test on below setup:
1) SATA HDD - Local WRITE speed ~120 MB/s, NFS WRITE speed ~90 MB/s
2) Used 10GbE - network interface to mount NFS

We ran "weber" test with  NFS clients ranging from 1 to 100,
below is the % GAIN in NFS WRITE speed with
bdi->dirty_background_bytes = 100 MB at NFS server
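(For reference, a hedged sketch of how that 100 MB value would be set,
mirroring the 25 MB example elsewhere in the thread; sdb standing in for
the exported disk is an assumption here:)

# echo $((100*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes
# cat /sys/block/sdb/bdi/dirty_background_bytes
104857600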

---------------------------------------------------
| Number of NFS Clients | % GAIN in WRITE Speed   |
|-----------------------|-------------------------|
|                     1 |             19.83 %     |
|                     2 |              2.97 %     |
|                     3 |              2.01 %     |
|                    10 |              0.25 %     |
|                    20 |              0.23 %     |
|                    30 |              0.13 %     |
|                   100 |            - 0.60 %     |
---------------------------------------------------

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-10-21 Thread Dave Chinner
On Fri, Oct 19, 2012 at 04:51:05PM +0900, Namjae Jeon wrote:
> Hi Dave.
> 
> Test Procedure:
> 
> 1) Local USB disk WRITE speed on NFS server is ~25 MB/s
> 
> 2) Run WRITE test(create 1 GB file) on NFS Client with default
> writeback settings on NFS Server. By default
> bdi->dirty_background_bytes = 0, that means no change in default
> writeback behaviour
> 
> 3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
> local USB disk write speed on NFS Server)
> *** only on NFS Server - not on NFS Client ***

Ok, so the results look good, but it's not really addressing what I
was asking, though.  A typical desktop PC has a disk that can do
100MB/s and GbE, so I was expecting a test that showed throughput
close to GbE maximums at least (ie. around that 100MB/s). I have 3
year old, low end, low power hardware (atom) that handles twice the
throughput you are testing here, and most current consumer NAS
devices are more powerful than this. IOWs, I think the rates you are
testing at are probably too low even for the consumer NAS market to
consider relevant...

> --
> Multiple NFS Client test:
> ---
> Sorry - We could not arrange multiple PCs to verify this.
> So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
> ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File

But this really doesn't tell us anything - it's still only 100Mb/s,
which we'd expect is already getting very close to line rate even
with low powered client hardware.

What I'm concerned about is the NFS server "sweet spot" - a $10k server
that exports 20TB of storage and can sustain close to a GB/s of NFS
traffic over a single 10GbE link with tens to hundreds of clients.
100MB/s and 10 clients is about the minimum needed to be able to
extrapolate a little and make an informed guess of how it will scale
up

> > 1. what's the comparison in performance to typical NFS
> > server writeback parameter tuning? i.e. dirty_background_ratio=5,
> > dirty_ratio=10, dirty_expire_centisecs=1000,
> > dirty_writeback_centisecs=1? i.e. does this change give any
> > benefit over the current common practice for configuring NFS
> > servers?
> 
> Agreed, the above improvement in write speed can also be achieved by
> tuning the existing write-back parameters.
> But changing those settings changes write-back behavior system wide.
> On the other hand, changing the proposed per-bdi setting,
> bdi->dirty_background_bytes, only changes write-back behavior for the
> block device exported by the NFS server.

I already know what the difference between global vs per-bdi tuning
means.  What I want to know is how your results compare
*numerically* to just having a tweaked global setting on a vanilla
kernel.  i.e. is there really any performance benefit to per-bdi
configuration that cannot be gained by existing methods?
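(For readers comparing the two approaches: a minimal sketch, assuming sdb
is the disk exported by the NFS server; the sysctl values are simply the
ones quoted in the question above, not recommendations.)

Global write-back tuning - the current common practice, affects every device:

# sysctl -w vm.dirty_background_ratio=5
# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_expire_centisecs=1000
# sysctl -w vm.dirty_writeback_centisecs=1

Proposed per-bdi threshold - affects only the exported block device:

# echo $((25*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes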

> > 2. what happens when you have 10 clients all writing to the server
> > at once? Or a 100? NFS servers rarely have a single writer to a
> > single file at a time, so what impact does this change have on
> > multiple concurrent file write performance from multiple clients
> 
> Sorry, we could not arrange more than 2 PCs for verifying this.

Really? Well, perhaps there's some tools that might be useful for
you here:

http://oss.sgi.com/projects/nfs/testtools/

"Weber

Test load generator for NFS. Uses multiple threads, multiple
sockets and multiple IP addresses to simulate loads from many
machines, thus enabling testing of NFS server setups with larger
client counts than can be tested with physical infrastructure (or
Virtual Machine clients). Has been useful in automated NFS testing
and as a pinpoint NFS load generator tool for performance
development."

> > 3. Following on from the multiple client test, what difference does it
> > make to file fragmentation rates? Writing more frequently means
> > smaller allocations and writes, and that tends to lead to higher
> > fragmentation rates, especially when multiple files are being
> > written concurrently. Higher fragmentation also means lower
> > performance over time as fragmentation accelerates filesystem aging
> > effects on performance.  IOWs, it may be faster when new, but it
> > will be slower 3 months down the track and that's a bad tradeoff to
> > make.
> 
> We agree that there could be a bit more fragmentation. But as you know,
> we are not changing writeback settings at NFS clients.
> So, write-back behavior on NFS client will not change - IO requests
> will be buffered at NFS client as per existing write-back behavior.

I think you misunderstand - writeback settings on the server greatly
impact the way the server writes data and therefore the way files
are fragmented. It has nothing to do with client side tuning.

Effectively, what you are presenting is best case numbers - empty
filesystem, single client, streaming write, no fragmentation, no
allocation contention, no competing IO ...

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-10-19 Thread Namjae Jeon
Hi Dave.

Test Procedure:

1) Local USB disk WRITE speed on NFS server is ~25 MB/s

2) Run WRITE test (create 1 GB file) on NFS Client with default
writeback settings on NFS Server. By default
bdi->dirty_background_bytes = 0, that means no change in default
writeback behaviour

3) Next we change bdi->dirty_background_bytes = 25 MB (almost equal to
local USB disk write speed on NFS Server)
*** only on NFS Server - not on NFS Client ***

[NFS Server]
# echo $((25*1024*1024)) > /sys/block/sdb/bdi/dirty_background_bytes
# cat /sys/block/sdb/bdi/dirty_background_bytes
26214400

4) Run WRITE test again on NFS client to see change in WRITE speed at NFS client
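(Between runs the server can be returned to its default behaviour; as
noted in step 2, bdi->dirty_background_bytes = 0 means no change from the
default write-back behaviour. Assuming the same sdb device as above:)

# echo 0 > /sys/block/sdb/bdi/dirty_background_bytes
# cat /sys/block/sdb/bdi/dirty_background_bytes
0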

Test setup details:
Test result on PC - FC16 - RAM 3 GB - ethernet - 1000 Mbits/s,
Create 1 GB File


Table 1: XFS over NFS - WRITE SPEED on NFS Client

RecSize      default writeback      bdi->dirty_background_bytes     % Change
             write speed (MB/s)     = 25 MB, write speed (MB/s)
-----------------------------------------------------------------------------
10485760          27.39                     28.53                      4%
 1048576          27.9                      28.59                      2%
  524288          27.55                     28.94                      5%
  262144          25.4                      28.58                     13%
  131072          25.73                     27.55                      7%
   65536          25.85                     28.45                     10%
   32768          26.13                     28.64                     10%
   16384          26.17                     27.93                      7%
    8192          25.64                     28.07                      9%
    4096          26.28                     28.19                      7%

--
Table 2: EXT4 over NFS - WRITE SPEED on NFS Client
--
RecSize      default writeback      bdi->dirty_background_bytes     % Change
             write speed (MB/s)     = 25 MB, write speed (MB/s)
-----------------------------------------------------------------------------
10485760          23.87                     28.3                      19%
 1048576          24.81                     27.79                     12%
  524288          24.53                     28.14                     15%
  262144          24.21                     27.99                     16%
  131072          24.11                     28.33                     18%
   65536          23.73                     28.21                     19%
   32768          25.66                     27.52                      7%
   16384          24.3                      27.67                     14%
    8192          23.6                      27.08                     15%
    4096          23.35                     27.24                     17%

As shown in Tables 1 & 2 above, there is a performance improvement on
the NFS client over gigabit Ethernet for both XFS and EXT4 over NFS,
and we did not observe any degradation in write speed. However, the
size of the gain differs between the two file systems.

We also tried this change on BTRFS over NFS, but we did not see any
significant change in WRITE speed.

--
Multiple NFS Client test:
---
Sorry - We could not arrange multiple PCs to verify this.
So, we tried 1 NFS Server + 2 NFS Clients using 3 target boards:
ARM Target + 512 MB RAM + ethernet - 100 Mbits/s, create 1 GB File

-
Table 3: bdi->dirty_background_bytes = 0 MB
 - default writeback behaviour
-
RecSize      Write Speed       Write Speed       Combined
             on Client 1       on Client 2       write speed
             (MB/s)            (MB/s)            (MB/s)
--------------------------------------------------------------
10485760         5.45              5.36              10.81
 1048576         5.44              5.34              10.78
  524288         5.48              5.51              10.99
  262144         6.24              4.83              11.07
  131072         5.58              5.53              11.11
   65536         5.51              5.48              10.99
   32768         5.42              5.46              10.88
   16384         5.62              5.58              11.2
    8192         5.59              5.49              11.08
    4096         5.57              6.38

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-27 Thread Namjae Jeon
2012/9/27, Jan Kara :
> On Thu 27-09-12 15:00:18, Namjae Jeon wrote:
>> 2012/9/27, Jan Kara :
>> > On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
>> >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
>> >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
>> >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> >> > > > From: Namjae Jeon 
>> >> > > >
>> >> > > > This patch is based on suggestion by Wu Fengguang:
>> >> > > > https://lkml.org/lkml/2011/8/19/19
>> >> > > >
>> >> > > > kernel has mechanism to do writeback as per dirty_ratio and
>> >> > > > dirty_background
>> >> > > > ratio. It also maintains per task dirty rate limit to keep
>> >> > > > balance
>> >> > > > of
>> >> > > > dirty pages at any given instance by doing bdi bandwidth
>> >> > > > estimation.
>> >> > > >
>> >> > > > Kernel also has max_ratio/min_ratio tunables to specify
>> >> > > > percentage
>> >> > > > of
>> >> > > > writecache to control per bdi dirty limits and task throttling.
>> >> > > >
>> >> > > > However, there might be a usecase where user wants a per bdi
>> >> > > > writeback tuning
>> >> > > > parameter to flush dirty data once per bdi dirty data reach a
>> >> > > > threshold
>> >> > > > especially at NFS server.
>> >> > > >
>> >> > > > dirty_background_centisecs provides an interface where user can
>> >> > > > tune
>> >> > > > background writeback start threshold using
>> >> > > > /sys/block/sda/bdi/dirty_background_centisecs
>> >> > > >
>> >> > > > dirty_background_centisecs is used alongwith average bdi write
>> >> > > > bandwidth
>> >> > > > estimation to start background writeback.
>> >> >   The functionality you describe, i.e. start flushing bdi when
>> >> > there's
>> >> > reasonable amount of dirty data on it, looks sensible and useful.
>> >> > However
>> >> > I'm not so sure whether the interface you propose is the right one.
>> >> > Traditionally, we allow user to set amount of dirty data (either in
>> >> > bytes
>> >> > or percentage of memory) when background writeback should start. You
>> >> > propose setting the amount of data in centisecs-to-write. Why that
>> >> > difference? Also this interface ties our throughput estimation code
>> >> > (which
>> >> > is an implementation detail of current dirty throttling) with the
>> >> > userspace
>> >> > API. So we'd have to maintain the estimation code forever, possibly
>> >> > also
>> >> > face problems when we change the estimation code (and thus estimates
>> >> > in
>> >> > some cases) and users will complain that the values they set
>> >> > originally
>> >> > no
>> >> > longer work as they used to.
>> >>
>> >> Yes, that bandwidth estimation is not all that (and in theory cannot
>> >> be made) reliable which may be a surprise to the user. Which make the
>> >> interface flaky.
>> >>
>> >> > Also, as with each knob, there's a problem how to properly set its
>> >> > value?
>> >> > Most admins won't know about the knob and so won't touch it. Others
>> >> > might
>> >> > know about the knob but will have hard time figuring out what value
>> >> > should
>> >> > they set. So if there's a new knob, it should have a sensible
>> >> > initial
>> >> > value. And since this feature looks like a useful one, it shouldn't
>> >> > be
>> >> > zero.
>> >>
>> >> Agreed in principle. There seems be no reasonable defaults for the
>> >> centisecs-to-write interface, mainly due to its inaccurate nature,
>> >> especially the initial value may be wildly wrong on fresh system
>> >> bootup. This is also true for your proposed interfaces, see below.
>> >>
>> >> > So my personal preference would be to have
>> >> > bdi->dirty_background_ratio
>> >> > and
>> >> > bdi->dirty_background_bytes and start background writeback whenever
>> >> > one of global background limit and per-bdi background limit is
>> >> > exceeded.
>> >> > I
>> >> > think this interface will do the job as well and it's easier to
>> >> > maintain
>> >> > in
>> >> > future.
>> >>
>> >> bdi->dirty_background_ratio, if I understand its semantics right, is
>> >> unfortunately flaky in the same principle as centisecs-to-write,
>> >> because it relies on the (implicitly estimation of) writeout
>> >> proportions. The writeout proportions for each bdi starts with 0,
>> >> which is even worse than the 100MB/s initial value for
>> >> bdi->write_bandwidth and will trigger background writeback on the
>> >> first write.
>> >   Well, I meant bdi->dirty_backround_ratio wouldn't use writeout
>> > proportion
>> > estimates at all. Limit would be
>> >   dirtiable_memory * bdi->dirty_backround_ratio.
>> >
>> > After all we want to start writeout to bdi when we have enough pages to
>> > reasonably load the device for a while which has nothing to do with how
>> > much is written to this device as compared to other devices.
>> >
>> > OTOH I'm not particularly attached to this interface. Especially since
>> > on a lot of today's machines, 1% is rather big so people might often
>> > end up using dirty_background_bytes anyway.

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-27 Thread Jan Kara
On Thu 27-09-12 15:00:18, Namjae Jeon wrote:
> 2012/9/27, Jan Kara :
> > On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
> >> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> >> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> >> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> >> > > > From: Namjae Jeon 
> >> > > >
> >> > > > This patch is based on suggestion by Wu Fengguang:
> >> > > > https://lkml.org/lkml/2011/8/19/19
> >> > > >
> >> > > > kernel has mechanism to do writeback as per dirty_ratio and
> >> > > > dirty_background
> >> > > > ratio. It also maintains per task dirty rate limit to keep balance
> >> > > > of
> >> > > > dirty pages at any given instance by doing bdi bandwidth
> >> > > > estimation.
> >> > > >
> >> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage
> >> > > > of
> >> > > > writecache to control per bdi dirty limits and task throttling.
> >> > > >
> >> > > > However, there might be a usecase where user wants a per bdi
> >> > > > writeback tuning
> >> > > > parameter to flush dirty data once per bdi dirty data reach a
> >> > > > threshold
> >> > > > especially at NFS server.
> >> > > >
> >> > > > dirty_background_centisecs provides an interface where user can
> >> > > > tune
> >> > > > background writeback start threshold using
> >> > > > /sys/block/sda/bdi/dirty_background_centisecs
> >> > > >
> >> > > > dirty_background_centisecs is used alongwith average bdi write
> >> > > > bandwidth
> >> > > > estimation to start background writeback.
> >> >   The functionality you describe, i.e. start flushing bdi when there's
> >> > reasonable amount of dirty data on it, looks sensible and useful.
> >> > However
> >> > I'm not so sure whether the interface you propose is the right one.
> >> > Traditionally, we allow user to set amount of dirty data (either in
> >> > bytes
> >> > or percentage of memory) when background writeback should start. You
> >> > propose setting the amount of data in centisecs-to-write. Why that
> >> > difference? Also this interface ties our throughput estimation code
> >> > (which
> >> > is an implementation detail of current dirty throttling) with the
> >> > userspace
> >> > API. So we'd have to maintain the estimation code forever, possibly
> >> > also
> >> > face problems when we change the estimation code (and thus estimates in
> >> > some cases) and users will complain that the values they set originally
> >> > no
> >> > longer work as they used to.
> >>
> >> Yes, that bandwidth estimation is not all that (and in theory cannot
> >> be made) reliable which may be a surprise to the user. Which make the
> >> interface flaky.
> >>
> >> > Also, as with each knob, there's a problem how to properly set its
> >> > value?
> >> > Most admins won't know about the knob and so won't touch it. Others
> >> > might
> >> > know about the knob but will have hard time figuring out what value
> >> > should
> >> > they set. So if there's a new knob, it should have a sensible initial
> >> > value. And since this feature looks like a useful one, it shouldn't be
> >> > zero.
> >>
> >> Agreed in principle. There seems be no reasonable defaults for the
> >> centisecs-to-write interface, mainly due to its inaccurate nature,
> >> especially the initial value may be wildly wrong on fresh system
> >> bootup. This is also true for your proposed interfaces, see below.
> >>
> >> > So my personal preference would be to have bdi->dirty_background_ratio
> >> > and
> >> > bdi->dirty_background_bytes and start background writeback whenever
> >> > one of global background limit and per-bdi background limit is exceeded.
> >> > I
> >> > think this interface will do the job as well and it's easier to maintain
> >> > in
> >> > future.
> >>
> >> bdi->dirty_background_ratio, if I understand its semantics right, is
> >> unfortunately flaky in the same principle as centisecs-to-write,
> >> because it relies on the (implicitly estimation of) writeout
> >> proportions. The writeout proportions for each bdi starts with 0,
> >> which is even worse than the 100MB/s initial value for
> >> bdi->write_bandwidth and will trigger background writeback on the
> >> first write.
> >   Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion
> > estimates at all. Limit would be
> >   dirtiable_memory * bdi->dirty_backround_ratio.
> >
> > After all we want to start writeout to bdi when we have enough pages to
> > reasonably load the device for a while which has nothing to do with how
> > much is written to this device as compared to other devices.
> >
> > OTOH I'm not particularly attached to this interface. Especially since on a
> > lot of today's machines, 1% is rather big so people might often end up
> > using dirty_background_bytes anyway.
> >
> >> bdi->dirty_background_bytes is, however, reliable, and gives users
> >> total control. If we export this interface alone, I'd imagine users
> >> who want to control centisecs-to-write could run a simple script to
> >> periodically get the write bandwidth value out of the existing bdi
> >> interface and echo it into bdi->dirty_background_bytes. Which makes
> >> simple yet good enough centisecs-to-write controlling.

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-27 Thread Namjae Jeon
2012/9/27, Jan Kara :
> On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
>> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
>> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
>> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > > > From: Namjae Jeon 
>> > > >
>> > > > This patch is based on suggestion by Wu Fengguang:
>> > > > https://lkml.org/lkml/2011/8/19/19
>> > > >
>> > > > kernel has mechanism to do writeback as per dirty_ratio and
>> > > > dirty_background
>> > > > ratio. It also maintains per task dirty rate limit to keep balance
>> > > > of
>> > > > dirty pages at any given instance by doing bdi bandwidth
>> > > > estimation.
>> > > >
>> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage
>> > > > of
>> > > > writecache to control per bdi dirty limits and task throttling.
>> > > >
>> > > > However, there might be a usecase where user wants a per bdi
>> > > > writeback tuning
>> > > > parameter to flush dirty data once per bdi dirty data reach a
>> > > > threshold
>> > > > especially at NFS server.
>> > > >
>> > > > dirty_background_centisecs provides an interface where user can
>> > > > tune
>> > > > background writeback start threshold using
>> > > > /sys/block/sda/bdi/dirty_background_centisecs
>> > > >
>> > > > dirty_background_centisecs is used alongwith average bdi write
>> > > > bandwidth
>> > > > estimation to start background writeback.
>> >   The functionality you describe, i.e. start flushing bdi when there's
>> > reasonable amount of dirty data on it, looks sensible and useful.
>> > However
>> > I'm not so sure whether the interface you propose is the right one.
>> > Traditionally, we allow user to set amount of dirty data (either in
>> > bytes
>> > or percentage of memory) when background writeback should start. You
>> > propose setting the amount of data in centisecs-to-write. Why that
>> > difference? Also this interface ties our throughput estimation code
>> > (which
>> > is an implementation detail of current dirty throttling) with the
>> > userspace
>> > API. So we'd have to maintain the estimation code forever, possibly
>> > also
>> > face problems when we change the estimation code (and thus estimates in
>> > some cases) and users will complain that the values they set originally
>> > no
>> > longer work as they used to.
>>
>> Yes, that bandwidth estimation is not all that (and in theory cannot
>> be made) reliable which may be a surprise to the user. Which make the
>> interface flaky.
>>
>> > Also, as with each knob, there's a problem how to properly set its
>> > value?
>> > Most admins won't know about the knob and so won't touch it. Others
>> > might
>> > know about the knob but will have hard time figuring out what value
>> > should
>> > they set. So if there's a new knob, it should have a sensible initial
>> > value. And since this feature looks like a useful one, it shouldn't be
>> > zero.
>>
>> Agreed in principle. There seems be no reasonable defaults for the
>> centisecs-to-write interface, mainly due to its inaccurate nature,
>> especially the initial value may be wildly wrong on fresh system
>> bootup. This is also true for your proposed interfaces, see below.
>>
>> > So my personal preference would be to have bdi->dirty_background_ratio
>> > and
>> > bdi->dirty_background_bytes and start background writeback whenever
>> > one of global background limit and per-bdi background limit is exceeded.
>> > I
>> > think this interface will do the job as well and it's easier to maintain
>> > in
>> > future.
>>
>> bdi->dirty_background_ratio, if I understand its semantics right, is
>> unfortunately flaky in the same principle as centisecs-to-write,
>> because it relies on the (implicitly estimation of) writeout
>> proportions. The writeout proportions for each bdi starts with 0,
>> which is even worse than the 100MB/s initial value for
>> bdi->write_bandwidth and will trigger background writeback on the
>> first write.
>   Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion
> estimates at all. Limit would be
>   dirtiable_memory * bdi->dirty_backround_ratio.
>
> After all we want to start writeout to bdi when we have enough pages to
> reasonably load the device for a while which has nothing to do with how
> much is written to this device as compared to other devices.
>
> OTOH I'm not particularly attached to this interface. Especially since on a
> lot of today's machines, 1% is rather big so people might often end up
> using dirty_background_bytes anyway.
>
>> bdi->dirty_background_bytes is, however, reliable, and gives users
>> total control. If we export this interface alone, I'd imagine users
>> who want to control centisecs-to-write could run a simple script to
>> periodically get the write bandwidth value out of the existing bdi
>> interface and echo it into bdi->dirty_background_bytes. Which makes
>> simple yet good enough centisecs-to-write controlling.
>>
>> So what do you think about exporting a really dumb
>> bdi->dirty_background_bytes, which will effectively give smart users
>> the freedom to do smart control over per-bdi background writeback
>> threshold? The users are offered the freedom to do their own bandwidth
>> estimation and choose not to rely on the kernel estimation, which will
>> free us from the ...
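(As a concrete illustration of the "simple script" idea quoted above - not
part of the patch - here is a rough sketch that periodically reads the
kernel's write bandwidth estimate and converts a centisecs-to-write target
into bdi->dirty_background_bytes. The bdi id 8:16, the sdb device name, the
5 second poll interval and the reliance on the BdiWriteBandwidth line in
the bdi debugfs stats file are all assumptions for the example:)

#!/bin/sh
# Sketch only: keep dirty_background_bytes at roughly CENTISECS worth of
# the bdi's estimated write bandwidth.
BDI=8:16         # assumed bdi id of the exported disk
DEV=sdb          # assumed device name for the sysfs knob
CENTISECS=100    # target: about one second worth of dirty data
while sleep 5; do
        # BdiWriteBandwidth is reported in kBps in the bdi debugfs stats
        bw_kbps=$(awk '/BdiWriteBandwidth/ {print $2}' \
                      /sys/kernel/debug/bdi/$BDI/stats)
        [ -n "$bw_kbps" ] || continue
        echo $((bw_kbps * 1024 * CENTISECS / 100)) \
                > /sys/block/$DEV/bdi/dirty_background_bytes
done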

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-27 Thread Namjae Jeon
2012/9/27, Jan Kara j...@suse.cz:
 On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
 On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
  On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
   On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
From: Namjae Jeon namjae.j...@samsung.com
   
This patch is based on suggestion by Wu Fengguang:
https://lkml.org/lkml/2011/8/19/19
   
kernel has mechanism to do writeback as per dirty_ratio and
dirty_background
ratio. It also maintains per task dirty rate limit to keep balance
of
dirty pages at any given instance by doing bdi bandwidth
estimation.
   
Kernel also has max_ratio/min_ratio tunables to specify percentage
of
writecache to control per bdi dirty limits and task throttling.
   
However, there might be a usecase where user wants a per bdi
writeback tuning
parameter to flush dirty data once per bdi dirty data reach a
threshold
especially at NFS server.
   
dirty_background_centisecs provides an interface where user can
tune
background writeback start threshold using
/sys/block/sda/bdi/dirty_background_centisecs
   
dirty_background_centisecs is used alongwith average bdi write
bandwidth
estimation to start background writeback.
The functionality you describe, i.e. start flushing bdi when there's
  reasonable amount of dirty data on it, looks sensible and useful.
  However
  I'm not so sure whether the interface you propose is the right one.
  Traditionally, we allow user to set amount of dirty data (either in
  bytes
  or percentage of memory) when background writeback should start. You
  propose setting the amount of data in centisecs-to-write. Why that
  difference? Also this interface ties our throughput estimation code
  (which
  is an implementation detail of current dirty throttling) with the
  userspace
  API. So we'd have to maintain the estimation code forever, possibly
  also
  face problems when we change the estimation code (and thus estimates in
  some cases) and users will complain that the values they set originally
  no
  longer work as they used to.

 Yes, that bandwidth estimation is not all that (and in theory cannot
 be made) reliable which may be a surprise to the user. Which make the
 interface flaky.

  Also, as with each knob, there's a problem how to properly set its
  value?
  Most admins won't know about the knob and so won't touch it. Others
  might
  know about the knob but will have hard time figuring out what value
  should
  they set. So if there's a new knob, it should have a sensible initial
  value. And since this feature looks like a useful one, it shouldn't be
  zero.

 Agreed in principle. There seems be no reasonable defaults for the
 centisecs-to-write interface, mainly due to its inaccurate nature,
 especially the initial value may be wildly wrong on fresh system
 bootup. This is also true for your proposed interfaces, see below.

  So my personal preference would be to have bdi-dirty_background_ratio
  and
  bdi-dirty_background_bytes and start background writeback whenever
  one of global background limit and per-bdi background limit is exceeded.
  I
  think this interface will do the job as well and it's easier to maintain
  in
  future.

 bdi-dirty_background_ratio, if I understand its semantics right, is
 unfortunately flaky in the same principle as centisecs-to-write,
 because it relies on the (implicitly estimation of) writeout
 proportions. The writeout proportions for each bdi starts with 0,
 which is even worse than the 100MB/s initial value for
 bdi-write_bandwidth and will trigger background writeback on the
 first write.
   Well, I meant bdi-dirty_backround_ratio wouldn't use writeout proportion
 estimates at all. Limit would be
   dirtiable_memory * bdi-dirty_backround_ratio.

 After all we want to start writeout to bdi when we have enough pages to
 reasonably load the device for a while which has nothing to do with how
 much is written to this device as compared to other devices.

 OTOH I'm not particularly attached to this interface. Especially since on a
 lot of today's machines, 1% is rather big so people might often end up
 using dirty_background_bytes anyway.

 bdi-dirty_background_bytes is, however, reliable, and gives users
 total control. If we export this interface alone, I'd imagine users
 who want to control centisecs-to-write could run a simple script to
 periodically get the write bandwith value out of the existing bdi
 interface and echo it into bdi-dirty_background_bytes. Which makes
 simple yet good enough centisecs-to-write controlling.

 So what do you think about exporting a really dumb
 bdi-dirty_background_bytes, which will effectively give smart users
 the freedom to do smart control over per-bdi background writeback
 threshold? The users are offered the freedom to do his own bandwidth
 estimation and choose not to rely on the kernel estimation, which will
 free us from the 

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-27 Thread Jan Kara
On Thu 27-09-12 15:00:18, Namjae Jeon wrote:
 2012/9/27, Jan Kara j...@suse.cz:
  On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
  On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
   On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
 From: Namjae Jeon namjae.j...@samsung.com

 This patch is based on suggestion by Wu Fengguang:
 https://lkml.org/lkml/2011/8/19/19

 kernel has mechanism to do writeback as per dirty_ratio and
 dirty_background
 ratio. It also maintains per task dirty rate limit to keep balance
 of
 dirty pages at any given instance by doing bdi bandwidth
 estimation.

 Kernel also has max_ratio/min_ratio tunables to specify percentage
 of
 writecache to control per bdi dirty limits and task throttling.

 However, there might be a usecase where user wants a per bdi
 writeback tuning
 parameter to flush dirty data once per bdi dirty data reach a
 threshold
 especially at NFS server.

 dirty_background_centisecs provides an interface where user can
 tune
 background writeback start threshold using
 /sys/block/sda/bdi/dirty_background_centisecs

 dirty_background_centisecs is used alongwith average bdi write
 bandwidth
 estimation to start background writeback.
 The functionality you describe, i.e. start flushing bdi when there's
   reasonable amount of dirty data on it, looks sensible and useful.
   However
   I'm not so sure whether the interface you propose is the right one.
   Traditionally, we allow user to set amount of dirty data (either in
   bytes
   or percentage of memory) when background writeback should start. You
   propose setting the amount of data in centisecs-to-write. Why that
   difference? Also this interface ties our throughput estimation code
   (which
   is an implementation detail of current dirty throttling) with the
   userspace
   API. So we'd have to maintain the estimation code forever, possibly
   also
   face problems when we change the estimation code (and thus estimates in
   some cases) and users will complain that the values they set originally
   no
   longer work as they used to.
 
  Yes, that bandwidth estimation is not all that (and in theory cannot
  be made) reliable which may be a surprise to the user. Which make the
  interface flaky.
 
   Also, as with each knob, there's a problem how to properly set its
   value?
   Most admins won't know about the knob and so won't touch it. Others
   might
   know about the knob but will have hard time figuring out what value
   should
   they set. So if there's a new knob, it should have a sensible initial
   value. And since this feature looks like a useful one, it shouldn't be
   zero.
 
  Agreed in principle. There seems be no reasonable defaults for the
  centisecs-to-write interface, mainly due to its inaccurate nature,
  especially the initial value may be wildly wrong on fresh system
  bootup. This is also true for your proposed interfaces, see below.
 
   So my personal preference would be to have bdi->dirty_background_ratio
   and
   bdi->dirty_background_bytes and start background writeback whenever
   one of global background limit and per-bdi background limit is exceeded.
   I
   think this interface will do the job as well and it's easier to maintain
   in
   future.
 
  bdi->dirty_background_ratio, if I understand its semantics right, is
  unfortunately flaky in the same principle as centisecs-to-write,
  because it relies on the (implicitly estimation of) writeout
  proportions. The writeout proportions for each bdi starts with 0,
  which is even worse than the 100MB/s initial value for
  bdi->write_bandwidth and will trigger background writeback on the
  first write.
Well, I meant bdi->dirty_background_ratio wouldn't use writeout proportion
  estimates at all. Limit would be
dirtiable_memory * bdi->dirty_background_ratio.
 
  After all we want to start writeout to bdi when we have enough pages to
  reasonably load the device for a while which has nothing to do with how
  much is written to this device as compared to other devices.
 
  OTOH I'm not particularly attached to this interface. Especially since on a
  lot of today's machines, 1% is rather big so people might often end up
  using dirty_background_bytes anyway.
 
  bdi->dirty_background_bytes is, however, reliable, and gives users
  total control. If we export this interface alone, I'd imagine users
  who want to control centisecs-to-write could run a simple script to
  periodically get the write bandwidth value out of the existing bdi
  interface and echo it into bdi->dirty_background_bytes. Which makes
  simple yet good enough centisecs-to-write controlling.
 
  So what do you think about exporting a really dumb
  bdi->dirty_background_bytes, which will effectively give smart users
  the freedom to do smart control over per-bdi background writeback
  

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-27 Thread Namjae Jeon
2012/9/27, Jan Kara j...@suse.cz:
 On Thu 27-09-12 15:00:18, Namjae Jeon wrote:
 2012/9/27, Jan Kara j...@suse.cz:
  On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
  On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
   On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
 From: Namjae Jeon namjae.j...@samsung.com

 This patch is based on suggestion by Wu Fengguang:
 https://lkml.org/lkml/2011/8/19/19

 kernel has mechanism to do writeback as per dirty_ratio and
 dirty_background
 ratio. It also maintains per task dirty rate limit to keep
 balance
 of
 dirty pages at any given instance by doing bdi bandwidth
 estimation.

 Kernel also has max_ratio/min_ratio tunables to specify
 percentage
 of
 writecache to control per bdi dirty limits and task throttling.

 However, there might be a usecase where user wants a per bdi
 writeback tuning
 parameter to flush dirty data once per bdi dirty data reach a
 threshold
 especially at NFS server.

 dirty_background_centisecs provides an interface where user can
 tune
 background writeback start threshold using
 /sys/block/sda/bdi/dirty_background_centisecs

 dirty_background_centisecs is used alongwith average bdi write
 bandwidth
 estimation to start background writeback.
 The functionality you describe, i.e. start flushing bdi when
   there's
   reasonable amount of dirty data on it, looks sensible and useful.
   However
   I'm not so sure whether the interface you propose is the right one.
   Traditionally, we allow user to set amount of dirty data (either in
   bytes
   or percentage of memory) when background writeback should start. You
   propose setting the amount of data in centisecs-to-write. Why that
   difference? Also this interface ties our throughput estimation code
   (which
   is an implementation detail of current dirty throttling) with the
   userspace
   API. So we'd have to maintain the estimation code forever, possibly
   also
   face problems when we change the estimation code (and thus estimates
   in
   some cases) and users will complain that the values they set
   originally
   no
   longer work as they used to.
 
  Yes, that bandwidth estimation is not all that (and in theory cannot
  be made) reliable which may be a surprise to the user. Which make the
  interface flaky.
 
   Also, as with each knob, there's a problem how to properly set its
   value?
   Most admins won't know about the knob and so won't touch it. Others
   might
   know about the knob but will have hard time figuring out what value
   should
   they set. So if there's a new knob, it should have a sensible
   initial
   value. And since this feature looks like a useful one, it shouldn't
   be
   zero.
 
  Agreed in principle. There seems be no reasonable defaults for the
  centisecs-to-write interface, mainly due to its inaccurate nature,
  especially the initial value may be wildly wrong on fresh system
  bootup. This is also true for your proposed interfaces, see below.
 
   So my personal preference would be to have
   bdi->dirty_background_ratio
   and
   bdi->dirty_background_bytes and start background writeback whenever
   one of global background limit and per-bdi background limit is
   exceeded.
   I
   think this interface will do the job as well and it's easier to
   maintain
   in
   future.
 
  bdi->dirty_background_ratio, if I understand its semantics right, is
  unfortunately flaky in the same principle as centisecs-to-write,
  because it relies on the (implicitly estimation of) writeout
  proportions. The writeout proportions for each bdi starts with 0,
  which is even worse than the 100MB/s initial value for
  bdi->write_bandwidth and will trigger background writeback on the
  first write.
Well, I meant bdi->dirty_background_ratio wouldn't use writeout
  proportion
  estimates at all. Limit would be
dirtiable_memory * bdi->dirty_background_ratio.
 
  After all we want to start writeout to bdi when we have enough pages to
  reasonably load the device for a while which has nothing to do with how
  much is written to this device as compared to other devices.
 
  OTOH I'm not particularly attached to this interface. Especially since
  on a
  lot of today's machines, 1% is rather big so people might often end up
  using dirty_background_bytes anyway.
 
  bdi->dirty_background_bytes is, however, reliable, and gives users
  total control. If we export this interface alone, I'd imagine users
  who want to control centisecs-to-write could run a simple script to
  periodically get the write bandwidth value out of the existing bdi
  interface and echo it into bdi->dirty_background_bytes. Which makes
  simple yet good enough centisecs-to-write controlling.
 
  So what do you think about exporting a really dumb
  bdi->dirty_background_bytes, which will effectively give smart users

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-26 Thread Jan Kara
On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > > > From: Namjae Jeon 
> > > > 
> > > > This patch is based on suggestion by Wu Fengguang:
> > > > https://lkml.org/lkml/2011/8/19/19
> > > > 
> > > > kernel has mechanism to do writeback as per dirty_ratio and 
> > > > dirty_background
> > > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > > 
> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > > > writecache to control per bdi dirty limits and task throttling.
> > > > 
> > > > However, there might be a usecase where user wants a per bdi writeback 
> > > > tuning
> > > > parameter to flush dirty data once per bdi dirty data reach a threshold
> > > > especially at NFS server.
> > > > 
> > > > dirty_background_centisecs provides an interface where user can tune
> > > > background writeback start threshold using
> > > > /sys/block/sda/bdi/dirty_background_centisecs
> > > > 
> > > > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > > > estimation to start background writeback.
> >   The functionality you describe, i.e. start flushing bdi when there's
> > reasonable amount of dirty data on it, looks sensible and useful. However
> > I'm not so sure whether the interface you propose is the right one.
> > Traditionally, we allow user to set amount of dirty data (either in bytes
> > or percentage of memory) when background writeback should start. You
> > propose setting the amount of data in centisecs-to-write. Why that
> > difference? Also this interface ties our throughput estimation code (which
> > is an implementation detail of current dirty throttling) with the userspace
> > API. So we'd have to maintain the estimation code forever, possibly also
> > face problems when we change the estimation code (and thus estimates in
> > some cases) and users will complain that the values they set originally no
> > longer work as they used to.
> 
> Yes, that bandwidth estimation is not all that (and in theory cannot
> be made) reliable which may be a surprise to the user. Which make the
> interface flaky.
> 
> > Also, as with each knob, there's a problem how to properly set its value?
> > Most admins won't know about the knob and so won't touch it. Others might
> > know about the knob but will have hard time figuring out what value should
> > they set. So if there's a new knob, it should have a sensible initial
> > value. And since this feature looks like a useful one, it shouldn't be
> > zero.
> 
> Agreed in principle. There seems be no reasonable defaults for the
> centisecs-to-write interface, mainly due to its inaccurate nature,
> especially the initial value may be wildly wrong on fresh system
> bootup. This is also true for your proposed interfaces, see below.
> 
> > So my personal preference would be to have bdi->dirty_background_ratio and
> > bdi->dirty_background_bytes and start background writeback whenever
> > one of global background limit and per-bdi background limit is exceeded. I
> > think this interface will do the job as well and it's easier to maintain in
> > future.
> 
> bdi->dirty_background_ratio, if I understand its semantics right, is
> unfortunately flaky in the same principle as centisecs-to-write,
> because it relies on the (implicitly estimation of) writeout
> proportions. The writeout proportions for each bdi starts with 0,
> which is even worse than the 100MB/s initial value for
> bdi->write_bandwidth and will trigger background writeback on the
> first write.
  Well, I meant bdi->dirty_background_ratio wouldn't use writeout proportion
estimates at all. Limit would be
  dirtiable_memory * bdi->dirty_background_ratio.

After all we want to start writeout to bdi when we have enough pages to
reasonably load the device for a while which has nothing to do with how
much is written to this device as compared to other devices.
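
(A rough sketch, for illustration only, of the check being proposed here: start
background writeback when either the global background limit or a per-bdi limit
is exceeded, with the per-bdi ratio limit computed directly from dirtiable
memory and no writeout-proportion estimate involved. Names are illustrative,
not the eventual kernel implementation.)

def over_background_thresh(global_dirty, global_bg_thresh,
                           bdi_dirty, dirtiable_memory,
                           bdi_dirty_background_bytes=0,
                           bdi_dirty_background_ratio=0):
    # Global background limit, as today.
    if global_dirty > global_bg_thresh:
        return True
    # Per-bdi byte limit (0 = unset).
    if bdi_dirty_background_bytes and bdi_dirty > bdi_dirty_background_bytes:
        return True
    # Per-bdi ratio limit, a plain share of dirtiable memory.
    if bdi_dirty_background_ratio:
        if bdi_dirty > dirtiable_memory * bdi_dirty_background_ratio // 100:
            return True
    return False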
 
OTOH I'm not particularly attached to this interface. Especially since on a
lot of today's machines, 1% is rather big so people might often end up
using dirty_background_bytes anyway.

> bdi->dirty_background_bytes is, however, reliable, and gives users
> total control. If we export this interface alone, I'd imagine users
> who want to control centisecs-to-write could run a simple script to
> periodically get the write bandwidth value out of the existing bdi
> interface and echo it into bdi->dirty_background_bytes. Which makes
> simple yet good enough centisecs-to-write controlling.
> 
> So what do you think about exporting a really dumb
> bdi->dirty_background_bytes, which will effectively give smart users
> the freedom to do smart control over per-bdi background writeback
> threshold? The users are offered the freedom to do his own bandwidth
> 

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-26 Thread Fengguang Wu
On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > > From: Namjae Jeon 
> > > 
> > > This patch is based on suggestion by Wu Fengguang:
> > > https://lkml.org/lkml/2011/8/19/19
> > > 
> > > kernel has mechanism to do writeback as per dirty_ratio and 
> > > dirty_background
> > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > 
> > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > > writecache to control per bdi dirty limits and task throttling.
> > > 
> > > However, there might be a usecase where user wants a per bdi writeback 
> > > tuning
> > > parameter to flush dirty data once per bdi dirty data reach a threshold
> > > especially at NFS server.
> > > 
> > > dirty_background_centisecs provides an interface where user can tune
> > > background writeback start threshold using
> > > /sys/block/sda/bdi/dirty_background_centisecs
> > > 
> > > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > > estimation to start background writeback.
>   The functionality you describe, i.e. start flushing bdi when there's
> reasonable amount of dirty data on it, looks sensible and useful. However
> I'm not so sure whether the interface you propose is the right one.
> Traditionally, we allow user to set amount of dirty data (either in bytes
> or percentage of memory) when background writeback should start. You
> propose setting the amount of data in centisecs-to-write. Why that
> difference? Also this interface ties our throughput estimation code (which
> is an implementation detail of current dirty throttling) with the userspace
> API. So we'd have to maintain the estimation code forever, possibly also
> face problems when we change the estimation code (and thus estimates in
> some cases) and users will complain that the values they set originally no
> longer work as they used to.

Yes, that bandwidth estimation is not all that (and in theory cannot
be made) reliable which may be a surprise to the user. Which make the
interface flaky.

> Also, as with each knob, there's a problem how to properly set its value?
> Most admins won't know about the knob and so won't touch it. Others might
> know about the knob but will have hard time figuring out what value should
> they set. So if there's a new knob, it should have a sensible initial
> value. And since this feature looks like a useful one, it shouldn't be
> zero.

Agreed in principle. There seems be no reasonable defaults for the
centisecs-to-write interface, mainly due to its inaccurate nature,
especially the initial value may be wildly wrong on fresh system
bootup. This is also true for your proposed interfaces, see below.

> So my personal preference would be to have bdi->dirty_background_ratio and
> bdi->dirty_background_bytes and start background writeback whenever
> one of global background limit and per-bdi background limit is exceeded. I
> think this interface will do the job as well and it's easier to maintain in
> future.

bdi->dirty_background_ratio, if I understand its semantics right, is
unfortunately flaky in the same principle as centisecs-to-write,
because it relies on the (implicitly estimation of) writeout
proportions. The writeout proportions for each bdi starts with 0,
which is even worse than the 100MB/s initial value for
bdi->write_bandwidth and will trigger background writeback on the
first write.

bdi->dirty_background_bytes is, however, reliable, and gives users
total control. If we export this interface alone, I'd imagine users
who want to control centisecs-to-write could run a simple script to
periodically get the write bandwidth value out of the existing bdi
interface and echo it into bdi->dirty_background_bytes. Which makes
simple yet good enough centisecs-to-write controlling.
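
(A sketch of such a script, for illustration. It reads the kernel's bandwidth
estimate from the existing per-bdi debugfs stats file and writes the derived
byte threshold into the per-bdi dirty_background_bytes knob under discussion;
that knob and its sysfs path are assumptions here, since the interface is only
being proposed.)

import time

BDI = "8:0"                                          # e.g. the bdi of /dev/sda
STATS = "/sys/kernel/debug/bdi/%s/stats" % BDI       # existing debugfs stats
KNOB = "/sys/block/sda/bdi/dirty_background_bytes"   # proposed knob (assumed path)
TARGET_CENTISECS = 100                               # keep ~1s worth of dirty data

def write_bandwidth_bytes_per_sec():
    with open(STATS) as f:
        for line in f:
            if line.startswith("BdiWriteBandwidth:"):
                return int(line.split()[1]) * 1024   # stats value is in kBps
    raise RuntimeError("BdiWriteBandwidth not found in %s" % STATS)

while True:
    thresh = write_bandwidth_bytes_per_sec() * TARGET_CENTISECS // 100
    with open(KNOB, "w") as f:
        f.write(str(thresh))
    time.sleep(5)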

So what do you think about exporting a really dumb
bdi->dirty_background_bytes, which will effectively give smart users
the freedom to do smart control over per-bdi background writeback
threshold? The users are offered the freedom to do his own bandwidth
estimation and choose not to rely on the kernel estimation, which will
free us from the burden of maintaining a flaky interface as well. :)

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-26 Thread Jan Kara
On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
 On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
  On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
   On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
From: Namjae Jeon namjae.j...@samsung.com

This patch is based on suggestion by Wu Fengguang:
https://lkml.org/lkml/2011/8/19/19

kernel has mechanism to do writeback as per dirty_ratio and 
dirty_background
ratio. It also maintains per task dirty rate limit to keep balance of
dirty pages at any given instance by doing bdi bandwidth estimation.

Kernel also has max_ratio/min_ratio tunables to specify percentage of
writecache to control per bdi dirty limits and task throttling.

However, there might be a usecase where user wants a per bdi writeback 
tuning
parameter to flush dirty data once per bdi dirty data reach a threshold
especially at NFS server.

dirty_background_centisecs provides an interface where user can tune
background writeback start threshold using
/sys/block/sda/bdi/dirty_background_centisecs

dirty_background_centisecs is used alongwith average bdi write bandwidth
estimation to start background writeback.
The functionality you describe, i.e. start flushing bdi when there's
  reasonable amount of dirty data on it, looks sensible and useful. However
  I'm not so sure whether the interface you propose is the right one.
  Traditionally, we allow user to set amount of dirty data (either in bytes
  or percentage of memory) when background writeback should start. You
  propose setting the amount of data in centisecs-to-write. Why that
  difference? Also this interface ties our throughput estimation code (which
  is an implementation detail of current dirty throttling) with the userspace
  API. So we'd have to maintain the estimation code forever, possibly also
  face problems when we change the estimation code (and thus estimates in
  some cases) and users will complain that the values they set originally no
  longer work as they used to.
 
 Yes, that bandwidth estimation is not all that (and in theory cannot
 be made) reliable which may be a surprise to the user. Which make the
 interface flaky.
 
  Also, as with each knob, there's a problem how to properly set its value?
  Most admins won't know about the knob and so won't touch it. Others might
  know about the knob but will have hard time figuring out what value should
  they set. So if there's a new knob, it should have a sensible initial
  value. And since this feature looks like a useful one, it shouldn't be
  zero.
 
 Agreed in principle. There seems be no reasonable defaults for the
 centisecs-to-write interface, mainly due to its inaccurate nature,
 especially the initial value may be wildly wrong on fresh system
 bootup. This is also true for your proposed interfaces, see below.
 
 So my personal preference would be to have bdi->dirty_background_ratio and
 bdi->dirty_background_bytes and start background writeback whenever
  one of global background limit and per-bdi background limit is exceeded. I
  think this interface will do the job as well and it's easier to maintain in
  future.
 
 bdi->dirty_background_ratio, if I understand its semantics right, is
 unfortunately flaky in the same principle as centisecs-to-write,
 because it relies on the (implicitly estimation of) writeout
 proportions. The writeout proportions for each bdi starts with 0,
 which is even worse than the 100MB/s initial value for
 bdi->write_bandwidth and will trigger background writeback on the
 first write.
  Well, I meant bdi->dirty_background_ratio wouldn't use writeout proportion
estimates at all. Limit would be
  dirtiable_memory * bdi->dirty_background_ratio.

After all we want to start writeout to bdi when we have enough pages to
reasonably load the device for a while which has nothing to do with how
much is written to this device as compared to other devices.
 
OTOH I'm not particularly attached to this interface. Especially since on a
lot of today's machines, 1% is rather big so people might often end up
using dirty_background_bytes anyway.

 bdi->dirty_background_bytes is, however, reliable, and gives users
 total control. If we export this interface alone, I'd imagine users
 who want to control centisecs-to-write could run a simple script to
 periodically get the write bandwidth value out of the existing bdi
 interface and echo it into bdi->dirty_background_bytes. Which makes
 simple yet good enough centisecs-to-write controlling.
 
 So what do you think about exporting a really dumb
 bdi->dirty_background_bytes, which will effectively give smart users
 the freedom to do smart control over per-bdi background writeback
 threshold? The users are offered the freedom to do his own bandwidth
 estimation and choose not to rely on the kernel estimation, which will
 free us from the burden of maintaining a flaky interface as well. :)
  That's fine with me. 

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-25 Thread Namjae Jeon
2012/9/25, Namjae Jeon :
> 2012/9/25, Dave Chinner :
>> On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
>>> [ CC FS and MM lists ]
>>>
>>> Patch looks good to me, however we need to be careful because it's
>>> introducing a new interface. So it's desirable to get some acks from
>>> the FS/MM developers.
>>>
>>> Thanks,
>>> Fengguang
>>>
>>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>>> > From: Namjae Jeon 
>>> >
>>> > This patch is based on suggestion by Wu Fengguang:
>>> > https://lkml.org/lkml/2011/8/19/19
>>> >
>>> > kernel has mechanism to do writeback as per dirty_ratio and
>>> > dirty_background
>>> > ratio. It also maintains per task dirty rate limit to keep balance of
>>> > dirty pages at any given instance by doing bdi bandwidth estimation.
>>> >
>>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
>>> > writecache to control per bdi dirty limits and task throttling.
>>> >
>>> > However, there might be a usecase where user wants a per bdi writeback
>>> > tuning
>>> > parameter to flush dirty data once per bdi dirty data reach a
>>> > threshold
>>> > especially at NFS server.
>>> >
>>> > dirty_background_centisecs provides an interface where user can tune
>>> > background writeback start threshold using
>>> > /sys/block/sda/bdi/dirty_background_centisecs
>>> >
>>> > dirty_background_centisecs is used alongwith average bdi write
>>> > bandwidth
>>> > estimation to start background writeback.
>>> >
>>> > One of the use case to demonstrate the patch functionality can be
>>> > on NFS setup:-
>>> > We have a NFS setup with ethernet line of 100Mbps, while the USB
>>> > disk is attached to server, which has a local speed of 25MBps. Server
>>> > and client both are arm target boards.
>>> >
>>> > Now if we perform a write operation over NFS (client to server), as
>>> > per the network speed, data can travel at max speed of 100Mbps. But
>>> > if we check the default write speed of USB hdd over NFS it comes
>>> > around to 8MB/sec, far below the speed of network.
>>> >
>>> > Reason being is as per the NFS logic, during write operation,
>>> > initially
>>> > pages are dirtied on NFS client side, then after reaching the dirty
>>> > threshold/writeback limit (or in case of sync) data is actually sent
>>> > to NFS server (so now again pages are dirtied on server side). This
>>> > will be done in COMMIT call from client to server i.e if 100MB of data
>>> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9
>>> > seconds.
>>> >
>>> > After the data is received, now it will take approx 100/25 ~4 Seconds
>>> > to
>>> > write the data to USB Hdd on server side. Hence making the overall
>>> > time
>>> > to write this much of data ~12 seconds, which in practically comes out
>>> > to
>>> > be near 7 to 8MB/second. After this a COMMIT response will be sent to
>>> > NFS
>>> > client.
>>> >
>>> > However we may improve this write performace by making the use of NFS
>>> > server idle time i.e while data is being received from the client,
>>> > simultaneously initiate the writeback thread on server side. So
>>> > instead
>>> > of waiting for the complete data to come and then start the writeback,
>>> > we can work in parallel while the network is still busy in receiving
>>> > the
>>> > data. Hence in this way overall performace will be improved.
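
(To make the arithmetic above concrete, a small back-of-the-envelope
calculation using rough numbers of my own: ~12 MB/s of effective payload on a
100 Mbps link and the ~25 MB/s local disk speed quoted in the thread.)

file_mb   = 100.0   # one batch of dirty data sent in a COMMIT
net_mbps  = 12.0    # effective payload rate of a 100 Mbps link, roughly
disk_mbps = 25.0    # local sequential write speed of the USB disk

net_time  = file_mb / net_mbps     # ~8.3 s to receive the data
disk_time = file_mb / disk_mbps    # ~4.0 s to write it out locally

print("serialized: %.1f MB/s" % (file_mb / (net_time + disk_time)))   # ~8 MB/s
print("overlapped: %.1f MB/s" % (file_mb / max(net_time, disk_time))) # ~12 MB/s

The serialized figure matches the ~8 MB/s seen with default settings; the
overlapped figure is the ceiling that early flushing moves towards (the thread
reports ~11 MB/s in practice).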
>>> >
>>> > If we tune dirty_background_centisecs, we can see there
>>> > is increase in the performace and it comes out to be ~ 11MB/seconds.
>>> > Results are:-
>>> >
>>> > Write test(create a 1 GB file) result at 'NFS client' after changing
>>> > /sys/block/sda/bdi/dirty_background_centisecs
>>> > on  *** NFS Server only - not on NFS Client 
>>
>
> Hi. Dave.
>
>> What is the configuration of the client and server? How much RAM,
>> what their dirty_* parameters are set to, network speed, server disk
>> speed for local sequential IO, etc?
> these results are on ARM, 512MB RAM and XFS over NFS with default
> writeback settings (only our writeback setting - dirty_background_centisecs
> changed at nfs server only). Network speed is ~100MB/sec and
Sorry, there is typo:)
^^100Mb/sec
> local disk speed is ~25MB/sec.
>
>>
>>> > -
>>> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
>>> > -
>>> > |  | default = 0 | 300 centisec| 200 centisec| 100 centisec |
>>> > -
>>> > |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
>>> > -
>>> > |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
>>> > | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
>>> > |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
>>> > |  262144  |  

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-25 Thread Namjae Jeon
2012/9/25, Jan Kara :
> On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > From: Namjae Jeon 
>> >
>> > This patch is based on suggestion by Wu Fengguang:
>> > https://lkml.org/lkml/2011/8/19/19
>> >
>> > kernel has mechanism to do writeback as per dirty_ratio and
>> > dirty_background
>> > ratio. It also maintains per task dirty rate limit to keep balance of
>> > dirty pages at any given instance by doing bdi bandwidth estimation.
>> >
>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
>> > writecache to control per bdi dirty limits and task throttling.
>> >
>> > However, there might be a usecase where user wants a per bdi writeback
>> > tuning
>> > parameter to flush dirty data once per bdi dirty data reach a threshold
>> > especially at NFS server.
>> >
>> > dirty_background_centisecs provides an interface where user can tune
>> > background writeback start threshold using
>> > /sys/block/sda/bdi/dirty_background_centisecs
>> >
>> > dirty_background_centisecs is used alongwith average bdi write
>> > bandwidth
>> > estimation to start background writeback.
>   The functionality you describe, i.e. start flushing bdi when there's
> reasonable amount of dirty data on it, looks sensible and useful. However
> I'm not so sure whether the interface you propose is the right one.
> Traditionally, we allow user to set amount of dirty data (either in bytes
> or percentage of memory) when background writeback should start. You
> propose setting the amount of data in centisecs-to-write. Why that
> difference? Also this interface ties our throughput estimation code (which
> is an implementation detail of current dirty throttling) with the userspace
> API. So we'd have to maintain the estimation code forever, possibly also
> face problems when we change the estimation code (and thus estimates in
> some cases) and users will complain that the values they set originally no
> longer work as they used to.
>
> Also, as with each knob, there's a problem how to properly set its value?
> Most admins won't know about the knob and so won't touch it. Others might
> know about the knob but will have hard time figuring out what value should
> they set. So if there's a new knob, it should have a sensible initial
> value. And since this feature looks like a useful one, it shouldn't be
> zero.
>
> So my personal preference would be to have bdi->dirty_background_ratio and
> bdi->dirty_background_bytes and start background writeback whenever
> one of global background limit and per-bdi background limit is exceeded. I
> think this interface will do the job as well and it's easier to maintain in
> future.
Hi Jan.
Thanks for review and your opinion.

Hi. Wu.
How about adding per-bdi - bdi->dirty_background_ratio and
bdi->dirty_background_bytes interface as suggested by Jan?

Thanks.
>
>   Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-25 Thread Namjae Jeon
2012/9/25, Dave Chinner :
> On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
>> [ CC FS and MM lists ]
>>
>> Patch looks good to me, however we need to be careful because it's
>> introducing a new interface. So it's desirable to get some acks from
>> the FS/MM developers.
>>
>> Thanks,
>> Fengguang
>>
>> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
>> > From: Namjae Jeon 
>> >
>> > This patch is based on suggestion by Wu Fengguang:
>> > https://lkml.org/lkml/2011/8/19/19
>> >
>> > kernel has mechanism to do writeback as per dirty_ratio and
>> > dirty_background
>> > ratio. It also maintains per task dirty rate limit to keep balance of
>> > dirty pages at any given instance by doing bdi bandwidth estimation.
>> >
>> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
>> > writecache to control per bdi dirty limits and task throttling.
>> >
>> > However, there might be a usecase where user wants a per bdi writeback
>> > tuning
>> > parameter to flush dirty data once per bdi dirty data reach a threshold
>> > especially at NFS server.
>> >
>> > dirty_background_centisecs provides an interface where user can tune
>> > background writeback start threshold using
>> > /sys/block/sda/bdi/dirty_background_centisecs
>> >
>> > dirty_background_centisecs is used alongwith average bdi write
>> > bandwidth
>> > estimation to start background writeback.
>> >
>> > One of the use case to demonstrate the patch functionality can be
>> > on NFS setup:-
>> > We have a NFS setup with ethernet line of 100Mbps, while the USB
>> > disk is attached to server, which has a local speed of 25MBps. Server
>> > and client both are arm target boards.
>> >
>> > Now if we perform a write operation over NFS (client to server), as
>> > per the network speed, data can travel at max speed of 100Mbps. But
>> > if we check the default write speed of USB hdd over NFS it comes
>> > around to 8MB/sec, far below the speed of network.
>> >
>> > Reason being is as per the NFS logic, during write operation, initially
>> > pages are dirtied on NFS client side, then after reaching the dirty
>> > threshold/writeback limit (or in case of sync) data is actually sent
>> > to NFS server (so now again pages are dirtied on server side). This
>> > will be done in COMMIT call from client to server i.e if 100MB of data
>> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9
>> > seconds.
>> >
>> > After the data is received, now it will take approx 100/25 ~4 Seconds
>> > to
>> > write the data to USB Hdd on server side. Hence making the overall time
>> > to write this much of data ~12 seconds, which in practically comes out
>> > to
>> > be near 7 to 8MB/second. After this a COMMIT response will be sent to
>> > NFS
>> > client.
>> >
>> > However we may improve this write performace by making the use of NFS
>> > server idle time i.e while data is being received from the client,
>> > simultaneously initiate the writeback thread on server side. So instead
>> > of waiting for the complete data to come and then start the writeback,
>> > we can work in parallel while the network is still busy in receiving
>> > the
>> > data. Hence in this way overall performace will be improved.
>> >
>> > If we tune dirty_background_centisecs, we can see there
>> > is increase in the performace and it comes out to be ~ 11MB/seconds.
>> > Results are:-
>> >
>> > Write test(create a 1 GB file) result at 'NFS client' after changing
>> > /sys/block/sda/bdi/dirty_background_centisecs
>> > on  *** NFS Server only - not on NFS Client 
>

Hi. Dave.

> What is the configuration of the client and server? How much RAM,
> what their dirty_* parameters are set to, network speed, server disk
> speed for local sequential IO, etc?
these results are on ARM, 512MB RAM and XFS over NFS with default
writeback settings (only our writeback setting - dirty_background_centisecs
changed at nfs server only). Network speed is ~100MB/sec and
local disk speed is ~25MB/sec.

>
>> > -
>> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
>> > -
>> > |  | default = 0 | 300 centisec| 200 centisec| 100 centisec |
>> > -
>> > |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
>> > -
>> > |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
>> > | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
>> > |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
>> > |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
>> > |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
>> > |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
>> > |   32768  |  

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-25 Thread Namjae Jeon
2012/9/25, Dave Chinner da...@fromorbit.com:
 On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
 [ CC FS and MM lists ]

 Patch looks good to me, however we need to be careful because it's
 introducing a new interface. So it's desirable to get some acks from
 the FS/MM developers.

 Thanks,
 Fengguang

 On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
  From: Namjae Jeon namjae.j...@samsung.com
 
  This patch is based on suggestion by Wu Fengguang:
  https://lkml.org/lkml/2011/8/19/19
 
  kernel has mechanism to do writeback as per dirty_ratio and
  dirty_background
  ratio. It also maintains per task dirty rate limit to keep balance of
  dirty pages at any given instance by doing bdi bandwidth estimation.
 
  Kernel also has max_ratio/min_ratio tunables to specify percentage of
  writecache to control per bdi dirty limits and task throttling.
 
  However, there might be a usecase where user wants a per bdi writeback
  tuning
  parameter to flush dirty data once per bdi dirty data reach a threshold
  especially at NFS server.
 
  dirty_background_centisecs provides an interface where user can tune
  background writeback start threshold using
  /sys/block/sda/bdi/dirty_background_centisecs
 
  dirty_background_centisecs is used alongwith average bdi write
  bandwidth
  estimation to start background writeback.
 
  One of the use case to demonstrate the patch functionality can be
  on NFS setup:-
  We have a NFS setup with ethernet line of 100Mbps, while the USB
  disk is attached to server, which has a local speed of 25MBps. Server
  and client both are arm target boards.
 
  Now if we perform a write operation over NFS (client to server), as
  per the network speed, data can travel at max speed of 100Mbps. But
  if we check the default write speed of USB hdd over NFS it comes
  around to 8MB/sec, far below the speed of network.
 
  Reason being is as per the NFS logic, during write operation, initially
  pages are dirtied on NFS client side, then after reaching the dirty
  threshold/writeback limit (or in case of sync) data is actually sent
  to NFS server (so now again pages are dirtied on server side). This
  will be done in COMMIT call from client to server i.e if 100MB of data
  is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9
  seconds.
 
  After the data is received, now it will take approx 100/25 ~4 Seconds
  to
  write the data to USB Hdd on server side. Hence making the overall time
  to write this much of data ~12 seconds, which in practically comes out
  to
  be near 7 to 8MB/second. After this a COMMIT response will be sent to
  NFS
  client.
 
  However we may improve this write performace by making the use of NFS
  server idle time i.e while data is being received from the client,
  simultaneously initiate the writeback thread on server side. So instead
  of waiting for the complete data to come and then start the writeback,
  we can work in parallel while the network is still busy in receiving
  the
  data. Hence in this way overall performace will be improved.
 
  If we tune dirty_background_centisecs, we can see there
  is increase in the performace and it comes out to be ~ 11MB/seconds.
  Results are:-
 
  Write test(create a 1 GB file) result at 'NFS client' after changing
  /sys/block/sda/bdi/dirty_background_centisecs
  on  *** NFS Server only - not on NFS Client 


Hi. Dave.

 What is the configuration of the client and server? How much RAM,
 what their dirty_* parameters are set to, network speed, server disk
 speed for local sequential IO, etc?
these results are on ARM, 512MB RAM and XFS over NFS with default
writeback settings (only our writeback setting - dirty_background_centisecs
changed at nfs server only). Network speed is ~100MB/sec and
local disk speed is ~25MB/sec.


  -
  |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
  -
  |  | default = 0 | 300 centisec| 200 centisec| 100 centisec |
  -
  |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
  -
  |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
  | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
  |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
  |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
  |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
  |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
  |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
  |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
  |8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
  |4096  |  8.20MB/sec |  


Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-25 Thread Namjae Jeon
2012/9/25, Namjae Jeon linkinj...@gmail.com:
 2012/9/25, Dave Chinner da...@fromorbit.com:
 On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
 [ CC FS and MM lists ]

 Patch looks good to me, however we need to be careful because it's
 introducing a new interface. So it's desirable to get some acks from
 the FS/MM developers.

 Thanks,
 Fengguang

 On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
  From: Namjae Jeon namjae.j...@samsung.com
 
  This patch is based on suggestion by Wu Fengguang:
  https://lkml.org/lkml/2011/8/19/19
 
  kernel has mechanism to do writeback as per dirty_ratio and
  dirty_background
  ratio. It also maintains per task dirty rate limit to keep balance of
  dirty pages at any given instance by doing bdi bandwidth estimation.
 
  Kernel also has max_ratio/min_ratio tunables to specify percentage of
  writecache to control per bdi dirty limits and task throttling.
 
  However, there might be a usecase where user wants a per bdi writeback
  tuning
  parameter to flush dirty data once per bdi dirty data reach a
  threshold
  especially at NFS server.
 
  dirty_background_centisecs provides an interface where user can tune
  background writeback start threshold using
  /sys/block/sda/bdi/dirty_background_centisecs
 
  dirty_background_centisecs is used alongwith average bdi write
  bandwidth
  estimation to start background writeback.
 
  One of the use case to demonstrate the patch functionality can be
  on NFS setup:-
  We have a NFS setup with ethernet line of 100Mbps, while the USB
  disk is attached to server, which has a local speed of 25MBps. Server
  and client both are arm target boards.
 
  Now if we perform a write operation over NFS (client to server), as
  per the network speed, data can travel at max speed of 100Mbps. But
  if we check the default write speed of USB hdd over NFS it comes
  around to 8MB/sec, far below the speed of network.
 
  Reason being is as per the NFS logic, during write operation,
  initially
  pages are dirtied on NFS client side, then after reaching the dirty
  threshold/writeback limit (or in case of sync) data is actually sent
  to NFS server (so now again pages are dirtied on server side). This
  will be done in COMMIT call from client to server i.e if 100MB of data
  is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9
  seconds.
 
  After the data is received, now it will take approx 100/25 ~4 Seconds
  to
  write the data to USB Hdd on server side. Hence making the overall
  time
  to write this much of data ~12 seconds, which in practically comes out
  to
  be near 7 to 8MB/second. After this a COMMIT response will be sent to
  NFS
  client.
 
  However we may improve this write performace by making the use of NFS
  server idle time i.e while data is being received from the client,
  simultaneously initiate the writeback thread on server side. So
  instead
  of waiting for the complete data to come and then start the writeback,
  we can work in parallel while the network is still busy in receiving
  the
  data. Hence in this way overall performace will be improved.
 
  If we tune dirty_background_centisecs, we can see there
  is increase in the performace and it comes out to be ~ 11MB/seconds.
  Results are:-
 
  Write test(create a 1 GB file) result at 'NFS client' after changing
  /sys/block/sda/bdi/dirty_background_centisecs
  on  *** NFS Server only - not on NFS Client 


 Hi. Dave.

 What is the configuration of the client and server? How much RAM,
 what their dirty_* parameters are set to, network speed, server disk
 speed for local sequential IO, etc?
 these results are on ARM, 512MB RAM and XFS over NFS with default
 writeback settings (only our writeback setting - dirty_background_centisecs
 changed at nfs server only). Network speed is ~100MB/sec and
Sorry, there is typo:)
^^100Mb/sec
 local disk speed is ~25MB/sec.


  -
  |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
  -
  |  | default = 0 | 300 centisec| 200 centisec| 100 centisec |
  -
  |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
  -
  |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
  | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
  |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
  |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
  |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
  |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
  |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
  |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-24 Thread Dave Chinner
On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
> [ CC FS and MM lists ]
> 
> Patch looks good to me, however we need to be careful because it's
> introducing a new interface. So it's desirable to get some acks from
> the FS/MM developers.
> 
> Thanks,
> Fengguang
> 
> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > From: Namjae Jeon 
> > 
> > This patch is based on suggestion by Wu Fengguang:
> > https://lkml.org/lkml/2011/8/19/19
> > 
> > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > ratio. It also maintains per task dirty rate limit to keep balance of
> > dirty pages at any given instance by doing bdi bandwidth estimation.
> > 
> > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > writecache to control per bdi dirty limits and task throttling.
> > 
> > However, there might be a usecase where user wants a per bdi writeback 
> > tuning
> > parameter to flush dirty data once per bdi dirty data reach a threshold
> > especially at NFS server.
> > 
> > dirty_background_centisecs provides an interface where user can tune
> > background writeback start threshold using
> > /sys/block/sda/bdi/dirty_background_centisecs
> > 
> > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > estimation to start background writeback.
> > 
> > One of the use case to demonstrate the patch functionality can be
> > on NFS setup:-
> > We have a NFS setup with ethernet line of 100Mbps, while the USB
> > disk is attached to server, which has a local speed of 25MBps. Server
> > and client both are arm target boards.
> > 
> > Now if we perform a write operation over NFS (client to server), as
> > per the network speed, data can travel at max speed of 100Mbps. But
> > if we check the default write speed of USB hdd over NFS it comes
> > around to 8MB/sec, far below the speed of network.
> > 
> > Reason being is as per the NFS logic, during write operation, initially
> > pages are dirtied on NFS client side, then after reaching the dirty
> > threshold/writeback limit (or in case of sync) data is actually sent
> > to NFS server (so now again pages are dirtied on server side). This
> > will be done in COMMIT call from client to server i.e if 100MB of data
> > is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 seconds.
> > 
> > After the data is received, now it will take approx 100/25 ~4 Seconds to
> > write the data to USB Hdd on server side. Hence making the overall time
> > to write this much of data ~12 seconds, which in practically comes out to
> > be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS
> > client.
> > 
> > However we may improve this write performace by making the use of NFS
> > server idle time i.e while data is being received from the client,
> > simultaneously initiate the writeback thread on server side. So instead
> > of waiting for the complete data to come and then start the writeback,
> > we can work in parallel while the network is still busy in receiving the
> > data. Hence in this way overall performace will be improved.
> > 
> > If we tune dirty_background_centisecs, we can see there
> > is increase in the performace and it comes out to be ~ 11MB/seconds.
> > Results are:-
> > 
> > Write test(create a 1 GB file) result at 'NFS client' after changing 
> > /sys/block/sda/bdi/dirty_background_centisecs 
> > on  *** NFS Server only - not on NFS Client 

What is the configuration of the client and server? How much RAM,
what their dirty_* parameters are set to, network speed, server disk
speed for local sequential IO, etc?

> > ---------------------------------------------------------------------
> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> > ---------------------------------------------------------------------
> > |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> > ---------------------------------------------------------------------
> > |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
> > ---------------------------------------------------------------------
> > |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
> > | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
> > |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
> > |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
> > |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
> > |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
> > |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
> > |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
> > |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
> > |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
> > ---------------------------------------------------------------------

While this set of numbers looks good, it's very limited in scope. I can't
evaluate whether the change is worthwhile or not from this test. If I was
writing this patch, the 
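
To make the semantics of the proposed knob concrete, here is a minimal,
self-contained sketch of the threshold calculation it implies. This is not
the posted kernel code; the function name and the byte-based units are
assumptions made for illustration, while the patch itself works against the
bdi's write bandwidth estimate inside the writeback code:

/*
 * Illustration only -- not the kernel implementation. The idea behind the
 * dirty_background_centisecs knob: start background writeback on a bdi
 * once roughly "centisecs worth" of dirty data, measured against that
 * bdi's estimated write bandwidth, has accumulated.
 */
#include <stdio.h>

/* threshold (bytes) ~= estimated bandwidth (bytes/s) * centisecs / 100 */
static unsigned long long background_thresh_bytes(unsigned long long bw_bytes_per_sec,
                                                  unsigned int centisecs)
{
        if (!centisecs)
                return 0;       /* 0 = knob unset, only the global limits apply */
        return bw_bytes_per_sec * centisecs / 100;
}

int main(void)
{
        unsigned long long bw = 25ULL * 1024 * 1024;    /* ~25MB/s USB disk from the test above */
        unsigned int settings[] = { 100, 200, 300 };

        for (int i = 0; i < 3; i++)
                printf("%u centisecs -> start flushing at ~%llu KB dirty\n",
                       settings[i], background_thresh_bytes(bw, settings[i]) / 1024);
        return 0;
}

With the 25MB/s disk in this setup, a setting of 100 centisecs corresponds to
roughly 25MB of dirty data accumulating on the bdi before background flushing
starts.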

Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-24 Thread Jan Kara
On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > From: Namjae Jeon 
> > 
> > This patch is based on a suggestion by Wu Fengguang:
> > https://lkml.org/lkml/2011/8/19/19
> > 
> > The kernel has a mechanism to do writeback as per dirty_ratio and
> > dirty_background_ratio. It also maintains a per-task dirty rate limit to keep
> > the dirty pages balanced at any given instance by doing bdi bandwidth
> > estimation.
> > 
> > The kernel also has max_ratio/min_ratio tunables to specify the percentage of
> > write cache used to control per-bdi dirty limits and task throttling.
> > 
> > However, there might be a use case where the user wants a per-bdi writeback
> > tuning parameter to flush dirty data once the per-bdi dirty data reaches a
> > threshold, especially at an NFS server.
> > 
> > dirty_background_centisecs provides an interface where the user can tune the
> > background writeback start threshold using
> > /sys/block/sda/bdi/dirty_background_centisecs
> > 
> > dirty_background_centisecs is used along with the average bdi write bandwidth
> > estimation to start background writeback.
  The functionality you describe, i.e. start flushing a bdi when there's a
reasonable amount of dirty data on it, looks sensible and useful. However,
I'm not so sure whether the interface you propose is the right one.
Traditionally, we allow the user to set the amount of dirty data (either in
bytes or as a percentage of memory) at which background writeback should
start. You propose setting the amount of data in centisecs-to-write. Why that
difference? Also, this interface ties our throughput estimation code (which
is an implementation detail of the current dirty throttling) to the userspace
API. So we'd have to maintain the estimation code forever, and possibly also
face problems when we change the estimation code (and thus the estimates in
some cases) and users complain that the values they set originally no
longer work as they used to.

Also, as with each knob, there's the problem of how to properly set its value.
Most admins won't know about the knob and so won't touch it. Others might
know about the knob but will have a hard time figuring out what value they
should set. So if there's a new knob, it should have a sensible initial
value. And since this feature looks like a useful one, it shouldn't be
zero.

So my personal preference would be to have bdi->dirty_background_ratio and
bdi->dirty_background_bytes and start background writeback whenever either
the global background limit or the per-bdi background limit is exceeded. I
think this interface will do the job as well, and it's easier to maintain in
the future.

Honza
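
As a rough illustration of the interface suggested above (not code from any
posted patch; the structure layout, the helper name, and the base chosen for
the ratio are assumptions made for this example), background writeback would
be triggered when either the global background threshold or the per-bdi limit
is exceeded:

/*
 * Sketch of the per-bdi background limits described above (invented names,
 * not kernel code): flushing starts when either the global background
 * threshold or the bdi's own bytes/ratio limit is exceeded.
 */
#include <stdbool.h>
#include <stdio.h>

struct bdi_bg_limits {
        unsigned long long dirty_background_bytes;      /* 0 = unset */
        unsigned int dirty_background_ratio;            /* percent of the bdi dirty limit, 0 = unset */
};

static bool over_background_thresh(unsigned long long global_dirty,
                                   unsigned long long global_bg_thresh,
                                   unsigned long long bdi_dirty,
                                   unsigned long long bdi_dirty_limit,
                                   const struct bdi_bg_limits *l)
{
        if (global_dirty > global_bg_thresh)
                return true;            /* existing global trigger */
        if (l->dirty_background_bytes && bdi_dirty > l->dirty_background_bytes)
                return true;            /* per-bdi byte limit */
        if (l->dirty_background_ratio &&
            bdi_dirty > bdi_dirty_limit * l->dirty_background_ratio / 100)
                return true;            /* per-bdi ratio limit */
        return false;
}

int main(void)
{
        struct bdi_bg_limits l = { .dirty_background_bytes = 25ULL << 20 };     /* 25MB */

        /* 30MB dirty on this bdi exceeds its 25MB limit even though the
         * global counters (100MB dirty vs. a 200MB background threshold)
         * would not start writeback on their own. */
        printf("flush: %d\n", over_background_thresh(100ULL << 20, 200ULL << 20,
                                                     30ULL << 20, 100ULL << 20, &l));
        return 0;
}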

> > One of the use cases to demonstrate the patch functionality is an NFS setup:
> > we have an NFS setup with a 100Mbps Ethernet line, while a USB disk with a
> > local write speed of 25MBps is attached to the server. Server and client are
> > both ARM target boards.
> > 
> > Now if we perform a write operation over NFS (client to server), then as per
> > the network speed, data can travel at a maximum of 100Mbps. But if we check
> > the default write speed of the USB HDD over NFS, it comes out to around
> > 8MB/sec, far below the speed of the network.
> > 
> > The reason is that, as per the NFS logic, during a write operation pages are
> > initially dirtied on the NFS client side; then, after reaching the dirty
> > threshold/writeback limit (or in case of sync), the data is actually sent to
> > the NFS server (so now pages are dirtied again on the server side). This is
> > done in the COMMIT call from client to server, i.e. if 100MB of data is
> > dirtied and sent, it will take a minimum of 100MB/100Mbps ~ 8-9 seconds.
> > 
> > After the data is received, it will take approx 100MB/25MBps ~ 4 seconds to
> > write the data to the USB HDD on the server side. This makes the overall time
> > to write this much data ~12 seconds, which in practice comes out to be near
> > 7 to 8MB/second. After this, a COMMIT response is sent to the NFS client.
> > 
> > However, we may improve this write performance by making use of the NFS
> > server's idle time, i.e. while data is being received from the client,
> > simultaneously initiate the writeback thread on the server side. So instead
> > of waiting for the complete data to arrive and only then starting writeback,
> > we can work in parallel while the network is still busy receiving the data.
> > In this way the overall performance is improved.
> > 
> > If we tune dirty_background_centisecs, we can see an increase in the
> > performance, and it comes out to be ~11MB/sec.
> > Results are:-
> > 
> > Write test (create a 1 GB file) result at 'NFS client' after changing 
> > /sys/block/sda/bdi/dirty_background_centisecs 
> > on  *** NFS Server only - not on NFS Client 
> > 
> > -
> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> > 


Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

2012-09-20 Thread Fengguang Wu
[ CC FS and MM lists ]

Patch looks good to me; however, we need to be careful because it's
introducing a new interface. So it's desirable to get some acks from
the FS/MM developers.

Thanks,
Fengguang

On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> From: Namjae Jeon 
> 
> This patch is based on a suggestion by Wu Fengguang:
> https://lkml.org/lkml/2011/8/19/19
> 
> The kernel has a mechanism to do writeback as per dirty_ratio and
> dirty_background_ratio. It also maintains a per-task dirty rate limit to keep
> the dirty pages balanced at any given instance by doing bdi bandwidth
> estimation.
> 
> The kernel also has max_ratio/min_ratio tunables to specify the percentage of
> write cache used to control per-bdi dirty limits and task throttling.
> 
> However, there might be a use case where the user wants a per-bdi writeback
> tuning parameter to flush dirty data once the per-bdi dirty data reaches a
> threshold, especially at an NFS server.
> 
> dirty_background_centisecs provides an interface where the user can tune the
> background writeback start threshold using
> /sys/block/sda/bdi/dirty_background_centisecs
> 
> dirty_background_centisecs is used along with the average bdi write bandwidth
> estimation to start background writeback.
> 
> One of the use cases to demonstrate the patch functionality is an NFS setup:
> we have an NFS setup with a 100Mbps Ethernet line, while a USB disk with a
> local write speed of 25MBps is attached to the server. Server and client are
> both ARM target boards.
> 
> Now if we perform a write operation over NFS (client to server), then as per
> the network speed, data can travel at a maximum of 100Mbps. But if we check
> the default write speed of the USB HDD over NFS, it comes out to around
> 8MB/sec, far below the speed of the network.
> 
> The reason is that, as per the NFS logic, during a write operation pages are
> initially dirtied on the NFS client side; then, after reaching the dirty
> threshold/writeback limit (or in case of sync), the data is actually sent to
> the NFS server (so now pages are dirtied again on the server side). This is
> done in the COMMIT call from client to server, i.e. if 100MB of data is
> dirtied and sent, it will take a minimum of 100MB/100Mbps ~ 8-9 seconds.
> 
> After the data is received, it will take approx 100MB/25MBps ~ 4 seconds to
> write the data to the USB HDD on the server side. This makes the overall time
> to write this much data ~12 seconds, which in practice comes out to be near
> 7 to 8MB/second. After this, a COMMIT response is sent to the NFS client.
> 
> However, we may improve this write performance by making use of the NFS
> server's idle time, i.e. while data is being received from the client,
> simultaneously initiate the writeback thread on the server side. So instead
> of waiting for the complete data to arrive and only then starting writeback,
> we can work in parallel while the network is still busy receiving the data.
> In this way the overall performance is improved.
> 
> If we tune dirty_background_centisecs, we can see an increase in the
> performance, and it comes out to be ~11MB/sec.
> Results are:-
> 
> Write test (create a 1 GB file) result at 'NFS client' after changing 
> /sys/block/sda/bdi/dirty_background_centisecs 
> on  *** NFS Server only - not on NFS Client 
> 
> ---------------------------------------------------------------------
> |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> ---------------------------------------------------------------------
> |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> ---------------------------------------------------------------------
> |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
> ---------------------------------------------------------------------
> |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
> | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
> |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
> |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
> |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
> |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
> |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
> |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
> |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
> |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
> ---------------------------------------------------------------------
> 
> We can see that the average write speed is increased to ~10-11MB/sec.
> 
> 
> This patch provides the changes per block device, so we may modify
> dirty_background_centisecs as per the device; the overall system is not
> impacted by the changes, and we get improved performance in certain use cases.
> 
> NOTE: dirty_background_centisecs is used along with the average bdi write
> bandwidth estimation to start background writeback. But, bdi write 
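
As a back-of-the-envelope check of the numbers reported in this thread
(illustrative arithmetic only, using the ~100Mbit/s network and ~25MB/s disk
from the test setup), serializing receive-then-write caps the client-visible
rate near 8MB/s, while overlapping writeback with the receive is bounded by
the slower stage, leaving room for the ~10-11MB/s measured with the knob set:

/* Simple model of the NFS server write path discussed above: serialized
 * (receive all the data, then write it out) vs. overlapped (writeback runs
 * while data is still being received). Illustrative numbers only. */
#include <stdio.h>

int main(void)
{
        const double size_mb  = 100.0;  /* data per COMMIT batch (example) */
        const double net_mbs  = 12.5;   /* ~100Mbit/s link */
        const double disk_mbs = 25.0;   /* local USB disk */

        double serialized = size_mb / net_mbs + size_mb / disk_mbs;              /* ~12s */
        double overlapped = size_mb / (net_mbs < disk_mbs ? net_mbs : disk_mbs); /* ~8s */

        printf("serialized: %.1fs -> ~%.1f MB/s\n", serialized, size_mb / serialized);
        printf("overlapped: %.1fs -> ~%.1f MB/s\n", overlapped, size_mb / overlapped);
        return 0;
}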
