Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-07 Thread Minchan Kim
On Fri, Aug 04, 2017 at 11:24:49AM -0700, Dan Williams wrote:
> On Fri, Aug 4, 2017 at 11:21 AM, Ross Zwisler  wrote:
> > On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
> >> [ adding Dave who is working on a blk-mq + dma offload version of the
> >> pmem driver ]
> >>
> >> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim  wrote:
> >> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> >> [..]
> >> >> Thanks for the testing. Are your numbers within the noise level?
> >> >>
> >> >> I can't quite understand why PMEM doesn't show much gain while BTT is a
> >> >> significant win (8%). I guess the no-rw_page BTT runs had more chances to
> >> >> wait on dynamic bio allocation, and both my patch and the rw_page runs
> >> >> reduced that significantly. With pmem, however, there were few waits on bio
> >> >> allocation because the device is so fast, so the numbers come purely from
> >> >> the instruction count. At a quick glance, bio init/submit is not trivial,
> >> >> so I do understand where the 12% improvement comes from, but I'm not sure
> >> >> it's a big enough difference in practice to justify the maintenance burden.
> >> >
> >> > I ran pmbench 10 times on my local machine (4 cores) with zram-swap.
> >> > On my machine the on-stack bio was even faster than rw_page, surprisingly.
> >> >
> >> > I guess it's really hard to get stable results under severe memory pressure.
> >> > The result is likely within the noise level (see the stddev below), so I
> >> > think it's hard to conclude that rw_page is far faster than the on-stack bio.
> >> >
> >> > rw_page
> >> > avg 5.54us
> >> > stddev  8.89%
> >> > max 6.02us
> >> > min 4.20us
> >> >
> >> > onstack bio
> >> > avg 5.27us
> >> > stddev  13.03%
> >> > max 5.96us
> >> > min 3.55us
> >>
> >> The maintenance burden of having alternative submission paths is
> >> significant, especially as we consider the pmem driver using more
> >> services of the core block layer. Ideally, I'd want to complete the
> >> rw_page removal work before we look at the blk-mq + dma offload
> >> reworks.
> >>
> >> The change to introduce BDI_CAP_SYNC is interesting because we might
> >> have use for switching between dma offload and cpu copy based on
> >> whether the I/O is synchronous or otherwise hinted to be a low-latency
> >> request. Right now the dma offload patches are using "bio_segments() >
> >> 1" as the gate for selecting offload vs cpu copy, which seems
> >> inadequate.
> >
> > Okay, so based on the feedback above and from Jens [1], it sounds like we want
> > to go forward with removing the rw_page() interface, and instead optimize the
> > regular I/O path via on-stack BIOs and dma offload, correct?
> >
> > If so, I'll prepare patches that fully remove the rw_page() code, and let
> > Minchan and Dave work on their optimizations.
> 
> I think the conversion to on-stack-bio should be done in the same
> patchset that removes rw_page. We don't want to leave a known
> performance regression while the on-stack-bio work is in-flight.

Okay. It seems everyone agrees on the on-stack-bio approach.
I will send my formal patchset, including Ross's patches that
remove rw_page.

Thanks.


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-04 Thread Dan Williams
On Fri, Aug 4, 2017 at 11:21 AM, Ross Zwisler  wrote:
> On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
>> [ adding Dave who is working on a blk-mq + dma offload version of the
>> pmem driver ]
>>
>> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim  wrote:
>> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
>> [..]
>> >> Thanks for the testing. Are your numbers within the noise level?
>> >>
>> >> I can't quite understand why PMEM doesn't show much gain while BTT is a
>> >> significant win (8%). I guess the no-rw_page BTT runs had more chances to
>> >> wait on dynamic bio allocation, and both my patch and the rw_page runs
>> >> reduced that significantly. With pmem, however, there were few waits on bio
>> >> allocation because the device is so fast, so the numbers come purely from
>> >> the instruction count. At a quick glance, bio init/submit is not trivial,
>> >> so I do understand where the 12% improvement comes from, but I'm not sure
>> >> it's a big enough difference in practice to justify the maintenance burden.
>> >
>> > I ran pmbench 10 times on my local machine (4 cores) with zram-swap.
>> > On my machine the on-stack bio was even faster than rw_page, surprisingly.
>> >
>> > I guess it's really hard to get stable results under severe memory pressure.
>> > The result is likely within the noise level (see the stddev below), so I
>> > think it's hard to conclude that rw_page is far faster than the on-stack bio.
>> >
>> > rw_page
>> > avg 5.54us
>> > stddev  8.89%
>> > max 6.02us
>> > min 4.20us
>> >
>> > onstack bio
>> > avg 5.27us
>> > stddev  13.03%
>> > max 5.96us
>> > min 3.55us
>>
>> The maintenance burden of having alternative submission paths is
>> significant, especially as we consider the pmem driver using more
>> services of the core block layer. Ideally, I'd want to complete the
>> rw_page removal work before we look at the blk-mq + dma offload
>> reworks.
>>
>> The change to introduce BDI_CAP_SYNC is interesting because we might
>> have use for switching between dma offload and cpu copy based on
>> whether the I/O is synchronous or otherwise hinted to be a low-latency
>> request. Right now the dma offload patches are using "bio_segments() >
>> 1" as the gate for selecting offload vs cpu copy, which seems
>> inadequate.
>
> Okay, so based on the feedback above and from Jens [1], it sounds like we want
> to go forward with removing the rw_page() interface, and instead optimize the
> regular I/O path via on-stack BIOs and dma offload, correct?
>
> If so, I'll prepare patches that fully remove the rw_page() code, and let
> Minchan and Dave work on their optimizations.

I think the conversion to on-stack-bio should be done in the same
patchset that removes rw_page. We don't want to leave a known
performance regression while the on-stack-bio work is in-flight.


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-04 Thread Ross Zwisler
On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
> [ adding Dave who is working on a blk-mq + dma offload version of the
> pmem driver ]
> 
> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim  wrote:
> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> [..]
> >> Thanks for the testing. Are your numbers within the noise level?
> >>
> >> I can't quite understand why PMEM doesn't show much gain while BTT is a
> >> significant win (8%). I guess the no-rw_page BTT runs had more chances to
> >> wait on dynamic bio allocation, and both my patch and the rw_page runs
> >> reduced that significantly. With pmem, however, there were few waits on bio
> >> allocation because the device is so fast, so the numbers come purely from
> >> the instruction count. At a quick glance, bio init/submit is not trivial,
> >> so I do understand where the 12% improvement comes from, but I'm not sure
> >> it's a big enough difference in practice to justify the maintenance burden.
> >
> > I ran pmbench 10 times on my local machine (4 cores) with zram-swap.
> > On my machine the on-stack bio was even faster than rw_page, surprisingly.
> >
> > I guess it's really hard to get stable results under severe memory pressure.
> > The result is likely within the noise level (see the stddev below), so I
> > think it's hard to conclude that rw_page is far faster than the on-stack bio.
> >
> > rw_page
> > avg 5.54us
> > stddev  8.89%
> > max 6.02us
> > min 4.20us
> >
> > onstack bio
> > avg 5.27us
> > stddev  13.03%
> > max 5.96us
> > min 3.55us
> 
> The maintenance burden of having alternative submission paths is
> significant, especially as we consider the pmem driver using more
> services of the core block layer. Ideally, I'd want to complete the
> rw_page removal work before we look at the blk-mq + dma offload
> reworks.
> 
> The change to introduce BDI_CAP_SYNC is interesting because we might
> have use for switching between dma offload and cpu copy based on
> whether the I/O is synchronous or otherwise hinted to be a low-latency
> request. Right now the dma offload patches are using "bio_segments() >
> 1" as the gate for selecting offload vs cpu copy, which seems
> inadequate.

Okay, so based on the feedback above and from Jens [1], it sounds like we want
to go forward with removing the rw_page() interface, and instead optimize the
regular I/O path via on-stack BIOs and dma offload, correct?

If so, I'll prepare patches that fully remove the rw_page() code, and let
Minchan and Dave work on their optimizations.

[1]: https://lkml.org/lkml/2017/8/3/803


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-04 Thread Dan Williams
[ adding Dave who is working on a blk-mq + dma offload version of the
pmem driver ]

On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim  wrote:
> On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
[..]
>> Thanks for the testing. Are your numbers within the noise level?
>>
>> I can't quite understand why PMEM doesn't show much gain while BTT is a
>> significant win (8%). I guess the no-rw_page BTT runs had more chances to
>> wait on dynamic bio allocation, and both my patch and the rw_page runs
>> reduced that significantly. With pmem, however, there were few waits on bio
>> allocation because the device is so fast, so the numbers come purely from
>> the instruction count. At a quick glance, bio init/submit is not trivial,
>> so I do understand where the 12% improvement comes from, but I'm not sure
>> it's a big enough difference in practice to justify the maintenance burden.
>
> I ran pmbench 10 times on my local machine (4 cores) with zram-swap.
> On my machine the on-stack bio was even faster than rw_page, surprisingly.
>
> I guess it's really hard to get stable results under severe memory pressure.
> The result is likely within the noise level (see the stddev below), so I
> think it's hard to conclude that rw_page is far faster than the on-stack bio.
>
> rw_page
> avg 5.54us
> stddev  8.89%
> max 6.02us
> min 4.20us
>
> onstack bio
> avg 5.27us
> stddev  13.03%
> max 5.96us
> min 3.55us

The maintenance burden of having alternative submission paths is
significant, especially as we consider the pmem driver using more
services of the core block layer. Ideally, I'd want to complete the
rw_page removal work before we look at the blk-mq + dma offload
reworks.

The change to introduce BDI_CAP_SYNC is interesting because we might
have use for switching between dma offload and cpu copy based on
whether the I/O is synchronous or otherwise hinted to be a low-latency
request. Right now the dma offload patches are using "bio_segments() >
1" as the gate for selecting offload vs cpu copy, which seems
inadequate.


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-04 Thread Minchan Kim
On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> On Thu, Aug 03, 2017 at 03:13:35PM -0600, Ross Zwisler wrote:
> > On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> > > Hi Ross,
> > > 
> > > On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > > > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > > > have a maintenance burden.  This series removes the rw_page()
> > > > > > implementations in brd, pmem and btt to relieve this burden.
> > > > > 
> > > > > Why don't you measure whether it has performance benefits?  I don't
> > > > > understand why zram would see performance benefits and not other drivers.
> > > > > If it's going to be removed, then the whole interface should be removed,
> > > > > not just have the implementations removed from some drivers.
> > > > 
> > > > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > > > points for rw_page() in a swap workload, and in all cases I do see an
> > > > improvement over the code when rw_page() is removed.  Here are the results
> > > > from my random lab box:
> > > >
> > > >   Average latency of swap_writepage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +------+------------+---------+-------------+
> > > > | PMEM |   5.0 us   |  4.7 us |          6% |
> > > > +------+------------+---------+-------------+
> > > > |  BTT |   6.8 us   |  6.1 us |         10% |
> > > > +------+------------+---------+-------------+
> > > >
> > > >   Average latency of swap_readpage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +------+------------+---------+-------------+
> > > > | PMEM |   3.3 us   |  2.9 us |         12% |
> > > > +------+------------+---------+-------------+
> > > > |  BTT |   3.7 us   |  3.4 us |          8% |
> > > > +------+------------+---------+-------------+
> > > > 
> > > > The workload was pmbench, a memory benchmark, run on a system where I had
> > > > severely restricted the amount of memory in the system with the 'mem' kernel
> > > > command line parameter.  The benchmark was set up to test more memory than I
> > > > allowed the OS to have so it spilled over into swap.
> > > >
> > > > The PMEM or BTT device was set up as my swap device, and during the test I got
> > > > a few hundred thousand samples of each of swap_writepage() and
> > > > swap_readpage().  The PMEM/BTT device was just memory reserved with the
> > > > memmap kernel command line parameter.
> > > >
> > > > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > > > code would have been a mistake.
> > > 
> > > At Christoph Hellwig's suggestion, I made a quick patch which does the swap
> > > IO without dynamic bio allocation. It's not yet a formal patch worth sending
> > > mainline, but I believe it's enough to test the improvement.
> > >
> > > Could you test the patchset on pmem and btt without rw_page?
> > >
> > > For the patch to take effect, block drivers need to declare themselves as
> > > synchronous IO devices via BDI_CAP_SYNC; if that's inconvenient, you can
> > > simply force every swap IO down the synchronous path by removing the
> > >
> > > if (!(sis->flags & SWP_SYNC_IO)) check in swap_[read|write]page.
> > >
> > > The patchset is based on 4.13-rc3.
> > 
> > Thanks for the patch, here are the updated results from my test box:
> >
> >  Average latency of swap_writepage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +------+------------+---------+---------+
> > | PMEM |   5.0 us   | 4.98 us |  4.7 us |
> > +------+------------+---------+---------+
> > |  BTT |   6.8 us   | 6.3 us  |  6.1 us |
> > +------+------------+---------+---------+
> >
> >  Average latency of swap_readpage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +------+------------+---------+---------+
> > | PMEM |   3.3 us   | 3.27 us |  2.9 us |
> > +------+------------+---------+---------+
> > |  BTT |   3.7 us   | 3.44 us |  3.4 us |
> > +------+------------+---------+---------+
> >
> > I've added another digit of precision in some cases to help differentiate the
> > various results.
> > 
> > In all cases your patches did perform better than with the regularly allocated
> > BIO, but again for all cases the rw_page() path was the fastest, even if only
> > marginally.

Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-03 Thread Minchan Kim
On Thu, Aug 03, 2017 at 10:05:44AM +0200, Christoph Hellwig wrote:
> FYI, for the read side we should use the on-stack bio unconditionally,
> as it will always be a win (or not show up at all).

Think about readahead. Unconditionally using an on-stack bio to read the pages
around the faulting address would cause latency spikes, so I want to use the
synchronous IO only if the device declares "Hey, I'm synchronous."
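To make that concrete, a minimal sketch of such a gate is below; SWP_SYNC_IO
comes from the RFC patch quoted later in this thread, and
swap_sync_read()/swap_async_read() are illustrative names, not real kernel
functions:

/*
 * Sketch only: take the synchronous on-stack-bio path just for devices that
 * advertised BDI_CAP_SYNC (propagated to SWP_SYNC_IO at swapon time in the
 * RFC patch); everything else keeps the existing asynchronous bio path so
 * readahead around the faulting page is not serialized.
 */
static int swap_read_one_page(struct swap_info_struct *sis, struct page *page)
{
	if (sis->flags & SWP_SYNC_IO)
		return swap_sync_read(sis, page);	/* on-stack bio, no allocation */

	return swap_async_read(sis, page);		/* bio_alloc() path, readahead-friendly */
}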


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-03 Thread Jens Axboe
On 08/03/2017 03:13 PM, Ross Zwisler wrote:
> On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
>> Hi Ross,
>>
>> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
>>> On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
>>>> On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
>>>>> Dan Williams and Christoph Hellwig have recently expressed doubt about
>>>>> whether the rw_page() interface made sense for synchronous memory drivers
>>>>> [1][2].  It's unclear whether this interface has any performance benefit
>>>>> for these drivers, but as we continue to fix bugs it is clear that it does
>>>>> have a maintenance burden.  This series removes the rw_page()
>>>>> implementations in brd, pmem and btt to relieve this burden.
>>>>
>>>> Why don't you measure whether it has performance benefits?  I don't
>>>> understand why zram would see performance benefits and not other drivers.
>>>> If it's going to be removed, then the whole interface should be removed,
>>>> not just have the implementations removed from some drivers.
>>>
>>> Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
>>> points for rw_page() in a swap workload, and in all cases I do see an
>>> improvement over the code when rw_page() is removed.  Here are the results
>>> from my random lab box:
>>>
>>>   Average latency of swap_writepage()
>>> +------+------------+---------+-------------+
>>> |      | no rw_page | rw_page | Improvement |
>>> +------+------------+---------+-------------+
>>> | PMEM |   5.0 us   |  4.7 us |          6% |
>>> +------+------------+---------+-------------+
>>> |  BTT |   6.8 us   |  6.1 us |         10% |
>>> +------+------------+---------+-------------+
>>>
>>>   Average latency of swap_readpage()
>>> +------+------------+---------+-------------+
>>> |      | no rw_page | rw_page | Improvement |
>>> +------+------------+---------+-------------+
>>> | PMEM |   3.3 us   |  2.9 us |         12% |
>>> +------+------------+---------+-------------+
>>> |  BTT |   3.7 us   |  3.4 us |          8% |
>>> +------+------------+---------+-------------+
>>>
>>> The workload was pmbench, a memory benchmark, run on a system where I had
>>> severely restricted the amount of memory in the system with the 'mem' kernel
>>> command line parameter.  The benchmark was set up to test more memory than I
>>> allowed the OS to have so it spilled over into swap.
>>>
>>> The PMEM or BTT device was set up as my swap device, and during the test I got
>>> a few hundred thousand samples of each of swap_writepage() and
>>> swap_readpage().  The PMEM/BTT device was just memory reserved with the
>>> memmap kernel command line parameter.
>>>
>>> Thanks, Matthew, for asking for performance data.  It looks like removing this
>>> code would have been a mistake.
>>
>> At Christoph Hellwig's suggestion, I made a quick patch which does the swap
>> IO without dynamic bio allocation. It's not yet a formal patch worth sending
>> mainline, but I believe it's enough to test the improvement.
>>
>> Could you test the patchset on pmem and btt without rw_page?
>>
>> For the patch to take effect, block drivers need to declare themselves as
>> synchronous IO devices via BDI_CAP_SYNC; if that's inconvenient, you can
>> simply force every swap IO down the synchronous path by removing the
>>
>> if (!(sis->flags & SWP_SYNC_IO)) check in swap_[read|write]page.
>>
>> The patchset is based on 4.13-rc3.
> 
> Thanks for the patch, here are the updated results from my test box:
> 
>  Average latency of swap_writepage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +------+------------+---------+---------+
> | PMEM |   5.0 us   | 4.98 us |  4.7 us |
> +------+------------+---------+---------+
> |  BTT |   6.8 us   | 6.3 us  |  6.1 us |
> +------+------------+---------+---------+
> 
>  Average latency of swap_readpage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +------+------------+---------+---------+
> | PMEM |   3.3 us   | 3.27 us |  2.9 us |
> +------+------------+---------+---------+
> |  BTT |   3.7 us   | 3.44 us |  3.4 us |
> +------+------------+---------+---------+
> 
> I've added another digit of precision in some cases to help differentiate the
> various results.
> 
> In all cases your patches did perform better than with the regularly allocated
> BIO, but again for all cases the rw_page() path was the fastest, even if only
> marginally.

IMHO, the win needs to be pretty substantial to justify keeping a
parallel read/write path in the kernel. The recent work of making
O_DIRECT faster is exactly the same as what Minchan did here for sync
IO. I would greatly prefer one fast path, instead of one fast and one
that's just a little faster for some things. It's much better to get
everyone behind one path/stack, and make that as fast as it can be.

-- 
Jens Axboe


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-03 Thread Ross Zwisler
On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> Hi Ross,
> 
> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > have a maintenance burden.  This series removes the rw_page()
> > > > implementations in brd, pmem and btt to relieve this burden.
> > > 
> > > Why don't you measure whether it has performance benefits?  I don't
> > > understand why zram would see performance benefits and not other drivers.
> > > If it's going to be removed, then the whole interface should be removed,
> > > not just have the implementations removed from some drivers.
> > 
> > Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> > points for rw_page() in a swap workload, and in all cases I do see an
> > improvement over the code when rw_page() is removed.  Here are the results
> > from my random lab box:
> >
> >   Average latency of swap_writepage()
> > +------+------------+---------+-------------+
> > |      | no rw_page | rw_page | Improvement |
> > +------+------------+---------+-------------+
> > | PMEM |   5.0 us   |  4.7 us |          6% |
> > +------+------------+---------+-------------+
> > |  BTT |   6.8 us   |  6.1 us |         10% |
> > +------+------------+---------+-------------+
> >
> >   Average latency of swap_readpage()
> > +------+------------+---------+-------------+
> > |      | no rw_page | rw_page | Improvement |
> > +------+------------+---------+-------------+
> > | PMEM |   3.3 us   |  2.9 us |         12% |
> > +------+------------+---------+-------------+
> > |  BTT |   3.7 us   |  3.4 us |          8% |
> > +------+------------+---------+-------------+
> >
> > The workload was pmbench, a memory benchmark, run on a system where I had
> > severely restricted the amount of memory in the system with the 'mem' kernel
> > command line parameter.  The benchmark was set up to test more memory than I
> > allowed the OS to have so it spilled over into swap.
> >
> > The PMEM or BTT device was set up as my swap device, and during the test I got
> > a few hundred thousand samples of each of swap_writepage() and
> > swap_readpage().  The PMEM/BTT device was just memory reserved with the
> > memmap kernel command line parameter.
> >
> > Thanks, Matthew, for asking for performance data.  It looks like removing this
> > code would have been a mistake.
> 
> At Christoph Hellwig's suggestion, I made a quick patch which does the swap
> IO without dynamic bio allocation. It's not yet a formal patch worth sending
> mainline, but I believe it's enough to test the improvement.
> 
> Could you test the patchset on pmem and btt without rw_page?
> 
> For the patch to take effect, block drivers need to declare themselves as
> synchronous IO devices via BDI_CAP_SYNC; if that's inconvenient, you can
> simply force every swap IO down the synchronous path by removing the
> 
> if (!(sis->flags & SWP_SYNC_IO)) check in swap_[read|write]page.
> 
> The patchset is based on 4.13-rc3.

Thanks for the patch, here are the updated results from my test box:

 Average latency of swap_writepage()
+------+------------+---------+---------+
|      | no rw_page | minchan | rw_page |
+------+------------+---------+---------+
| PMEM |   5.0 us   | 4.98 us |  4.7 us |
+------+------------+---------+---------+
|  BTT |   6.8 us   | 6.3 us  |  6.1 us |
+------+------------+---------+---------+

 Average latency of swap_readpage()
+------+------------+---------+---------+
|      | no rw_page | minchan | rw_page |
+------+------------+---------+---------+
| PMEM |   3.3 us   | 3.27 us |  2.9 us |
+------+------------+---------+---------+
|  BTT |   3.7 us   | 3.44 us |  3.4 us |
+------+------------+---------+---------+

I've added another digit of precision in some cases to help differentiate the
various results.

In all cases your patches did perform better than with the regularly allocated
BIO, but again for all cases the rw_page() path was the fastest, even if only
marginally.

- Ross


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-03 Thread Christoph Hellwig
FYI, for the read side we should use the on-stack bio unconditionally,
as it will always be a win (or not show up at all).


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-02 Thread Dan Williams
[ adding Tim and Ying who have also been looking at swap optimization
and rw_page interactions ]

On Wed, Aug 2, 2017 at 5:13 PM, Minchan Kim  wrote:
> Hi Ross,
>
> On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
>> On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
>> > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
>> > > Dan Williams and Christoph Hellwig have recently expressed doubt about
>> > > whether the rw_page() interface made sense for synchronous memory drivers
>> > > [1][2].  It's unclear whether this interface has any performance benefit
>> > > for these drivers, but as we continue to fix bugs it is clear that it does
>> > > have a maintenance burden.  This series removes the rw_page()
>> > > implementations in brd, pmem and btt to relieve this burden.
>> >
>> > Why don't you measure whether it has performance benefits?  I don't
>> > understand why zram would see performance benefits and not other drivers.
>> > If it's going to be removed, then the whole interface should be removed,
>> > not just have the implementations removed from some drivers.
>>
>> Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
>> points for rw_page() in a swap workload, and in all cases I do see an
>> improvement over the code when rw_page() is removed.  Here are the results
>> from my random lab box:
>>
>>   Average latency of swap_writepage()
>> +------+------------+---------+-------------+
>> |      | no rw_page | rw_page | Improvement |
>> +------+------------+---------+-------------+
>> | PMEM |   5.0 us   |  4.7 us |          6% |
>> +------+------------+---------+-------------+
>> |  BTT |   6.8 us   |  6.1 us |         10% |
>> +------+------------+---------+-------------+
>>
>>   Average latency of swap_readpage()
>> +------+------------+---------+-------------+
>> |      | no rw_page | rw_page | Improvement |
>> +------+------------+---------+-------------+
>> | PMEM |   3.3 us   |  2.9 us |         12% |
>> +------+------------+---------+-------------+
>> |  BTT |   3.7 us   |  3.4 us |          8% |
>> +------+------------+---------+-------------+
>>
>> The workload was pmbench, a memory benchmark, run on a system where I had
>> severely restricted the amount of memory in the system with the 'mem' kernel
>> command line parameter.  The benchmark was set up to test more memory than I
>> allowed the OS to have so it spilled over into swap.
>>
>> The PMEM or BTT device was set up as my swap device, and during the test I got
>> a few hundred thousand samples of each of swap_writepage() and
>> swap_readpage().  The PMEM/BTT device was just memory reserved with the
>> memmap kernel command line parameter.
>>
>> Thanks, Matthew, for asking for performance data.  It looks like removing this
>> code would have been a mistake.
>
> At Christoph Hellwig's suggestion, I made a quick patch which does the swap
> IO without dynamic bio allocation. It's not yet a formal patch worth sending
> mainline, but I believe it's enough to test the improvement.
>
> Could you test the patchset on pmem and btt without rw_page?
>
> For the patch to take effect, block drivers need to declare themselves as
> synchronous IO devices via BDI_CAP_SYNC; if that's inconvenient, you can
> simply force every swap IO down the synchronous path by removing the
>
> if (!(sis->flags & SWP_SYNC_IO)) check in swap_[read|write]page.
>
> The patchset is based on 4.13-rc3.
>
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 856d5dc02451..b1c5e9bf3ad5 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -125,9 +125,9 @@ static inline bool is_partial_io(struct bio_vec *bvec)
>  static void zram_revalidate_disk(struct zram *zram)
>  {
> revalidate_disk(zram->disk);
> -   /* revalidate_disk reset the BDI_CAP_STABLE_WRITES so set again */
> +   /* revalidate_disk reset the BDI capability so set again */
> zram->disk->queue->backing_dev_info->capabilities |=
> -   BDI_CAP_STABLE_WRITES;
> +   (BDI_CAP_STABLE_WRITES|BDI_CAP_SYNC);
>  }
>
>  /*
> @@ -1096,7 +1096,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
>  static const struct block_device_operations zram_devops = {
> .open = zram_open,
> .swap_slot_free_notify = zram_slot_free_notify,
> -   .rw_page = zram_rw_page,
> +   // .rw_page = zram_rw_page,
> .owner = THIS_MODULE
>  };
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 854e1bdd0b2a..05eee145d964 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -130,6 +130,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
>  #define BDI_CAP_STABLE_WRITES  0x0008
>  #define BDI_CAP_STRICTLIMIT    0x0010
>  #define BDI_CAP_CGROUP_WRITEBACK 0x0020
> +#define BDI_CAP_SYNC   0x0040

Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-02 Thread Minchan Kim
Hi Ross,

On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > whether the rw_page() interface made sense for synchronous memory drivers
> > > [1][2].  It's unclear whether this interface has any performance benefit
> > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > have a maintenance burden.  This series removes the rw_page()
> > > implementations in brd, pmem and btt to relieve this burden.
> > 
> > Why don't you measure whether it has performance benefits?  I don't
> > understand why zram would see performance benefits and not other drivers.
> > If it's going to be removed, then the whole interface should be removed,
> > not just have the implementations removed from some drivers.
> 
> Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
> points for rw_page() in a swap workload, and in all cases I do see an
> improvement over the code when rw_page() is removed.  Here are the results
> from my random lab box:
> 
>   Average latency of swap_writepage()
> +------+------------+---------+-------------+
> |      | no rw_page | rw_page | Improvement |
> +------+------------+---------+-------------+
> | PMEM |   5.0 us   |  4.7 us |          6% |
> +------+------------+---------+-------------+
> |  BTT |   6.8 us   |  6.1 us |         10% |
> +------+------------+---------+-------------+
> 
>   Average latency of swap_readpage()
> +------+------------+---------+-------------+
> |      | no rw_page | rw_page | Improvement |
> +------+------------+---------+-------------+
> | PMEM |   3.3 us   |  2.9 us |         12% |
> +------+------------+---------+-------------+
> |  BTT |   3.7 us   |  3.4 us |          8% |
> +------+------------+---------+-------------+
> 
> The workload was pmbench, a memory benchmark, run on a system where I had
> severely restricted the amount of memory in the system with the 'mem' kernel
> command line parameter.  The benchmark was set up to test more memory than I
> allowed the OS to have so it spilled over into swap.
> 
> The PMEM or BTT device was set up as my swap device, and during the test I got
> a few hundred thousand samples of each of swap_writepage() and
> swap_readpage().  The PMEM/BTT device was just memory reserved with the
> memmap kernel command line parameter.
> 
> Thanks, Matthew, for asking for performance data.  It looks like removing this
> code would have been a mistake.

At Christoph Hellwig's suggestion, I made a quick patch which does the swap IO
without dynamic bio allocation. It's not yet a formal patch worth sending
mainline, but I believe it's enough to test the improvement.

Could you test the patchset on pmem and btt without rw_page?

For the patch to take effect, block drivers need to declare themselves as
synchronous IO devices via BDI_CAP_SYNC; if that's inconvenient, you can
simply force every swap IO down the synchronous path by removing the

if (!(sis->flags & SWP_SYNC_IO)) check in swap_[read|write]page.

The patchset is based on 4.13-rc3.


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 856d5dc02451..b1c5e9bf3ad5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -125,9 +125,9 @@ static inline bool is_partial_io(struct bio_vec *bvec)
 static void zram_revalidate_disk(struct zram *zram)
 {
revalidate_disk(zram->disk);
-   /* revalidate_disk reset the BDI_CAP_STABLE_WRITES so set again */
+   /* revalidate_disk reset the BDI capability so set again */
zram->disk->queue->backing_dev_info->capabilities |=
-   BDI_CAP_STABLE_WRITES;
+   (BDI_CAP_STABLE_WRITES|BDI_CAP_SYNC);
 }
 
 /*
@@ -1096,7 +1096,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
 static const struct block_device_operations zram_devops = {
.open = zram_open,
.swap_slot_free_notify = zram_slot_free_notify,
-   .rw_page = zram_rw_page,
+   // .rw_page = zram_rw_page,
.owner = THIS_MODULE
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..05eee145d964 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -130,6 +130,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_STABLE_WRITES  0x0008
 #define BDI_CAP_STRICTLIMIT    0x0010
 #define BDI_CAP_CGROUP_WRITEBACK 0x0020
+#define BDI_CAP_SYNC   0x0040
 
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -177,6 +178,11 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
void __user 

Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-08-02 Thread Ross Zwisler
On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > whether the rw_page() interface made sense for synchronous memory drivers
> > [1][2].  It's unclear whether this interface has any performance benefit
> > for these drivers, but as we continue to fix bugs it is clear that it does
> > have a maintenance burden.  This series removes the rw_page()
> > implementations in brd, pmem and btt to relieve this burden.
> 
> Why don't you measure whether it has performance benefits?  I don't
> understand why zram would see performance benefits and not other drivers.
> If it's going to be removed, then the whole interface should be removed,
> not just have the implementations removed from some drivers.

Okay, I've run a bunch of performance tests with the PMEM and with BTT entry
points for rw_page() in a swap workload, and in all cases I do see an
improvement over the code when rw_page() is removed.  Here are the results
from my random lab box:

  Average latency of swap_writepage()
+------+------------+---------+-------------+
|      | no rw_page | rw_page | Improvement |
+------+------------+---------+-------------+
| PMEM |   5.0 us   |  4.7 us |          6% |
+------+------------+---------+-------------+
|  BTT |   6.8 us   |  6.1 us |         10% |
+------+------------+---------+-------------+

  Average latency of swap_readpage()
+------+------------+---------+-------------+
|      | no rw_page | rw_page | Improvement |
+------+------------+---------+-------------+
| PMEM |   3.3 us   |  2.9 us |         12% |
+------+------------+---------+-------------+
|  BTT |   3.7 us   |  3.4 us |          8% |
+------+------------+---------+-------------+

The workload was pmbench, a memory benchmark, run on a system where I had
severely restricted the amount of memory in the system with the 'mem' kernel
command line parameter.  The benchmark was set up to test more memory than I
allowed the OS to have so it spilled over into swap.

The PMEM or BTT device was set up as my swap device, and during the test I got
a few hundred thousand samples of each of swap_writepage() and
swap_readpage().  The PMEM/BTT device was just memory reserved with the
memmap kernel command line parameter.

Thanks, Matthew, for asking for performance data.  It looks like removing this
code would have been a mistake.


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-31 Thread Christoph Hellwig
On Mon, Jul 31, 2017 at 09:42:06AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 31, 2017 at 04:36:47PM +0900, Minchan Kim wrote:
> > Are you suggesting we define a special flag (e.g., SWP_INMEMORY) in
> > swap_info_struct for in-memory swap, set either manually at swapon time
> > or automatically from something like bdi_queue_something? And then,
> > depending on that flag of swap_info_struct, use an on-stack bio instead
> > of dynamic allocation when the swap device is in-memory?
> 
> Currently swap always just does I/O on a single page as far
> as I can tell, so it can always just use an on-stack bio and
> biovec.

That's for synchronous I/O, aka reads of course.  For writes you'll
need to do a dynamic allocation if they are asynchronous.  But yes,
if we want to force certain devices to be synchronous we'll need
a flag for that.
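One way such a flag could be wired up, sketched under the assumption of the
BDI_CAP_SYNC capability and SWP_SYNC_IO swap flag from Minchan's RFC (the
helper name is illustrative, not an existing kernel function):

#include <linux/backing-dev.h>
#include <linux/blkdev.h>
#include <linux/swap.h>

/*
 * Sketch: at swapon time, copy the device's synchronous capability into
 * swap_info_struct so the I/O hot path only has to test one bit.
 */
static void swap_note_synchronous_bdev(struct swap_info_struct *sis,
				       struct block_device *bdev)
{
	struct backing_dev_info *bdi = bdev->bd_queue->backing_dev_info;

	if (bdi->capabilities & BDI_CAP_SYNC)	/* set by zram/pmem/btt */
		sis->flags |= SWP_SYNC_IO;	/* reads may use an on-stack bio */
}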


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-31 Thread Christoph Hellwig
On Mon, Jul 31, 2017 at 04:36:47PM +0900, Minchan Kim wrote:
> Are you suggesting we define a special flag (e.g., SWP_INMEMORY) in
> swap_info_struct for in-memory swap, set either manually at swapon time
> or automatically from something like bdi_queue_something? And then,
> depending on that flag of swap_info_struct, use an on-stack bio instead
> of dynamic allocation when the swap device is in-memory?

Currently swap always just does I/O on a single page as far
as I can tell, so it can always just use an on-stack bio and
biovec.


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-31 Thread Minchan Kim
On Mon, Jul 31, 2017 at 09:17:07AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> > rw_page's gain is that it avoids dynamic allocation in the swap path,
> > as well as the performance gain from skipping bio allocation.
> > And that is important under memory pressure.
> 
> There is no need for any dynamic allocation when using the bio
> path.  Take a look at __blkdev_direct_IO_simple for an example
> that doesn't do any allocations.

Are you suggesting we define a special flag (e.g., SWP_INMEMORY) in
swap_info_struct for in-memory swap, set either manually at swapon time
or automatically from something like bdi_queue_something? And then,
depending on that flag of swap_info_struct, use an on-stack bio instead
of dynamic allocation when the swap device is in-memory?


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-31 Thread Christoph Hellwig
On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> rw_page's gain is that it avoids dynamic allocation in the swap path,
> as well as the performance gain from skipping bio allocation.
> And that is important under memory pressure.

There is no need for any dynamic allocation when using the bio
path.  Take a look at __blkdev_direct_IO_simple for an example
that doesn't do any allocations.
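As a rough illustration of that pattern, here is a minimal sketch of a
synchronous single-page read with an on-stack bio and biovec, in the spirit of
__blkdev_direct_IO_simple (4.13-era bio API assumed; not a drop-in patch):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* On 4.13 the device is set via bio.bi_bdev; newer kernels use bio_set_dev(). */
static int read_page_sync(struct block_device *bdev, sector_t sector,
			  struct page *page)
{
	struct bio bio;
	struct bio_vec bvec;

	bio_init(&bio, &bvec, 1);		/* no bio_alloc(), no mempool */
	bio.bi_bdev = bdev;
	bio.bi_iter.bi_sector = sector;
	bio_add_page(&bio, page, PAGE_SIZE, 0);
	bio_set_op_attrs(&bio, REQ_OP_READ, 0);

	/* the caller sleeps until completion, so the stack bio stays valid */
	return submit_bio_wait(&bio);
}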


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-30 Thread Minchan Kim
On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> Hi Andrew,
> 
> On Fri, Jul 28, 2017 at 02:21:23PM -0700, Andrew Morton wrote:
> > On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox  wrote:
> > 
> > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > have a maintenance burden.  This series removes the rw_page()
> > > > implementations in brd, pmem and btt to relieve this burden.
> > > 
> > > Why don't you measure whether it has performance benefits?  I don't
> > > understand why zram would see performance benefits and not other drivers.
> > > If it's going to be removed, then the whole interface should be removed,
> > > not just have the implementations removed from some drivers.
> > 
> > Yes please.  Minchan, could you please take a look sometime?
> 
> rw_page's gain is that it avoids dynamic allocation in the swap path,
> as well as the performance gain from skipping bio allocation.
> And that is important under memory pressure.
> 
> I guess the cost comes from the bio_alloc mempool. zram-swap usually runs
> under high memory pressure, so the mempool is easily exhausted, which means
> that waiting on the mempool and repeated allocation attempts add overhead.
> 
> Actually, although Karam reported a 2.4% gain at the time, I got a report
> from the production team that the gain in corner cases (e.g., keeping
> animation playback smooth) was much higher than expected.

One idea is to create a bioset just for swap, not shared with the FS, so
that bio allocation for swap never waits for bios to be returned to the
mempool from the FS side, which does slow NAND IO.
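A minimal sketch of that idea, assuming the 4.13-era bioset_create()
interface (the names are illustrative; nothing like this is in mainline):

#include <linux/bio.h>
#include <linux/init.h>

static struct bio_set *swap_bio_set;	/* hypothetical, swap-private pool */

static int __init swap_bioset_init(void)
{
	/* not shared with filesystem I/O, so FS-side completions can't stall swap */
	swap_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
	return swap_bio_set ? 0 : -ENOMEM;
}

static struct bio *swap_bio_alloc(gfp_t gfp_flags)
{
	/* swap I/O is a single page, so one biovec is enough */
	return bio_alloc_bioset(gfp_flags, 1, swap_bio_set);
}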


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-30 Thread Minchan Kim
Hi Andrew,

On Fri, Jul 28, 2017 at 02:21:23PM -0700, Andrew Morton wrote:
> On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox  wrote:
> 
> > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > whether the rw_page() interface made sense for synchronous memory drivers
> > > [1][2].  It's unclear whether this interface has any performance benefit
> > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > have a maintenance burden.  This series removes the rw_page()
> > > implementations in brd, pmem and btt to relieve this burden.
> > 
> > Why don't you measure whether it has performance benefits?  I don't
> > understand why zram would see performance benefits and not other drivers.
> > If it's going to be removed, then the whole interface should be removed,
> > not just have the implementations removed from some drivers.
> 
> Yes please.  Minchan, could you please take a look sometime?

rw_page's gain is that it avoids dynamic allocation in the swap path,
as well as the performance gain from skipping bio allocation.
And that is important under memory pressure.

I guess the cost comes from the bio_alloc mempool. zram-swap usually runs
under high memory pressure, so the mempool is easily exhausted, which means
that waiting on the mempool and repeated allocation attempts add overhead.

Actually, although Karam reported a 2.4% gain at the time, I got a report
from the production team that the gain in corner cases (e.g., keeping
animation playback smooth) was much higher than expected.


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-28 Thread Andrew Morton
On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox  wrote:

> On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > whether the rw_page() interface made sense for synchronous memory drivers
> > [1][2].  It's unclear whether this interface has any performance benefit
> > for these drivers, but as we continue to fix bugs it is clear that it does
> > have a maintenance burden.  This series removes the rw_page()
> > implementations in brd, pmem and btt to relieve this burden.
> 
> Why don't you measure whether it has performance benefits?  I don't
> understand why zram would see performance benefits and not other drivers.
> If it's going to be removed, then the whole interface should be removed,
> not just have the implementations removed from some drivers.

Yes please.  Minchan, could you please take a look sometime?


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-28 Thread Matthew Wilcox
On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> Dan Williams and Christoph Hellwig have recently expressed doubt about
> whether the rw_page() interface made sense for synchronous memory drivers
> [1][2].  It's unclear whether this interface has any performance benefit
> for these drivers, but as we continue to fix bugs it is clear that it does
> have a maintenance burden.  This series removes the rw_page()
> implementations in brd, pmem and btt to relieve this burden.

Why don't you measure whether it has performance benefits?  I don't
understand why zram would see performance benefits and not other drivers.
If it's going to be removed, then the whole interface should be removed,
not just have the implementations removed from some drivers.