Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-12 Thread Alexei Starovoitov
On Tue, Apr 12, 2016 at 08:16:49AM +0200, Jesper Dangaard Brouer wrote:
> 
> On Mon, 11 Apr 2016 15:21:26 -0700
> Alexei Starovoitov  wrote:
> 
> > On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:
> > >   
> [...]
> > > > 
> > > > If we go down this road how about also attaching some driver opaques
> > > > to the page sets?  
> > > 
> > > That was the ultimate plan... to leave some opaque bytes in the
> > > page struct that drivers could use.
> > > 
> > > In struct page I would need a pointer back to my page_pool struct and a
> > > page flag.  Then, I would need room to store the dma_unmap address.
> > > (And then some of the usual fields are still needed, like the refcnt,
> > > and reusing some of the list constructs).  And a zero-copy cross-domain
> > > id.  
> > 
> > I don't think we need to add anything to struct page.
> > This is supposed to be a small cache of dma_mapped pages with lockless access.
> > It can be implemented as an array or linked list where every element
> > is a dma_addr and a pointer to a page. If it is full, dma_unmap_page+put_page to
> > send it back to the page allocator.
> 
> It sounds like the Intel drivers' recycle facility, where they split the
> page into two parts and keep the page in the RX-ring by swapping to the
> other half of the page if page_count(page) is <= 2.  Thus, they use the
> atomic page refcount to synchronize on.

Actually I'm proposing the opposite: one page = one packet.
I'm perfectly happy to waste half a page, since the number of such pages is small
and performance matters more. Typical performance vs memory tradeoff.

> Thus, we end up having two atomic operations per RX packet on the page
> refcnt, where DPDK has zero...

The page recycling cache should have zero atomic ops per packet,
otherwise it's a non-starter.

> By fully taking over the page as an allocator, almost like slab, I can
> optimize the common case (of the packet-page getting allocated and
> freed on the same CPU) and remove these atomic operations.

SLUB is doing a local cmpxchg. 40G networking cannot afford it per packet.
If it's amortized due to batching, that will be ok.



Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-12 Thread Alexander Duyck
On Mon, Apr 11, 2016 at 11:28 PM, Jesper Dangaard Brouer
 wrote:
>
> On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck 
>  wrote:
>
>> Have you taken a look at possibly trying to optimize the DMA pool API
>> to work with pages?  It sounds like it is supposed to do something
>> similar to what you are wanting to do.
>
> Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA
> coherent memory (see use of dma_alloc_coherent/dma_free_coherent).
>
> What we are using is "streaming" DMA mappings when processing the RX
> ring.
>
> (NICs only use DMA coherent memory for the descriptors, which are
> allocated at driver init.)

Yes, I know that but it shouldn't take much to extend the API to
provide the option for a streaming DMA mapping.  That was why I
thought you might want to look in this direction.

- Alex


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-12 Thread Jesper Dangaard Brouer

On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck  
wrote:

> Have you taken a look at possibly trying to optimize the DMA pool API
> to work with pages?  It sounds like it is supposed to do something
> similar to what you are wanting to do.

Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA
coherent memory (see use of dma_alloc_coherent/dma_free_coherent). 

What we are using is "streaming" DMA mappings when processing the RX
ring.

(NICs only use DMA coherent memory for the descriptors, which are
allocated at driver init.)
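
To make the distinction concrete, a minimal sketch contrasting the two
APIs (the device pointer, buffer size and names are placeholders, not
taken from any real driver):

/*
 * Hedged sketch contrasting coherent dma_pool memory (descriptors)
 * with streaming DMA mappings (RX data pages).  Illustration only.
 */
#include <linux/dmapool.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

#define RX_BUF_SIZE 2048

static int dma_api_example(struct device *my_dev)
{
        struct dma_pool *pool;
        void *desc;
        dma_addr_t desc_dma, buf_dma;
        struct page *page;

        /* Coherent memory: what mm/dmapool.c hands out (think descriptors). */
        pool = dma_pool_create("rx-desc", my_dev, 64, 64, 0);
        if (!pool)
                return -ENOMEM;
        desc = dma_pool_alloc(pool, GFP_KERNEL, &desc_dma);

        /* Streaming mapping: what RX data pages use while on the ring. */
        page = alloc_page(GFP_ATOMIC);
        if (page) {
                buf_dma = dma_map_page(my_dev, page, 0, RX_BUF_SIZE,
                                       DMA_FROM_DEVICE);
                if (!dma_mapping_error(my_dev, buf_dma)) {
                        /* ... hand buf_dma to the NIC; on completion: */
                        dma_unmap_page(my_dev, buf_dma, RX_BUF_SIZE,
                                       DMA_FROM_DEVICE);
                }
                __free_page(page);
        }

        if (desc)
                dma_pool_free(pool, desc, desc_dma);
        dma_pool_destroy(pool);
        return 0;
}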
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-12 Thread Jesper Dangaard Brouer

On Mon, 11 Apr 2016 15:21:26 -0700
Alexei Starovoitov  wrote:

> On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:
> >   
[...]
> > > 
> > > If we go down this road how about also attaching some driver opaques
> > > to the page sets?  
> > 
> > That was the ultimate plan... to leave some opaque bytes in the
> > page struct that drivers could use.
> > 
> > In struct page I would need a pointer back to my page_pool struct and a
> > page flag.  Then, I would need room to store the dma_unmap address.
> > (And then some of the usual fields are still needed, like the refcnt,
> > and reusing some of the list constructs).  And a zero-copy cross-domain
> > id.  
> 
> I don't think we need to add anything to struct page.
> This is supposed to be a small cache of dma_mapped pages with lockless access.
> It can be implemented as an array or linked list where every element
> is a dma_addr and a pointer to a page. If it is full, dma_unmap_page+put_page to
> send it back to the page allocator.

It sounds like the Intel drivers' recycle facility, where they split the
page into two parts and keep the page in the RX-ring by swapping to the
other half of the page if page_count(page) is <= 2.  Thus, they use the
atomic page refcount to synchronize on.
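
Roughly, the refcount part of that trick looks like the following
simplified sketch (struct and field names are invented; this is not the
actual ixgbe/i40e code):

/*
 * Simplified sketch of the half-page recycle idea described above.
 * Not the real Intel driver code.
 */
#include <linux/mm.h>

struct rx_buf {
        struct page *page;
        unsigned int page_offset;       /* flips between 0 and PAGE_SIZE/2 */
        dma_addr_t dma;                 /* mapping kept across recycles    */
};

static bool rx_buf_try_recycle(struct rx_buf *buf)
{
        /* Only reuse if the driver plus (at most) one skb hold references. */
        if (page_count(buf->page) > 2)
                return false;           /* stack still owns part of the page */

        /* Flip to the other half; page and DMA mapping stay on the ring. */
        buf->page_offset ^= PAGE_SIZE / 2;
        return true;
}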

Thus, we end up having two atomic operations per RX packet on the page
refcnt, where DPDK has zero...

By fully taking over the page as an allocator, almost like slab, I can
optimize the common case (of the packet-page getting allocated and
freed on the same CPU) and remove these atomic operations.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Alexei Starovoitov
On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:
> 
> > >> This is also very interesting for storage targets, which face the same
> > >> issue.  SCST has a mode where it caches some fully constructed SGLs,
> > >> which is probably very similar to what NICs want to do.  
> > >
> > > I think a cached allocator for page sets + the scatterlists that
> > > describe these page sets would not only be useful for SCSI target
> > > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > > code reserves space in each scsi_cmnd for a scatterlist of
> > > SCSI_MAX_SG_SEGMENTS. If scatterlists were cached together with page
> > > sets, less memory would be needed per scsi_cmnd.
> > 
> > If we go down this road how about also attaching some driver opaques
> > to the page sets?
> 
> That was the ultimate plan... to leave some opaque bytes in the
> page struct that drivers could use.
> 
> In struct page I would need a pointer back to my page_pool struct and a
> page flag.  Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs).  And a zero-copy cross-domain
> id.

I don't think we need to add anything to struct page.
This is supposed to be a small cache of dma_mapped pages with lockless access.
It can be implemented as an array or linked list where every element
is a dma_addr and a pointer to a page. If it is full, dma_unmap_page+put_page to
send it back to the page allocator.
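
A minimal sketch of such a cache, assuming it is per-CPU so that no
locks or atomics are needed on the fast path (names are invented; this
is not an existing kernel API):

/*
 * Illustrative sketch of the small cache of DMA-mapped pages described
 * above: an array of {page, dma_addr} entries, falling back to
 * dma_unmap_page + put_page when the cache is full.
 */
#include <linux/mm.h>
#include <linux/dma-mapping.h>

#define PAGE_CACHE_DEPTH 256

struct dma_page_ent {
        struct page *page;
        dma_addr_t dma;
};

struct dma_page_cache {
        struct device *dev;
        unsigned int count;
        struct dma_page_ent ent[PAGE_CACHE_DEPTH];
};

static void dma_page_cache_put(struct dma_page_cache *c,
                               struct page *page, dma_addr_t dma)
{
        if (c->count < PAGE_CACHE_DEPTH) {
                c->ent[c->count].page = page;
                c->ent[c->count].dma = dma;
                c->count++;
                return;
        }
        /* Cache full: hand the page back to the page allocator. */
        dma_unmap_page(c->dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
        put_page(page);
}

static struct page *dma_page_cache_get(struct dma_page_cache *c,
                                       dma_addr_t *dma)
{
        if (!c->count)
                return NULL;    /* caller falls back to alloc_page + dma_map_page */
        c->count--;
        *dma = c->ent[c->count].dma;
        return c->ent[c->count].page;
}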



Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Alexander Duyck
On Mon, Apr 11, 2016 at 2:41 PM, Jesper Dangaard Brouer
 wrote:
>
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:
>
>> >> This is also very interesting for storage targets, which face the same
>> >> issue.  SCST has a mode where it caches some fully constructed SGLs,
>> >> which is probably very similar to what NICs want to do.
>> >
>> > I think a cached allocator for page sets + the scatterlists that
>> > describe these page sets would not only be useful for SCSI target
>> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
>> > code reserves space in each scsi_cmnd for a scatterlist of
>> > SCSI_MAX_SG_SEGMENTS. If scatterlists were cached together with page
>> > sets, less memory would be needed per scsi_cmnd.
>>
>> If we go down this road how about also attaching some driver opaques
>> to the page sets?
>
> That was the ultimate plan... to leave some opaque bytes in the
> page struct that drivers could use.
>
> In struct page I would need a pointer back to my page_pool struct and a
> page flag.  Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs).  And a zero-copy cross-domain
> id.
>
>
> For my packet-page idea, I would need a packet length and an offset
> where data starts (I can derive the "head-room" for encap from these
> two).

Have you taken a look at possibly trying to optimize the DMA pool API
to work with pages?  It sounds like it is supposed to do something
similar to what you are wanting to do.

- Alex


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer

On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:

> >> This is also very interesting for storage targets, which face the same
> >> issue.  SCST has a mode where it caches some fully constructed SGLs,
> >> which is probably very similar to what NICs want to do.  
> >
> > I think a cached allocator for page sets + the scatterlists that
> > describe these page sets would not only be useful for SCSI target
> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > code reserves space in each scsi_cmnd for a scatterlist of
> > SCSI_MAX_SG_SEGMENTS. If scatterlists were cached together with page
> > sets, less memory would be needed per scsi_cmnd.
> 
> If we go down this road how about also attaching some driver opaques
> to the page sets?

That was the ultimate plan... to leave some opaque bytes in the
page struct that drivers could use.

In struct page I would need a pointer back to my page_pool struct and a
page flag.  Then, I would need room to store the dma_unmap address.
(And then some of the usual fields are still needed, like the refcnt,
and reusing some of the list constructs).  And a zero-copy cross-domain
id.


For my packet-page idea, I would need a packet length and an offset
where data starts (I can derive the "head-room" for encap from these
two).
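
Gathered into one illustrative struct, the per-page state listed above
would look roughly like this (only a visualization; in the real
proposal these fields would have to fit inside, or overlay, struct
page):

/*
 * Illustrative only: the per-page state enumerated above, collected in
 * a standalone struct.  Not a proposal for the actual struct page layout.
 */
#include <linux/types.h>
#include <linux/list.h>

struct page_pool;       /* the pool type itself, defined elsewhere */

struct page_pool_meta {
        struct page_pool *pool;         /* pointer back to owning page_pool */
        unsigned long     flags;        /* incl. a dedicated page flag      */
        dma_addr_t        dma_addr;     /* for dma_unmap at release time    */
        atomic_t          refcnt;       /* the usual reference count        */
        struct list_head  list;         /* reuse of the list constructs     */
        u32               xdomain_id;   /* zero-copy cross-domain id        */
        u16               pkt_len;      /* packet length                    */
        u16               data_offset;  /* where data starts (headroom)     */
};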

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Eric Dumazet
On Mon, 2016-04-11 at 21:47 +0200, Jesper Dangaard Brouer wrote:
> On Mon, 11 Apr 2016 09:53:54 -0700
> Eric Dumazet  wrote:
> 
> > On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> > 
> > > Drivers also do tricks where they fall back to smaller order pages. E.g.
> > > look up the function mlx4_alloc_pages().  I've tried to simulate that
> > > function here:
> > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> > >   
> > 
> > We use order-0 pages on mlx4 at Google, as order-3 pages are very
> > dangerous for some kind of attacks...
> 
> Interesting!
> 
> > An out of order TCP packet can hold an order-3 page, while claiming to
> > use 1.5 KB via skb->truesize.
> > 
> > order-0 only pages allow the page recycle trick used by the Intel driver,
> > and we hardly see any page allocations in typical workloads.
> 
> Yes, I looked at the Intel ixgbe driver's page recycle trick.
> 
> It is actually quite cool, but code-wise it is a little hard to
> follow.  I started to look at the variant in i40e; specifically, the
> function i40e_clean_rx_irq_ps() explains it a bit more explicitly.
>  
> 
> > While order-3 pages are 'nice' for friendly datacenter kind of
> > traffic, they also are a higher risk on hosts connected to the wild
> > Internet.
> > 
> > Maybe I should upstream this patch ;)
> 
> Definitely!
> 
> Does this patch also include a page recycle trick?  Otherwise, how do you
> get around the cost of allocating a single order-0 page?
> 

Yes, we use the page recycle trick.

Obviously not on powerpc (or any arch with PAGE_SIZE >= 8192), but
definitely on x86.





Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer
On Mon, 11 Apr 2016 09:53:54 -0700
Eric Dumazet  wrote:

> On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> 
> > Drivers also do tricks where they fall back to smaller order pages. E.g.
> > look up the function mlx4_alloc_pages().  I've tried to simulate that
> > function here:
> > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> >   
> 
> We use order-0 pages on mlx4 at Google, as order-3 pages are very
> dangerous for some kind of attacks...

Interesting!

> An out of order TCP packet can hold an order-3 page, while claiming to
> use 1.5 KB via skb->truesize.
> 
> order-0 only pages allow the page recycle trick used by the Intel driver,
> and we hardly see any page allocations in typical workloads.

Yes, I looked at the Intel ixgbe driver's page recycle trick.

It is actually quite cool, but code-wise it is a little hard to
follow.  I started to look at the variant in i40e; specifically, the
function i40e_clean_rx_irq_ps() explains it a bit more explicitly.
 

> While order-3 pages are 'nice' for friendly datacenter kind of
> traffic, they also are a higher risk on hosts connected to the wild
> Internet.
> 
> Maybe I should upstream this patch ;)

Definitely!

Does this patch also include a page recycle trick?  Otherwise, how do you
get around the cost of allocating a single order-0 page?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer
On Mon, 11 Apr 2016 19:07:03 +0100
Mel Gorman  wrote:

> On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote:
> > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> > >   
> > 
> > The cost decreased to 228 cycles(tsc), but there are some variations;
> > sometimes it increases to 238 cycles(tsc).
> >   
> 
> In the free path, a bulk pcp free adds to the cycles. In the alloc path,
> a refill of the pcp lists costs quite a bit. Either option introduces
> variances. The bulk free path can be optimised a little so I chucked
> some additional patches at it that are not released yet but I suspect the
> benefit will be marginal. The real heavy costs there are splitting/merging
> buddies. Fixing that is much more fundamental but even fronting the allocator
> with a new recycle allocator would not offset that as the refill of the
> page-recycling thing would incur high costs.
>

Yes, re-filling the page-pool (in the non-steady state) could be
problematic for performance.  That is why I'm very motivated to help
out with a bulk alloc/free scheme for the page allocator.

 
> > Nice, but there is still a looong way to go to my performance target, where I
> > can spend 201 cycles for the entire forwarding path
> >   
> 
> While I accept the cost is still too high, I think the effort should still
> be spent on improving the allocator in general than trying to bypass it.
> 

I do think improving the page allocator is very important work.
I just don't see how we can ever reach my performance target without a
page-pool recycle facility.

I work in an area where I think the cost of even a single atomic operation
is too high, so I work on amortizing the individual atomic operations.
That is what I did for the SLUB allocator with the bulk API; see:

Commit d0ecd894e3d5 ("slub: optimize bulk slowpath free by detached freelist")
 https://git.kernel.org/torvalds/c/d0ecd894e3d5

Commit fbd02630c6e3 ("slub: initial bulk free implementation")
 https://git.kernel.org/torvalds/c/fbd02630c6e3
 
This is now also used in the network stack:
 Commit 3134b9f019f2 ("Merge branch 'net-mitigate-kmem_free-slowpath'")
 Commit a3a8749d34d8 ("ixgbe: bulk free SKBs during TX completion cleanup cycle")


> > > This is an unreleased series that contains both the page allocator
> > > optimisations and the one-LRU-per-node series which in combination
> > > remove a lot of code from the page allocator fast paths. I have no data
> > > on how the combined series behaves but each series individually is
> > > known to improve page allocator performance.
> > >
> > > Once you have that, do a hackjob to remove the debugging checks from
> > > both the alloc and free path and see what that leaves. They could be
> > > bypassed properly with a __GFP_NOACCT flag used only by drivers that
> > > absolutely require pages as quickly as possible and willing to be less
> > > safe to get that performance.
> > 
> > I would be interested in testing/benchmarking a patch where you remove
> > the debugging checks...
> >   
> 
> Right now, I'm not proposing to remove the debugging checks despite their
> cost. They catch really difficult problems in the field unfortunately
> including corruption from buggy hardware. A GFP flag that disables them
> for a very specific case would be ok but I expect it to be resisted by
> others if it's done for the general case. Even a static branch for runtime
> debugging checks may be resisted.
> 
> Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on
> the grounds it is of questionable value. Applying that would free a flag
> for __GFP_NOACCT that bypasses debugging checks and statistic updates.
> That would work for the allocation side at least but doing the same for
> the free side would be hard (potentially impossible) to do transparently
> for drivers.

Before spending too much work on something, I usually try to determine
what its maximum benefit would be.  Thus, I propose you create a patch
that hack-removes all the debug checks that you think could be
beneficial to remove.  Then benchmark it yourself, or send it to me for
benchmarking... that is the quickest way to determine whether this is
worth spending time on.


 
> > You are also welcome to try out my benchmarking modules yourself:
> >  
> > https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst
> >   
> 
> I took a quick look and functionally it's similar to the systemtap-based
> microbenchmark I'm using in mmtests so I don't think we have a problem
> with reproduction at the moment.
> 
> > > Be aware that compound order allocs like this are a double edged sword as
> > > it'll be fast sometimes and other times require reclaim/compaction which
> > > can stall for prolonged periods of time.  
> > 
> > Yes, I've noticed that there can be a fairly high variation when doing
> > compound order allocs, which is not so nice!  I really don't like these
> > variations

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Bart Van Assche

On 04/11/2016 11:37 AM, Jesper Dangaard Brouer wrote:

> On Mon, 11 Apr 2016 14:46:25 -0300
> Thadeu Lima de Souza Cascardo  wrote:
>
>> So, Jesper, please take into consideration that this pool design
>> would rather be per device. Otherwise, we allow some device to write
>> into another's device/driver memory.
>
> Yes, that was my intended use.  I want to have a page-pool per device.
> I actually want to go as far as a page-pool per NIC HW RX-ring queue.
>
> Because the other use-case for the page-pool is zero-copy RX.
>
> The NIC HW trick is that we today can create a HW filter in the NIC
> (via ethtool) and place that traffic into a separate RX queue in the
> NIC.  Let's say matching NFS traffic or guest traffic. Then we can allow
> RX zero-copy of these pages, into the application/guest, somehow
> binding it to the RX queue, e.g. introducing a "cross-domain-id" in the
> page-pool page that needs to match.


I think it is important to keep in mind that using a page pool for 
zero-copy RX is specific to protocols that are based on TCP/IP. 
Protocols like FC, SRP and iSER have been designed such that the side 
that allocates the buffers also initiates the data transfer (the target 
side). With TCP/IP however transferring data and allocating receive 
buffers happens by opposite sides of the connection.


Bart.


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer
On Mon, 11 Apr 2016 14:46:25 -0300
Thadeu Lima de Souza Cascardo  wrote:

> So, Jesper, please take into consideration that this pool design
> would rather be per device. Otherwise, we allow some device to write
> into another's device/driver memory.

Yes, that was my intended use.  I want to have a page-pool per device.
I actually want to go as far as a page-pool per NIC HW RX-ring queue.

Because the other use-case for the page-pool is zero-copy RX.

The NIC HW trick is that we today can create a HW filter in the NIC
(via ethtool) and place that traffic into a separate RX queue in the
NIC.  Let's say matching NFS traffic or guest traffic. Then we can allow
RX zero-copy of these pages, into the application/guest, somehow
binding it to the RX queue, e.g. introducing a "cross-domain-id" in the
page-pool page that needs to match.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Mel Gorman
On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote:
> > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> > 
> 
> The cost decreased to 228 cycles(tsc), but there are some variations;
> sometimes it increases to 238 cycles(tsc).
> 

In the free path, a bulk pcp free adds to the cycles. In the alloc path,
a refill of the pcp lists costs quite a bit. Either option introduces
variances. The bulk free path can be optimised a little so I chucked
some additional patches at it that are not released yet but I suspect the
benefit will be marginal. The real heavy costs there are splitting/merging
buddies. Fixing that is much more fundamental but even fronting the allocator
with a new recycle allocator would not offset that as the refill of the
page-recycling thing would incur high costs.

> Nice, but there is still a looong way to go to my performance target, where I
> can spend 201 cycles for the entire forwarding path
> 

While I accept the cost is still too high, I think the effort should still
be spent on improving the allocator in general than trying to bypass it.

> 
> > This is an unreleased series that contains both the page allocator
> > optimisations and the one-LRU-per-node series which in combination remove a
> > lot of code from the page allocator fast paths. I have no data on how the
> > combined series behaves but each series individually is known to improve
> > page allocator performance.
> >
> > Once you have that, do a hackjob to remove the debugging checks from
> > both the alloc and free path and see what that leaves. They could be
> > bypassed properly with a __GFP_NOACCT flag used only by drivers that
> > absolutely require pages as quickly as possible and willing to be less
> > safe to get that performance.
> 
> I would be interested in testing/benchmarking a patch where you remove
> the debugging checks...
> 

Right now, I'm not proposing to remove the debugging checks despite their
cost. They catch really difficult problems in the field unfortunately
including corruption from buggy hardware. A GFP flag that disables them
for a very specific case would be ok but I expect it to be resisted by
others if it's done for the general case. Even a static branch for runtime
debugging checks may be resisted.

Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on
the grounds it is of questionable value. Applying that would free a flag
for __GFP_NOACCT that bypasses debugging checks and statistic updates.
That would work for the allocation side at least but doing the same for
the free side would be hard (potentially impossible) to do transparently
for drivers.

> You are also welcome to try out my benchmarking modules yourself:
>  
> https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst
> 

I took a quick look and functionally it's similar to the systemtap-based
microbenchmark I'm using in mmtests so I don't think we have a problem
with reproduction at the moment.

> > Be aware that compound order allocs like this are a double edged sword as
> > it'll be fast sometimes and other times require reclaim/compaction which
> > can stall for prolonged periods of time.
> 
> Yes, I've noticed that there can be a fairly high variation when doing
> compound order allocs, which is not so nice!  I really don't like these
> variations
> 

They can cripple you which is why I'm very wary of performance patches that
require compound pages. It tends to look great only on benchmarks and then
the corner cases hit in the real world and the bug reports are unpleasant.

> Drivers also do tricks where they fall back to smaller order pages. E.g.
> look up the function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> 
> It does not seem very optimal. I tried to memory-pressure the system a bit
> to cause the alloc_pages() to fail, and then the results were very bad,
> something like 2500 cycles, and it usually got the next order pages.

The options for fallback tend to have one hazard after the next. It's
partially why the last series focused on order-0 pages only.

> > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.
> 
> The order=3 cost was reduced to 417 cycles(tsc), nice!  But I've also
> seen it jump to 611 cycles.
> 

The corner cases can be minimised to some extent -- lazy buddy merging for
example but it unfortunately has other consequences for users that require
high-order pages for functional reasons. I tried something like that once
(http://thread.gmane.org/gmane.linux.kernel/807683) 

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Thadeu Lima de Souza Cascardo
On Mon, Apr 11, 2016 at 12:20:47PM -0400, Matthew Wilcox wrote:
> On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > > On archs like PowerPC, the DMA API is the bottleneck.  To work around
> > > the cost of DMA calls, NIC drivers alloc large order (compound) pages
> > > (dma_map the compound page, hand out page-fragments for the RX ring, and
> > > later dma_unmap when the last RX page-fragment is seen).
> > 
> > So, IMO only holding onto the DMA pages is all that is justified but not a
> > recycle of order-0 pages built on top of the core allocator. For DMA pages,
> > it would take a bit of legwork but the per-cpu allocator could be split
> > and converted to hold arbitrarily sized pages with a constructor/destructor
> > to do the DMA coherency step when pages are taken from or handed back to
> > the core allocator. I'm not volunteering to do that unfortunately but I
> > estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> > aware in which case the memory footprint will be high.
> 
> Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
> could hold onto a cache of recently used mappings and recycle them if that
> still works.  It trades off a bit of security (a device can continue to DMA
> after the memory should no longer be accessible to it) for speed, but then
> so does the per-driver hack of keeping pages around still mapped.
> 

There are two problems with the DMA calls on Power servers. One is scalability. A
new allocation method for the address space would be necessary to fix it.

The other one is the latency or the cost of updating the TCE tables. The only
number I have is that I could push around 1M updates per second. So, we could
guess 1us per operation, which is pretty much a no-no for Jesper's use case.

Your solution could address both, but I am concerned about the security problem.
Here is why I think this problem should be ignored if we go this way. An IOMMU
can be used for three purposes: virtualization, paranoid security and debuggability.

For virtualization, there is a solution already, and it's in place for Power and
x86. Power servers have the ability to enlarge the DMA window, allowing the
entire VM memory to be mapped during PCI driver probe time. After that, dma_map
is a simple sum and dma_unmap is a nop. x86 KVM maps the entire VM memory even
before booting the guest. Unless we want to fix this for old Power servers, I
see no point in fixing it.

Now, if you are using IOMMU on the host with no passthrough or linear system
memory mapping, you are paranoid. It's not just a matter of security, in fact.
It's also a matter of stability. Hardware, firmware and drivers can be buggy,
and they are. When I worked with drivers on Power servers, I found and fixed a
lot of driver bugs that caused the device to write to memory it was not supposed
to. Good thing is that IOMMU prevented that memory write to happen and the
driver would be reset by EEH. If we can make this scenario faster (and if we
want it to be the default, we need to), then your solution might not be desired.
Otherwise, just turn your IOMMU off or put it into passthrough.

Now, the driver keeps pages mapped, but those pages belong to the driver. They
are not pages we decide to give to a userspace process because they are no longer
in use by the driver. So, I don't quite agree this would be a good tradeoff.
Certainly not if we can do it in a way that does not require this.

So, Jesper, please take into consideration that this pool design would rather be
per device. Otherwise, we allow some device to write into another's
device/driver memory.

Cascardo.


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Eric Dumazet
On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:

> Drivers also do tricks where they fall back to smaller order pages. E.g.
> look up the function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69

We use order-0 pages on mlx4 at Google, as order-3 pages are very
dangerous for some kind of attacks...

An out of order TCP packet can hold an order-3 page, while claiming to
use 1.5 KB via skb->truesize.

order-0 only pages allow the page recycle trick used by the Intel driver,
and we hardly see any page allocations in typical workloads.

While order-3 pages are 'nice' for friendly datacenter kind of traffic,
they also are a higher risk on hosts connected to the wild Internet.

Maybe I should upstream this patch ;)






Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Matthew Wilcox
On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > On archs like PowerPC, the DMA API is the bottleneck.  To work around
> > the cost of DMA calls, NIC drivers alloc large order (compound) pages
> > (dma_map the compound page, hand out page-fragments for the RX ring, and
> > later dma_unmap when the last RX page-fragment is seen).
> 
> So, IMO only holding onto the DMA pages is all that is justified but not a
> recycle of order-0 pages built on top of the core allocator. For DMA pages,
> it would take a bit of legwork but the per-cpu allocator could be split
> and converted to hold arbitrarily sized pages with a constructor/destructor
> to do the DMA coherency step when pages are taken from or handed back to
> the core allocator. I'm not volunteering to do that unfortunately but I
> estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> aware in which case the memory footprint will be high.

Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
could hold onto a cache of recently used mappings and recycle them if that
still works.  It trades off a bit of security (a device can continue to DMA
after the memory should no longer be accessible to it) for speed, but then
so does the per-driver hack of keeping pages around still mapped.



Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer

On Mon, 11 Apr 2016 14:08:27 +0100 Mel Gorman  
wrote:
> On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
[...]
> > 
> > It is always great if you can optimize the page allocator.  IMHO the
> > page allocator is too slow.  
> 
> It's why I spent some time on it as any improvement in the allocator is
> an unconditional win without requiring driver modifications.
> 
> > At least for my performance needs (67ns
> > per packet, approx 201 cycles at 3GHz).  I've measured[1]
> > alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc).
> >   
> 
> It'd be worth retrying this with the branch
> 
> http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> 

The cost decreased to 228 cycles(tsc), but there are some variations;
sometimes it increases to 238 cycles(tsc).

Nice, but there is still a looong way to go to my performance target, where I
can spend 201 cycles for the entire forwarding path


> This is an unreleased series that contains both the page allocator
> optimisations and the one-LRU-per-node series which in combination remove a
> lot of code from the page allocator fast paths. I have no data on how the
> combined series behaves but each series individually is known to improve
> page allocator performance.
>
> Once you have that, do a hackjob to remove the debugging checks from both the
> alloc and free path and see what that leaves. They could be bypassed properly
> with a __GFP_NOACCT flag used only by drivers that absolutely require pages
> as quickly as possible and willing to be less safe to get that performance.

I would be interested in testing/benchmarking a patch where you remove
the debugging checks...

You are also welcome to try out my benchmarking modules yourself:
 
https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst

This is really simple stuff (for rapid prototyping); I'm just doing:
 modprobe page_bench01; rmmod page_bench01; dmesg | tail -n40
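
Conceptually, the measurement boils down to a timing loop like this (a
sketch of the idea only, not the actual page_bench01 code):

/*
 * Sketch of an alloc_pages()/__free_pages() timing loop of the kind
 * the cycles(tsc) numbers above come from.  Illustration only.
 */
#include <linux/gfp.h>
#include <linux/timex.h>        /* get_cycles() */
#include <linux/printk.h>

static void bench_order0(unsigned long loops)
{
        cycles_t start, stop;
        struct page *page;
        unsigned long i;

        if (!loops)
                return;

        start = get_cycles();
        for (i = 0; i < loops; i++) {
                page = alloc_pages(GFP_ATOMIC, 0);
                if (unlikely(!page))
                        break;
                __free_pages(page, 0);
        }
        stop = get_cycles();

        pr_info("order-0 alloc+free: %llu cycles per iteration\n",
                (unsigned long long)(stop - start) / loops);
}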

[...]
> 
> Be aware that compound order allocs like this are a double edged sword as
> it'll be fast sometimes and other times require reclaim/compaction which
> can stall for prolonged periods of time.

Yes, I've noticed that there can be a fairly high variation when doing
compound order allocs, which is not so nice!  I really don't like these
variations

Drivers also do tricks where they fall back to smaller order pages. E.g.
look up the function mlx4_alloc_pages().  I've tried to simulate that
function here:
https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69

It does not seem very optimal. I tried to memory-pressure the system a bit
to cause the alloc_pages() to fail, and then the results were very bad,
something like 2500 cycles, and it usually got the next order pages.


> > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.

The order=3 cost was reduced to 417 cycles(tsc), nice!  But I've also
seen it jump to 611 cycles.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-10 Thread Sagi Grimberg



>> This is also very interesting for storage targets, which face the same
>> issue.  SCST has a mode where it caches some fully constructed SGLs,
>> which is probably very similar to what NICs want to do.
>
> I think a cached allocator for page sets + the scatterlists that
> describe these page sets would not only be useful for SCSI target
> implementations but also for the Linux SCSI initiator. Today the scsi-mq
> code reserves space in each scsi_cmnd for a scatterlist of
> SCSI_MAX_SG_SEGMENTS. If scatterlists were cached together with page
> sets, less memory would be needed per scsi_cmnd.


If we go down this road how about also attaching some driver opaques
to the page sets?

I know of some drivers that can make good use of those ;)


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-07 Thread Jesper Dangaard Brouer

On Thu, 07 Apr 2016 12:14:00 -0400 Rik van Riel  wrote:

> On Thu, 2016-04-07 at 08:48 -0700, Chuck Lever wrote:
> > > 
> > > On Apr 7, 2016, at 7:38 AM, Christoph Hellwig 
> > > wrote:
> > > 
> > > This is also very interesting for storage targets, which face the
> > > same issue.  SCST has a mode where it caches some fully constructed
> > > SGLs, which is probably very similar to what NICs want to do.  
> >
> > +1 for NFS server.  
> 
> I have swapped around my slot (into the MM track)
> with Jesper's slot (now a plenary session), since
> there seems to be a fair amount of interest in
> Jesper's proposal from IO and FS people, and my
> topic is more MM specific.

Wow - I'm impressed. I didn't expect such a good slot!
Glad to see the interest!
Thanks!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer




Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-07 Thread Bart Van Assche

On 04/07/16 07:38, Christoph Hellwig wrote:

> This is also very interesting for storage targets, which face the same
> issue.  SCST has a mode where it caches some fully constructed SGLs,
> which is probably very similar to what NICs want to do.


I think a cached allocator for page sets + the scatterlists that 
describe these page sets would not only be useful for SCSI target 
implementations but also for the Linux SCSI initiator. Today the scsi-mq 
code reserves space in each scsi_cmnd for a scatterlist of 
SCSI_MAX_SG_SEGMENTS. If scatterlists were cached together with page 
sets, less memory would be needed per scsi_cmnd. See also 
scsi_mq_setup_tags() and scsi_alloc_sgtable().
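
As a rough illustration, the scatterlist part of such a cached "page
set + scatterlist" object could be built with the generic scatterlist
helpers (a sketch only, not the scsi-mq code):

/*
 * Generic sketch: describing a cached set of pages with a scatterlist,
 * as a caller of such an allocator might receive it.
 */
#include <linux/scatterlist.h>
#include <linux/mm.h>

static int build_sgl(struct scatterlist *sgl, struct page **pages,
                     unsigned int npages)
{
        unsigned int i;

        sg_init_table(sgl, npages);
        for (i = 0; i < npages; i++)
                sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

        return 0;
}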


Bart.


Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-07 Thread Chuck Lever

> On Apr 7, 2016, at 7:38 AM, Christoph Hellwig  wrote:
> 
> This is also very interesting for storage targets, which face the same
> issue.  SCST has a mode where it caches some fully constructed SGLs,
> which is probably very similar to what NICs want to do.

+1 for NFS server.


--
Chuck Lever