Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-11 Thread Michael S. Tsirkin
On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> 
> On 2019/3/12 上午2:14, David Miller wrote:
> > From: "Michael S. Tsirkin" 
> > Date: Mon, 11 Mar 2019 09:59:28 -0400
> > 
> > > On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> > > > On 2019/3/8 下午10:12, Christoph Hellwig wrote:
> > > > > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > > > > This series tries to access virtqueue metadata through kernel
> > > > > > virtual addresses instead of the copy_user() friends, since those
> > > > > > have too much overhead: checks, spec barriers or even hardware
> > > > > > feature toggling. This is done by setting up a kernel address
> > > > > > through vmap() and registering an MMU notifier for invalidation.
> > > > > > 
> > > > > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> > > > > > obvious improvement.
> > > > > How is this going to work for CPUs with virtually tagged caches?
> > > > 
> > > > Anything different that you worry about?
> > > If caches have virtual tags then kernel and userspace view of memory
> > > might not be automatically in sync if they access memory
> > > through different virtual addresses. You need to do things like
> > > flush_cache_page, probably multiple times.
> > "flush_dcache_page()"
> 
> 
> I get this. Then I think the current set_bit_to_user() is suspicious, we
> probably miss a flush_dcache_page() there:
> 
> 
> static int set_bit_to_user(int nr, void __user *addr)
> {
>     unsigned long log = (unsigned long)addr;
>     struct page *page;
>     void *base;
>     int bit = nr + (log % PAGE_SIZE) * 8;
>     int r;
> 
>     r = get_user_pages_fast(log, 1, 1, &page);
>     if (r < 0)
>         return r;
>     BUG_ON(r != 1);
>     base = kmap_atomic(page);
>     set_bit(bit, base);
>     kunmap_atomic(base);
>     set_page_dirty_lock(page);
>     put_page(page);
>     return 0;
> }
> 
> Thanks

I think you are right. The correct fix though is to re-implement
it using asm and handling the pagefault, not gup.
Three atomic ops per bit is way too expensive.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Michael S. Tsirkin
On Tue, Mar 12, 2019 at 10:56:20AM +0800, Jason Wang wrote:
> 
> On 2019/3/11 下午9:43, Andrea Arcangeli wrote:
> > On Mon, Mar 11, 2019 at 08:48:37AM -0400, Michael S. Tsirkin wrote:
> > > Using copyXuser is better I guess.
> > It certainly would be faster there, but I don't think it's needed if
> > that would be the only use case left that justifies supporting two
> > different models. On small 32bit systems with little RAM kmap won't
> > perform measurably differently on 32bit or 64bit systems. If the 32bit
> > host has a lot of RAM it all gets slow anyway at accessing RAM above
> > the direct mapping, if compared to 64bit host kernels, it's not just
> > an issue for vhost + mmu notifier + kmap and the best way to optimize
> > things is to run 64bit host kernels.
> > 
> > Like Christoph pointed out, the main use case for retaining the
> > copy-user model would be CPUs with virtually indexed not physically
> > tagged data caches (they'll still suffer from the spectre-v1 fix,
> > although I rule out that they have to suffer the SMAP
> > slowdown/feature). Those may require some additional flushing beyond what
> > the current copy-user model requires.
> > 
> > As a rule of thumb any arch where copy_user_page doesn't define as
> > copy_page will require some additional cache flushing after the
> > kmap. Supposedly with vmap, the vmap layer should have taken care of
> > that (I didn't verify that yet).
> 
> 
> vmap_page_range()/free_unmap_vmap_area() will call
> flush_cache_vmap()/flush_cache_vunmap(). So the vmap layer should be ok.
> 
> Thanks

You only unmap from mmu notifier though.
You don't do it after any access.
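
A rough sketch of what "doing it after an access" could look like, assuming
hypothetical vq->used_pages[]/vq->used_npages fields holding the pages behind
the vmap (these fields are made up for illustration, this is not the patch):

static void vhost_flush_used_ring(struct vhost_virtqueue *vq)
{
        int i;

        /* vhost just wrote the used ring through the vmap'ed kernel alias;
         * push the writes out so the userspace/guest alias sees them on
         * virtually tagged caches */
        for (i = 0; i < vq->used_npages; i++)
                flush_dcache_page(vq->used_pages[i]);
}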

> 
> > 
> > There are some accessories like copy_to_user_page()
> > copy_from_user_page() that could work and obviously defines to raw
> > memcpy on x86 (the main cons is they don't provide word granular
> > access) and at least on sparc they're tailored to ptrace assumptions
> > so then we'd need to evaluate what happens if this is used outside of
> > ptrace context. kmap has been used generally either to access whole
> > pages (i.e. copy_user_page), so ptrace may actually be the only use
> > case with subpage granularity access.
> > 
> > #define copy_to_user_page(vma, page, vaddr, dst, src, len)      \
> > do {                                                             \
> >         flush_cache_page(vma, vaddr, page_to_pfn(page));         \
> >         memcpy(dst, src, len);                                   \
> >         flush_ptrace_access(vma, page, vaddr, src, len, 0);      \
> > } while (0)
> > 
> > So I wouldn't rule out the need for a dual model, until we solve how
> > to run this stable on non-x86 arches with not physically tagged
> > caches.
> > 
> > Thanks,
> > Andrea
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Michael S. Tsirkin
On Tue, Mar 12, 2019 at 10:52:15AM +0800, Jason Wang wrote:
> 
> On 2019/3/11 下午8:48, Michael S. Tsirkin wrote:
> > On Mon, Mar 11, 2019 at 03:40:31PM +0800, Jason Wang wrote:
> > > On 2019/3/9 上午3:48, Andrea Arcangeli wrote:
> > > > Hello Jason,
> > > > 
> > > > On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:
> > > > > Just to make sure I understand here. For boosting through huge TLB, do
> > > > > you mean we can do that in the future (e.g. by mapping more userspace
> > > > > pages to kernel) or it can be done by this series (only about three 4K
> > > > > pages were vmapped per virtqueue)?
> > > > When I answered about the advantages of mmu notifier and I mentioned
> > > > guaranteed 2m/gigapages where available, I overlooked the detail you
> > > > were using vmap instead of kmap. So with vmap you're actually doing
> > > > the opposite, it slows down the access because it will always use a 4k
> > > > TLB even if QEMU runs on THP or gigapages hugetlbfs.
> > > > 
> > > > If there's just one page (or a few pages) in each vmap there's no need
> > > > of vmap, the linearity vmap provides doesn't pay off in such
> > > > case.
> > > > 
> > > > So likely there's further room for improvement here that you can
> > > > achieve in the current series by just dropping vmap/vunmap.
> > > > 
> > > > You can just use kmap (or kmap_atomic if you're in preemptible
> > > > section, should work from bh/irq).
> > > > 
> > > > In short the mmu notifier to invalidate only sets a "struct page *
> > > > userringpage" pointer to NULL without calls to vunmap.
> > > > 
> > > > In all cases immediately after gup_fast returns you can always call
> > > > put_page immediately (which explains why I'd like an option to drop
> > > > FOLL_GET from gup_fast to speed it up).
> > > > 
> > > > Then you can check the sequence_counter and inc/dec counter increased
> > > > by _start/_end. That will tell you if the page you got and you called
> > > > put_page to immediately unpin it or even to free it, cannot go away
> > > > under you until the invalidate is called.
> > > > 
> > > > If sequence counters and counter tell that gup_fast raced with any
> > > > mmu notifier invalidate you can just repeat gup_fast. Otherwise you're
> > > > done, the page cannot go away under you, the host virtual to host
> > > > physical mapping cannot change either. And the page is not pinned
> > > > either. So you can just set the "struct page * userringpage = page"
> > > > where "page" was the one setup by gup_fast.
> > > > 
> > > > When later the invalidate runs, you can just call set_page_dirty if
> > > > gup_fast was called with "write = 1" and then you clear the pointer
> > > > "userringpage = NULL".
> > > > 
> > > > When you need to read/write to the memory
> > > > kmap/kmap_atomic(userringpage) should work.
> > > Yes, I've considered kmap() from the start. The reason I don't do that is
> > > that a large virtqueue may need more than one page, so the VA might not be
> > > contiguous. But this is probably not a big issue; it just needs more tricks
> > > in the vhost memory accessors.
> > > 
> > > 
> > > > In short because there's no hardware involvement here, the established
> > > > mapping is just the pointer to the page, there is no need of setting
> > > > up any pagetables or to do any TLB flushes (except on 32bit archs if
> > > > the page is above the direct mapping but it never happens on 64bit
> > > > archs).
> > > I see, I believe we don't care much about the performance of 32bit archs 
> > > (or
> > > we can just fallback to copy_to_user() friends).
> > Using copyXuser is better I guess.
> 
> 
> Ok.
> 
> 
> > 
> > > Using direct mapping (I
> > > guess kernel will always try hugepage for that?) should be better and we 
> > > can
> > > even use it for the data transfer not only for the metadata.
> > > 
> > > Thanks
> > We can't really. The big issue is get user pages. Doing that on data
> > path will be slower than copyXuser.
> 
> 
> I meant if we can find a way to avoid doing gup in the datapath. E.g. vhost
> maintains a range tree and adds or removes ranges through the MMU notifier.
> Then in the datapath, if we find the range, use the direct mapping, otherwise
> copy_to_user().
> 
> Thanks

We can try. But I'm not sure there's any reason to think there's any
locality there.
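
To make the idea concrete, a rough sketch of that lookup-or-fallback shape,
with a hypothetical vhost_find_range() backed by a tree that the MMU notifier
populates and invalidates (all names here are made up for illustration):

static int vhost_write_guest(struct vhost_virtqueue *vq,
                             void __user *uaddr, const void *src, size_t len)
{
        /* hypothetical: returns a kernel alias cached for this VA range,
         * or NULL when the range is not (or no longer) tracked */
        void *kaddr = vhost_find_range(vq, (unsigned long)uaddr, len);

        if (kaddr) {
                memcpy(kaddr, src, len);        /* fast path: direct mapping */
                return 0;
        }
        /* slow path: fall back to the usual uaccess route */
        return copy_to_user(uaddr, src, len) ? -EFAULT : 0;
}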

> 
> >   Or maybe it won't with the
> > amount of mitigations spread around. Go ahead and try.
> > 
> > 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-11 Thread Jason Wang


On 2019/3/12 上午2:14, David Miller wrote:

From: "Michael S. Tsirkin" 
Date: Mon, 11 Mar 2019 09:59:28 -0400


On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:

On 2019/3/8 下午10:12, Christoph Hellwig wrote:

On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:

This series tries to access virtqueue metadata through kernel virtual
addresses instead of the copy_user() friends, since those have too much
overhead: checks, spec barriers or even hardware feature
toggling. This is done by setting up a kernel address through vmap() and
registering an MMU notifier for invalidation.

Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
obvious improvement.

How is this going to work for CPUs with virtually tagged caches?


Anything different that you worry about?

If caches have virtual tags then kernel and userspace view of memory
might not be automatically in sync if they access memory
through different virtual addresses. You need to do things like
flush_cache_page, probably multiple times.

"flush_dcache_page()"



I get this. Then I think the current set_bit_to_user() is suspicious, we 
probably miss a flush_dcache_page() there:



static int set_bit_to_user(int nr, void __user *addr)
{
    unsigned long log = (unsigned long)addr;
    struct page *page;
    void *base;
    int bit = nr + (log % PAGE_SIZE) * 8;
    int r;

    r = get_user_pages_fast(log, 1, 1, &page);
    if (r < 0)
        return r;
    BUG_ON(r != 1);
    base = kmap_atomic(page);
    set_bit(bit, base);
    kunmap_atomic(base);
    set_page_dirty_lock(page);
    put_page(page);
    return 0;
}

Thanks
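
A minimal sketch (not from the original mail) of where the suspected
flush_dcache_page() could slot into the helper quoted above; whether it is
actually required depends on the architecture's cache:

static int set_bit_to_user(int nr, void __user *addr)
{
        unsigned long log = (unsigned long)addr;
        struct page *page;
        void *base;
        int bit = nr + (log % PAGE_SIZE) * 8;
        int r;

        r = get_user_pages_fast(log, 1, 1, &page);
        if (r < 0)
                return r;
        BUG_ON(r != 1);
        base = kmap_atomic(page);
        set_bit(bit, base);
        kunmap_atomic(base);
        /* make the kernel-side write visible to the userspace alias on
         * virtually tagged / aliasing data caches */
        flush_dcache_page(page);
        set_page_dirty_lock(page);
        put_page(page);
        return 0;
}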

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Jason Wang


On 2019/3/11 下午9:43, Andrea Arcangeli wrote:

On Mon, Mar 11, 2019 at 08:48:37AM -0400, Michael S. Tsirkin wrote:

Using copyXuser is better I guess.

It certainly would be faster there, but I don't think it's needed if
that would be the only use case left that justifies supporting two
different models. On small 32bit systems with little RAM kmap won't
perform measurably differently on 32bit or 64bit systems. If the 32bit
host has a lot of RAM it all gets slow anyway at accessing RAM above
the direct mapping, if compared to 64bit host kernels, it's not just
an issue for vhost + mmu notifier + kmap and the best way to optimize
things is to run 64bit host kernels.

Like Christoph pointed out, the main use case for retaining the
copy-user model would be CPUs with virtually indexed not physically
tagged data caches (they'll still suffer from the spectre-v1 fix,
although I rule out that they have to suffer the SMAP
slowdown/feature). Those may require some additional flushing beyond what
the current copy-user model requires.

As a rule of thumb any arch where copy_user_page doesn't define as
copy_page will require some additional cache flushing after the
kmap. Supposedly with vmap, the vmap layer should have taken care of
that (I didn't verify that yet).



vmap_page_range()/free_unmap_vmap_area() will call 
flush_cache_vmap()/flush_cache_vunmap(). So the vmap layer should be ok.


Thanks




There are some accessories like copy_to_user_page()
copy_from_user_page() that could work and obviously defines to raw
memcpy on x86 (the main cons is they don't provide word granular
access) and at least on sparc they're tailored to ptrace assumptions
so then we'd need to evaluate what happens if this is used outside of
ptrace context. kmap has been used generally either to access whole
pages (i.e. copy_user_page), so ptrace may actually be the only use
case with subpage granularity access.

#define copy_to_user_page(vma, page, vaddr, dst, src, len)      \
do {                                                             \
        flush_cache_page(vma, vaddr, page_to_pfn(page));         \
        memcpy(dst, src, len);                                   \
        flush_ptrace_access(vma, page, vaddr, src, len, 0);      \
} while (0)

So I wouldn't rule out the need for a dual model, until we solve how
to run this stable on non-x86 arches with not physically tagged
caches.

Thanks,
Andrea

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Jason Wang


On 2019/3/11 下午8:48, Michael S. Tsirkin wrote:

On Mon, Mar 11, 2019 at 03:40:31PM +0800, Jason Wang wrote:

On 2019/3/9 上午3:48, Andrea Arcangeli wrote:

Hello Jason,

On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:

Just to make sure I understand here. For boosting through huge TLB, do
you mean we can do that in the future (e.g. by mapping more userspace
pages to kernel) or it can be done by this series (only about three 4K
pages were vmapped per virtqueue)?

When I answered about the advantages of mmu notifier and I mentioned
guaranteed 2m/gigapages where available, I overlooked the detail you
were using vmap instead of kmap. So with vmap you're actually doing
the opposite, it slows down the access because it will always use a 4k
TLB even if QEMU runs on THP or gigapages hugetlbfs.

If there's just one page (or a few pages) in each vmap there's no need
of vmap, the linearity vmap provides doesn't pay off in such
case.

So likely there's further room for improvement here that you can
achieve in the current series by just dropping vmap/vunmap.

You can just use kmap (or kmap_atomic if you're in preemptible
section, should work from bh/irq).

In short the mmu notifier to invalidate only sets a "struct page *
userringpage" pointer to NULL without calls to vunmap.

In all cases immediately after gup_fast returns you can always call
put_page immediately (which explains why I'd like an option to drop
FOLL_GET from gup_fast to speed it up).

Then you can check the sequence_counter and inc/dec counter increased
by _start/_end. That will tell you if the page you got and you called
put_page to immediately unpin it or even to free it, cannot go away
under you until the invalidate is called.

If sequence counters and counter tell that gup_fast raced with any
mmu notifier invalidate you can just repeat gup_fast. Otherwise you're
done, the page cannot go away under you, the host virtual to host
physical mapping cannot change either. And the page is not pinned
either. So you can just set the "struct page * userringpage = page"
where "page" was the one setup by gup_fast.

When later the invalidate runs, you can just call set_page_dirty if
gup_fast was called with "write = 1" and then you clear the pointer
"userringpage = NULL".

When you need to read/write to the memory
kmap/kmap_atomic(userringpage) should work.

Yes, I've considered kmap() from the start. The reason I don't do that is
that a large virtqueue may need more than one page, so the VA might not be
contiguous. But this is probably not a big issue; it just needs more tricks
in the vhost memory accessors.
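
One way those "tricks" could look: a sketch of an accessor that kmaps one
ring page at a time, so the virtqueue never needs a contiguous kernel VA
(hypothetical helper, not vhost code):

static void ring_write(struct page **pages, size_t offset,
                       const void *src, size_t len)
{
        const char *s = src;

        while (len) {
                size_t page_off = offset & ~PAGE_MASK;
                size_t chunk = min_t(size_t, len, PAGE_SIZE - page_off);
                char *dst = kmap_atomic(pages[offset >> PAGE_SHIFT]);

                /* copy into this page only; a write that crosses a page
                 * boundary simply loops and maps the next page */
                memcpy(dst + page_off, s, chunk);
                kunmap_atomic(dst);

                offset += chunk;
                s += chunk;
                len -= chunk;
        }
}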



In short because there's no hardware involvement here, the established
mapping is just the pointer to the page, there is no need of setting
up any pagetables or to do any TLB flushes (except on 32bit archs if
the page is above the direct mapping but it never happens on 64bit
archs).

I see, I believe we don't care much about the performance of 32bit archs (or
we can just fallback to copy_to_user() friends).

Using copyXuser is better I guess.



Ok.





Using direct mapping (I
guess kernel will always try hugepage for that?) should be better and we can
even use it for the data transfer not only for the metadata.

Thanks

We can't really. The big issue is get user pages. Doing that on data
path will be slower than copyXuser.



I meant if we can find a way to avoid doing gup in the datapath. E.g. vhost
maintains a range tree and adds or removes ranges through the MMU notifier.
Then in the datapath, if we find the range, use the direct mapping,
otherwise copy_to_user().


Thanks



  Or maybe it won't with the
amount of mitigations spread around. Go ahead and try.



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 1/1] virtio_blk: replace 0 by HCTX_TYPE_DEFAULT to index blk_mq_tag_set->map

2019-03-11 Thread Dongli Zhang
Use HCTX_TYPE_DEFAULT instead of 0 to avoid hardcoding.

Signed-off-by: Dongli Zhang 
---
This is the only place to clean up under drivers/block/*. Another patch set
is sent to scsi, and then all of the below are cleaned up:
- block/*
- drivers/*

 drivers/block/virtio_blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..bed6035 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -691,7 +691,8 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set)
 {
struct virtio_blk *vblk = set->driver_data;
 
-   return blk_mq_virtio_map_queues(&set->map[0], vblk->vdev, 0);
+   return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT],
+   vblk->vdev, 0);
 }
 
 #ifdef CONFIG_VIRTIO_BLK_SCSI
-- 
2.7.4
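
For context, the map index being replaced corresponds to the hctx_type enum
in include/linux/blk-mq.h, where HCTX_TYPE_DEFAULT happens to be 0 (reproduced
from memory, so treat the comments as approximate):

enum hctx_type {
        HCTX_TYPE_DEFAULT,      /* all I/O not otherwise accounted for */
        HCTX_TYPE_READ,         /* just for READ I/O */
        HCTX_TYPE_POLL,         /* polled I/O of any kind */

        HCTX_MAX_TYPES,
};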

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] VMCI: Use BIT() macro for bit definitions

2019-03-11 Thread Randy Dunlap
On 3/11/19 3:09 PM, Vishnu DASA wrote:
> No functional changes, cleanup only.
> 
> Reviewed-by: Adit Ranadive 
> Reviewed-by: Jorgen Hansen 
> Signed-off-by: Vishnu Dasa 
> ---
>  include/linux/vmw_vmci_defs.h | 34 +-
>  1 file changed, 17 insertions(+), 17 deletions(-)
> 

Now this header file needs to #include <linux/bits.h>
or <linux/bitops.h> for the BIT() macro.

Do the users of this header file build cleanly anyway?
Even if so, we prefer not to depend on implicit or accidental header
file inclusions that may change in the future.
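
A minimal sketch of the include being asked for (whether bits.h or bitops.h
is preferred, and where it lands among the header's existing includes, is
left to the patch author):

--- a/include/linux/vmw_vmci_defs.h
+++ b/include/linux/vmw_vmci_defs.h
@@
+#include <linux/bits.h> /* for BIT() */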

> diff --git a/include/linux/vmw_vmci_defs.h b/include/linux/vmw_vmci_defs.h
> index eaa1e762bf06..007e28053da8 100644
> --- a/include/linux/vmw_vmci_defs.h
> +++ b/include/linux/vmw_vmci_defs.h
> @@ -33,27 +33,27 @@
>  #define VMCI_MAX_DEVICES 1
>  
>  /* Status register bits. */
> -#define VMCI_STATUS_INT_ON 0x1
> +#define VMCI_STATUS_INT_ON BIT(0)
>  
>  /* Control register bits. */
> -#define VMCI_CONTROL_RESET        0x1
> -#define VMCI_CONTROL_INT_ENABLE   0x2
> -#define VMCI_CONTROL_INT_DISABLE  0x4
> +#define VMCI_CONTROL_RESET        BIT(0)
> +#define VMCI_CONTROL_INT_ENABLE   BIT(1)
> +#define VMCI_CONTROL_INT_DISABLE  BIT(2)
>  
>  /* Capabilities register bits. */
> -#define VMCI_CAPS_HYPERCALL 0x1
> -#define VMCI_CAPS_GUESTCALL 0x2
> -#define VMCI_CAPS_DATAGRAM  0x4
> -#define VMCI_CAPS_NOTIFICATIONS 0x8
> -#define VMCI_CAPS_PPN64 0x10
> +#define VMCI_CAPS_HYPERCALL BIT(0)
> +#define VMCI_CAPS_GUESTCALL BIT(1)
> +#define VMCI_CAPS_DATAGRAM  BIT(2)
> +#define VMCI_CAPS_NOTIFICATIONS BIT(3)
> +#define VMCI_CAPS_PPN64 BIT(4)
>  
>  /* Interrupt Cause register bits. */
> -#define VMCI_ICR_DATAGRAM  0x1
> -#define VMCI_ICR_NOTIFICATION  0x2
> +#define VMCI_ICR_DATAGRAM  BIT(0)
> +#define VMCI_ICR_NOTIFICATION  BIT(1)
>  
>  /* Interrupt Mask register bits. */
> -#define VMCI_IMR_DATAGRAM  0x1
> -#define VMCI_IMR_NOTIFICATION  0x2
> +#define VMCI_IMR_DATAGRAM  BIT(0)
> +#define VMCI_IMR_NOTIFICATION  BIT(1)
>  
>  /* Maximum MSI/MSI-X interrupt vectors in the device. */
>  #define VMCI_MAX_INTRS 2
> @@ -463,9 +463,9 @@ struct vmci_datagram {
>   * datagram callback is invoked in a delayed context (not interrupt context).
>   */
>  #define VMCI_FLAG_DG_NONE  0
> -#define VMCI_FLAG_WELLKNOWN_DG_HND 0x1
> -#define VMCI_FLAG_ANYCID_DG_HND    0x2
> -#define VMCI_FLAG_DG_DELAYED_CB    0x4
> +#define VMCI_FLAG_WELLKNOWN_DG_HND BIT(0)
> +#define VMCI_FLAG_ANYCID_DG_HND    BIT(1)
> +#define VMCI_FLAG_DG_DELAYED_CB    BIT(2)
>  
>  /*
>   * Maximum supported size of a VMCI datagram for routable datagrams.
> @@ -694,7 +694,7 @@ struct vmci_qp_detach_msg {
>  };
>  
>  /* VMCI Doorbell API. */
> -#define VMCI_FLAG_DELAYED_CB 0x01
> +#define VMCI_FLAG_DELAYED_CB BIT(0)
>  
>  typedef void (*vmci_callback) (void *client_data);
>  
> 


-- 
~Randy
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] drm/bochs: Fix NULL dereference on atomic_disable helper

2019-03-11 Thread Daniel Vetter
On Mon, Mar 11, 2019 at 02:49:58PM -0300, Rodrigo Siqueira wrote:
> On 03/11, Gerd Hoffmann wrote:
> >   Hi,
> > 
> > > > IIRC the drm code checks for the atomic_enable callback presence to
> > > > figure out whether it should take the atomic or legacy code paths.
> > > 
> > > It should check for drm_driver->mode_config.funcs.atomic_commit for that,
> > > see drm_drv_uses_atomic_modeset(). Anything else should be a bug.
> > > 
> > > Or do you mean the fallback to the old crtc helper prepare/commit
> > > callbacks?
> > 
> > Probably the latter.  There was some reason why I've left in the empty
> > bochs_crtc_atomic_enable() callback ...
> 
> Just for checking before I start to work on this patch:
> The correct solution should be to make atomic_enable and atomic_disable
> optional, right? I should do it, and check if the Bochs driver really needs
> bochs_crtc_atomic_enable after my change, right?

Yup. I just tried to remember why we haven't done this yet, but I think
that was a patch to make crtc->helper_funcs optional. And that doesn't
make sense imo, since if your crtc doesn't do anything then you don't
really have an atomic driver :-) And if there's ever a legit use case for
this, then that driver probably shouldn't use the atomic helpers ...

But making crtc_helper_funcs->atomic_enable/disable optional sounds like a
good idea.
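
Roughly, making them optional boils down to a NULL check at the helper call
sites, along these lines (a sketch only, with an assumed local context, not
the actual drm_atomic_helper.c change):

        const struct drm_crtc_helper_funcs *funcs = crtc->helper_private;

        /* skip the hook when the driver leaves it unset */
        if (funcs->atomic_enable)
                funcs->atomic_enable(crtc, old_crtc_state);
        /* bochs could then drop its empty stub instead of keeping it */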
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-11 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Mon, 11 Mar 2019 09:59:28 -0400

> On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
>> 
>> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
>> > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
>> > > This series tries to access virtqueue metadata through kernel virtual
>> > > addresses instead of the copy_user() friends, since those have too much
>> > > overhead: checks, spec barriers or even hardware feature
>> > > toggling. This is done by setting up a kernel address through vmap() and
>> > > registering an MMU notifier for invalidation.
>> > > 
>> > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
>> > > obvious improvement.
>> > How is this going to work for CPUs with virtually tagged caches?
>> 
>> 
>> Anything different that you worry about?
> 
> If caches have virtual tags then kernel and userspace view of memory
> might not be automatically in sync if they access memory
> through different virtual addresses. You need to do things like
> flush_cache_page, probably multiple times.

"flush_dcache_page()"
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/5] Clean up TTM mmap offsets

2019-03-11 Thread Daniel Vetter
On Mon, Mar 11, 2019 at 05:51:39PM +0100, Christian König wrote:
> Am 11.03.19 um 17:39 schrieb Hans de Goede:
> > Hi,
> > 
> > On 07-02-19 09:59, Thomas Zimmermann wrote:
> > > Almost all TTM-based drivers use the same values for the mmap-able
> > > range of BO addresses. Each driver therefore duplicates the
> > > DRM_FILE_PAGE_OFFSET constant. OTOH, the mmap range's size is not
> > > configurable by drivers.
> > > 
> > > This patch set replaces driver-specific configuration with a single
> > > setup. All code is located within TTM. TTM and GEM share the same
> > > range for mmap-able BOs.
> > > 
> > > Thomas Zimmermann (5):
> > >    staging/vboxvideo: Use same BO mmap offset as other drivers
> > >    drm/ttm: Define a single DRM_FILE_PAGE_OFFSET constant
> > >    drm/ttm: Remove file_page_offset parameter from ttm_bo_device_init()
> > >    drm/ttm: Quick-test mmap offset in ttm_bo_mmap()
> > >    drm: Use the same mmap-range offset and size for GEM and TTM
> > 
> > Note I'm about to push a patch-series to drm-misc-next which moves
> > vboxvideo out of staging and I see that this series has not landed
> > in drm-misc-next yet, so it will need to be rebased.
> 
> Mhm, TTM is usually not pushed upstream through drm-misc-next, so that will
> certainly collide with the next TTM pull request.
> 
> So can you wait with that or should I make an exception and merge this
> change through drm-misc-next?

Other options:
- Get amdgpu added to drm-tip and linux-next so we have a heads-up about
  the conflict. That's usually good enough to avoid the broken merge
  conflict.
- Do a topic branch, pull it into both trees.
- Really stuff ttm into a shared tree if it's meant to be shared
  infrastructure :-P

Waiting+rebasing is imo the worst option, and usually not needed.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/5] Clean up TTM mmap offsets

2019-03-11 Thread Hans de Goede

Hi,

On 11-03-19 17:51, Christian König wrote:

Am 11.03.19 um 17:39 schrieb Hans de Goede:

Hi,

On 07-02-19 09:59, Thomas Zimmermann wrote:

Almost all TTM-based drivers use the same values for the mmap-able
range of BO addresses. Each driver therefore duplicates the
DRM_FILE_PAGE_OFFSET constant. OTOH, the mmap range's size is not
configurable by drivers.

This patch set replaces driver-specific configuration with a single
setup. All code is located within TTM. TTM and GEM share the same
range for mmap-able BOs.

Thomas Zimmermann (5):
   staging/vboxvideo: Use same BO mmap offset as other drivers
   drm/ttm: Define a single DRM_FILE_PAGE_OFFSET constant
   drm/ttm: Remove file_page_offset parameter from ttm_bo_device_init()
   drm/ttm: Quick-test mmap offset in ttm_bo_mmap()
   drm: Use the same mmap-range offset and size for GEM and TTM


Note I'm about to push a patch-series to drm-misc-next which moves
vboxvideo out of staging and I see that this series has not landed
in drm-misc-next yet, so it will need to be rebased.


Mhm, TTM is usually not pushed upstream through drm-misc-next, so that will 
certainly collide with the next TTM pull request.


Ugh, I didn't realize that this series would not be going through drm-misc-next.


So can you wait with that or should I make an exception and merge this change 
through drm-misc-next?


I've already pushed it now :| My mail was intended more as a heads-up than
because I expected an objection, sorry.

I see 2 possible solutions:

1) Merge drm-misc-next into the ttm tree (probably the cleanest)
2) Push your series through drm-misc-next

Regards,

Hans
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 0/5] Clean up TTM mmap offsets

2019-03-11 Thread Hans de Goede

Hi,

On 07-02-19 09:59, Thomas Zimmermann wrote:

Almost all TTM-based drivers use the same values for the mmap-able
range of BO addresses. Each driver therefore duplicates the
DRM_FILE_PAGE_OFFSET constant. OTOH, the mmap range's size is not
configurable by drivers.

This patch set replaces driver-specific configuration with a single
setup. All code is located within TTM. TTM and GEM share the same
range for mmap-able BOs.

Thomas Zimmermann (5):
   staging/vboxvideo: Use same BO mmap offset as other drivers
   drm/ttm: Define a single DRM_FILE_PAGE_OFFSET constant
   drm/ttm: Remove file_page_offset parameter from ttm_bo_device_init()
   drm/ttm: Quick-test mmap offset in ttm_bo_mmap()
   drm: Use the same mmap-range offset and size for GEM and TTM


Note I'm about to push a patch-series to drm-misc-next which moves
vboxvideo out of staging and I see that this series has not landed
in drm-misc-next yet, so it will need to be rebased.

Regards,

Hans

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Jan Kara
On Thu 07-03-19 16:27:17, Andrea Arcangeli wrote:
> > driver that GUP page for hours/days/weeks/months ... obviously the
> > race window is big enough here. It affects many fs (ext4, xfs, ...)
> > in different ways. I think ext4 is the most obvious because of the
> > kernel log trace it leaves behind.
> > 
> > Bottom line is for set_page_dirty to be safe you need the following:
> > lock_page()
> > page_mkwrite()
> > set_pte_with_write()
> > unlock_page()
> 
> I also wondered why ext4 writepage doesn't recreate the bh if they got
> dropped by the VM and page->private is 0. I mean, page->index and
> page->mapping are still there, that's enough info for writepage itself
> to take a slow path and call page_mkwrite to find where to write the
> page on disk.

There are two problems:

1) What to do with errors that page_mkwrite() can generate (ENOMEM, ENOSPC,
EIO). On page fault you just propagate them to userspace, on set_page_dirty()
you have no chance, so you just silently lose data.

2) We need various locks to protect page_mkwrite(), possibly do some IO.
set_page_dirty() is a rather uncertain context in which to acquire locks or do IO...

Honza
-- 
Jan Kara 
SUSE Labs, CR
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-11 Thread Michael S. Tsirkin
On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> 
> On 2019/3/8 下午10:12, Christoph Hellwig wrote:
> > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > This series tries to access virtqueue metadata through kernel virtual
> > > addresses instead of the copy_user() friends, since those have too much
> > > overhead: checks, spec barriers or even hardware feature
> > > toggling. This is done by setting up a kernel address through vmap() and
> > > registering an MMU notifier for invalidation.
> > > 
> > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
> > > obvious improvement.
> > How is this going to work for CPUs with virtually tagged caches?
> 
> 
> Anything different that you worry about?

If caches have virtual tags then kernel and userspace view of memory
might not be automatically in sync if they access memory
through different virtual addresses. You need to do things like
flush_cache_page, probably multiple times.

> I can run a test, but do you know any
> archs that use virtually tagged caches?

sparc I believe.

> Thanks



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Andrea Arcangeli
On Mon, Mar 11, 2019 at 08:48:37AM -0400, Michael S. Tsirkin wrote:
> Using copyXuser is better I guess.

It certainly would be faster there, but I don't think it's needed if
that would be the only use case left that justifies supporting two
different models. On small 32bit systems with little RAM kmap won't
perform measurably differently on 32bit or 64bit systems. If the 32bit
host has a lot of RAM it all gets slow anyway at accessing RAM above
the direct mapping, if compared to 64bit host kernels, it's not just
an issue for vhost + mmu notifier + kmap and the best way to optimize
things is to run 64bit host kernels.

Like Christoph pointed out, the main use case for retaining the
copy-user model would be CPUs with virtually indexed not physically
tagged data caches (they'll still suffer from the spectre-v1 fix,
although I rule out that they have to suffer the SMAP
slowdown/feature). Those may require some additional flushing beyond what
the current copy-user model requires.

As a rule of thumb any arch where copy_user_page doesn't define as
copy_page will require some additional cache flushing after the
kmap. Supposedly with vmap, the vmap layer should have taken care of
that (I didn't verify that yet).

There are some accessories like copy_to_user_page()
copy_from_user_page() that could work and obviously defines to raw
memcpy on x86 (the main cons is they don't provide word granular
access) and at least on sparc they're tailored to ptrace assumptions
so then we'd need to evaluate what happens if this is used outside of
ptrace context. kmap has been used generally either to access whole
pages (i.e. copy_user_page), so ptrace may actually be the only use
case with subpage granularity access.

#define copy_to_user_page(vma, page, vaddr, dst, src, len)      \
do {                                                             \
        flush_cache_page(vma, vaddr, page_to_pfn(page));         \
        memcpy(dst, src, len);                                   \
        flush_ptrace_access(vma, page, vaddr, src, len, 0);      \
} while (0)

So I wouldn't rule out the need for a dual model, until we solve how
to run this stable on non-x86 arches with not physically tagged
caches.

Thanks,
Andrea
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] drm/bochs: Fix NULL dereference on atomic_disable helper

2019-03-11 Thread Gerd Hoffmann
  Hi,

> > IIRC the drm code checks for the atomic_enable callback presence to
> > figure out whether it should take the atomic or legacy code paths.
> 
> It should check for drm_driver->mode_config.funcs.atomic_commit for that,
> see drm_drv_uses_atomic_modeset(). Anything else should be a bug.
> 
> Or do you mean the fallback to the old crtc helper prepare/commit
> callbacks?

Probably the latter.  There was some reason why I've left in the empty
bochs_crtc_atomic_enable() callback ...

cheers,
  Gerd

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] drm/bochs: Fix NULL dereference on atomic_disable helper

2019-03-11 Thread Daniel Vetter
On Mon, Mar 11, 2019 at 02:07:16PM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > > >  static void bochs_crtc_atomic_flush(struct drm_crtc *crtc,
> > > > struct drm_crtc_state 
> > > > *old_crtc_state)
> > > >  {
> > > > @@ -66,6 +71,7 @@ static const struct drm_crtc_funcs bochs_crtc_funcs = 
> > > > {
> > > >  static const struct drm_crtc_helper_funcs bochs_helper_funcs = {
> > > > .mode_set_nofb = bochs_crtc_mode_set_nofb,
> > > > .atomic_enable = bochs_crtc_atomic_enable,
> > > > +   .atomic_disable = bochs_crtc_atomic_disable,
> > > 
> > > Shouldn't we make the callback optional instead of adding empty dummy
> > > functions to drivers?
> > 
> > Hi Gerd,
> > 
> > I agree, and I can work on this issue.
> > Just one question, should we make atomic_enable optional as well?
> 
> IIRC the drm code checks for the atomic_enable callback presence to
> figure out whether it should take the atomic or legacy code paths.

It should check for drm_driver->mode_config.funcs.atomic_commit for that,
see drm_drv_uses_atomic_modeset(). Anything else should be a bug.

Or do you mean the fallback to the old crtc helper prepare/commit
callbacks? We'd need to make all of them optional ofc, with atomic_
variants being preferred ofc.
-Daniel


> 
> So, I think that will not work.
> 
> cheers,
>   Gerd
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] drm/bochs: Fix NULL dereference on atomic_disable helper

2019-03-11 Thread Gerd Hoffmann
  Hi,

> > >  static void bochs_crtc_atomic_flush(struct drm_crtc *crtc,
> > >   struct drm_crtc_state *old_crtc_state)
> > >  {
> > > @@ -66,6 +71,7 @@ static const struct drm_crtc_funcs bochs_crtc_funcs = {
> > >  static const struct drm_crtc_helper_funcs bochs_helper_funcs = {
> > >   .mode_set_nofb = bochs_crtc_mode_set_nofb,
> > >   .atomic_enable = bochs_crtc_atomic_enable,
> > > + .atomic_disable = bochs_crtc_atomic_disable,
> > 
> > Shouldn't we make the callback optional instead of adding empty dummy
> > functions to drivers?
> 
> Hi Gerd,
> 
> I agree, and I can work on this issue.
> Just one question, should we make atomic_enable optional as well?

IIRC the drm code checks for the atomic_enable callback presence to
figure out whether it should take the atomic or legacy code paths.

So, I think that will not work.

cheers,
  Gerd

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Michael S. Tsirkin
On Mon, Mar 11, 2019 at 03:40:31PM +0800, Jason Wang wrote:
> 
> On 2019/3/9 上午3:48, Andrea Arcangeli wrote:
> > Hello Jason,
> > 
> > On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:
> > > Just to make sure I understand here. For boosting through huge TLB, do
> > > you mean we can do that in the future (e.g. by mapping more userspace
> > > pages to kernel) or it can be done by this series (only about three 4K
> > > pages were vmapped per virtqueue)?
> > When I answered about the advantages of mmu notifier and I mentioned
> > guaranteed 2m/gigapages where available, I overlooked the detail you
> > were using vmap instead of kmap. So with vmap you're actually doing
> > the opposite, it slows down the access because it will always use a 4k
> > TLB even if QEMU runs on THP or gigapages hugetlbfs.
> > 
> > If there's just one page (or a few pages) in each vmap there's no need
> > of vmap, the linearity vmap provides doesn't pay off in such
> > case.
> > 
> > So likely there's further room for improvement here that you can
> > achieve in the current series by just dropping vmap/vunmap.
> > 
> > You can just use kmap (or kmap_atomic if you're in preemptible
> > section, should work from bh/irq).
> > 
> > In short the mmu notifier to invalidate only sets a "struct page *
> > userringpage" pointer to NULL without calls to vunmap.
> > 
> > In all cases immediately after gup_fast returns you can always call
> > put_page immediately (which explains why I'd like an option to drop
> > FOLL_GET from gup_fast to speed it up).
> > 
> > Then you can check the sequence_counter and inc/dec counter increased
> > by _start/_end. That will tell you if the page you got and you called
> > put_page to immediately unpin it or even to free it, cannot go away
> > under you until the invalidate is called.
> > 
> > If sequence counters and counter tell that gup_fast raced with any
> > mmu notifier invalidate you can just repeat gup_fast. Otherwise you're
> > done, the page cannot go away under you, the host virtual to host
> > physical mapping cannot change either. And the page is not pinned
> > either. So you can just set the "struct page * userringpage = page"
> > where "page" was the one setup by gup_fast.
> > 
> > When later the invalidate runs, you can just call set_page_dirty if
> > gup_fast was called with "write = 1" and then you clear the pointer
> > "userringpage = NULL".
> > 
> > When you need to read/write to the memory
> > kmap/kmap_atomic(userringpage) should work.
> 
> 
> Yes, I've considered kmap() from the start. The reason I don't do that is
> that a large virtqueue may need more than one page, so the VA might not be
> contiguous. But this is probably not a big issue; it just needs more tricks
> in the vhost memory accessors.
> 
> 
> > 
> > In short because there's no hardware involvement here, the established
> > mapping is just the pointer to the page, there is no need of setting
> > up any pagetables or to do any TLB flushes (except on 32bit archs if
> > the page is above the direct mapping but it never happens on 64bit
> > archs).
> 
> 
> I see, I believe we don't care much about the performance of 32bit archs (or
> we can just fallback to copy_to_user() friends).

Using copyXuser is better I guess.

> Using direct mapping (I
> guess kernel will always try hugepage for that?) should be better and we can
> even use it for the data transfer not only for the metadata.
> 
> Thanks

We can't really. The big issue is get user pages. Doing that on data
path will be slower than copyXuser. Or maybe it won't with the
amount of mitigations spread around. Go ahead and try.


> 
> > 
> > Thanks,
> > Andrea
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Jason Wang


On 2019/3/9 上午3:48, Andrea Arcangeli wrote:

Hello Jason,

On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:

Just to make sure I understand here. For boosting through huge TLB, do
you mean we can do that in the future (e.g. by mapping more userspace
pages to kernel) or it can be done by this series (only about three 4K
pages were vmapped per virtqueue)?

When I answered about the advantages of mmu notifier and I mentioned
guaranteed 2m/gigapages where available, I overlooked the detail you
were using vmap instead of kmap. So with vmap you're actually doing
the opposite, it slows down the access because it will always use a 4k
TLB even if QEMU runs on THP or gigapages hugetlbfs.

If there's just one page (or a few pages) in each vmap there's no need
of vmap, the linearity vmap provides doesn't pay off in such
case.

So likely there's further room for improvement here that you can
achieve in the current series by just dropping vmap/vunmap.

You can just use kmap (or kmap_atomic if you're in preemptible
section, should work from bh/irq).

In short the mmu notifier to invalidate only sets a "struct page *
userringpage" pointer to NULL without calls to vunmap.

In all cases immediately after gup_fast returns you can always call
put_page immediately (which explains why I'd like an option to drop
FOLL_GET from gup_fast to speed it up).

Then you can check the sequence_counter and inc/dec counter increased
by _start/_end. That will tell you if the page you got and you called
put_page to immediately unpin it or even to free it, cannot go away
under you until the invalidate is called.

If sequence counters and counter tell that gup_fast raced with any
mmu notifier invalidate you can just repeat gup_fast. Otherwise you're
done, the page cannot go away under you, the host virtual to host
physical mapping cannot change either. And the page is not pinned
either. So you can just set the "struct page * userringpage = page"
where "page" was the one setup by gup_fast.

When later the invalidate runs, you can just call set_page_dirty if
gup_fast was called with "write = 1" and then you clear the pointer
"userringpage = NULL".

When you need to read/write to the memory
kmap/kmap_atomic(userringpage) should work.



Yes, I've considered kmap() from the start. The reason I don't do that
is that a large virtqueue may need more than one page, so the VA might not
be contiguous. But this is probably not a big issue; it just needs more
tricks in the vhost memory accessors.





In short because there's no hardware involvement here, the established
mapping is just the pointer to the page, there is no need of setting
up any pagetables or to do any TLB flushes (except on 32bit archs if
the page is above the direct mapping but it never happens on 64bit
archs).



I see, I believe we don't care much about the performance of 32bit archs 
(or we can just fallback to copy_to_user() friends). Using direct 
mapping (I guess kernel will always try hugepage for that?) should be 
better and we can even use it for the data transfer not only for the 
metadata.


Thanks




Thanks,
Andrea

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Jason Wang


On 2019/3/9 上午3:11, Andrea Arcangeli wrote:

On Fri, Mar 08, 2019 at 05:13:26PM +0800, Jason Wang wrote:

Actually not wrapping around; the pages for the used ring were marked as
dirty after a round of virtqueue processing, when we're sure vhost wrote
something there.

Thanks for the clarification. So we need to convert it to
set_page_dirty and move it to the mmu notifier invalidate but in those
cases where gup_fast was called with write=1 (1 out of 3).

If using ->invalidate_range the page pin also must be removed
immediately after get_user_pages returns (not ok to hold the pin in
vmap until ->invalidate_range is called) to avoid false positive gup
pin checks in things like KSM, or the pin must be released in
invalidate_range_start (which is called before the pin checks).

Here's why:

/*
 * Check that no O_DIRECT or similar I/O is in progress on the
 * page
 */
if (page_mapcount(page) + 1 + swapped != page_count(page)) {
set_pte_at(mm, pvmw.address, pvmw.pte, entry);
goto out_unlock;
}
[..]
set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
  ^^^ too late release the pin here, the
  above already failed

->invalidate_range cannot be used with mutex anyway so you need to go
back with invalidate_range_start/end anyway, just the pin must be
released in _start at the latest in such case.



Yes.




My preference is generally to call gup_fast() followed by an immediate
put_page() because I always want to drop FOLL_GET from gup_fast as a
whole to avoid 2 useless atomic ops per gup_fast.



Ok, will do this (if I still plan to use vmap() in next version).




I'll write more about vmap in answer to the other email.

Thanks,
Andrea



Thanks

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Jason Wang


On 2019/3/8 下午10:58, Jerome Glisse wrote:

On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:

On 2019/3/8 上午3:16, Andrea Arcangeli wrote:

On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:

On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:

On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:

+static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
+   .invalidate_range = vhost_invalidate_range,
+};
+
   void vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
   {

I also wonder here: when page is write protected then
it does not look like .invalidate_range is invoked.

E.g. mm/ksm.c calls

mmu_notifier_invalidate_range_start and
mmu_notifier_invalidate_range_end but not mmu_notifier_invalidate_range.

Similarly, rmap in page_mkclean_one will not call
mmu_notifier_invalidate_range.

If I'm right vhost won't get notified when page is write-protected since you
didn't install start/end notifiers. Note that end notifier can be called
with page locked, so it's not as straight-forward as just adding a call.
Writing into a write-protected page isn't a good idea.

Note that documentation says:
it is fine to delay the mmu_notifier_invalidate_range
call to mmu_notifier_invalidate_range_end() outside the page table lock.
implying it's called just later.

OK I missed the fact that _end actually calls
mmu_notifier_invalidate_range internally. So that part is fine but the
fact that you are trying to take page lock under VQ mutex and take same
mutex within notifier probably means it's broken for ksm and rmap at
least since these call invalidate with lock taken.

Yes this lock inversion needs more thoughts.


And generally, Andrea told me offline one can not take mutex under
the notifier callback. I CC'd Andrea for why.

Yes, the problem then is that ->invalidate_page is called under PT
lock, so it cannot take a mutex; you also cannot take the page_lock, it
can at most take a spinlock or trylock_page.

So it must switch back to the _start/_end methods unless you rewrite
the locking.

The difference with _start/_end, is that ->invalidate_range avoids the
_start callback basically, but to avoid the _start callback safely, it
has to be called in between the ptep_clear_flush and the set_pte_at
whenever the pfn changes like during a COW. So it cannot be coalesced
in a single TLB flush that invalidates all sptes in a range like we
prefer for performance reasons for example in KVM. It also cannot
sleep.

In short ->invalidate_range must be really fast (it shouldn't require
to send IPI to all other CPUs like KVM may require during an
invalidate_range_start) and it must not sleep, in order to prefer it
to _start/_end.

I.e. the invalidate of the secondary MMU that walks the linux
pagetables in hardware (in vhost case with GUP in software) has to
happen while the linux pagetable is zero, otherwise a concurrent
hardware pagetable lookup could re-instantiate a mapping to the old
page in between the set_pte_at and the invalidate_range_end (which
internally calls ->invalidate_range). Jerome documented it nicely in
Documentation/vm/mmu_notifier.rst .


Right, I've actually gone through this several times but some details were
missed by me obviously.



Now you don't really walk the pagetable in hardware in vhost, but if
you use gup_fast after use_mm() it's similar.

For vhost the invalidate would be really fast, there are no IPI to
deliver at all, the problem is just the mutex.


Yes. A possible solution is to introduce a valid flag for the VA. Vhost may only
try to access the kernel VA when it is valid. Invalidate_range_start() will
clear this flag under the protection of the vq mutex when it can block. Then
invalidate_range_end() can set the flag again. An issue is that blockable is
always false for range_end().


Note that there can be multiple asynchronous concurrent invalidate_range
callbacks. So a flag does not work, but a counter of the number of active
invalidations would. See how KVM is doing it for instance in kvm_main.c

The pattern for this kind of thing is:
 my_invalidate_range_start(start,end) {
 ...
 if (mystruct_overlap(mystruct, start, end)) {
 mystruct_lock();
 mystruct->invalidate_count++;
 ...
 mystruct_unlock();
 }
 }

 my_invalidate_range_end(start,end) {
 ...
 if (mystruct_overlap(mystruct, start, end)) {
 mystruct_lock();
 mystruct->invalidate_count--;
 ...
 mystruct_unlock();
 }
 }

 my_access_va(mystruct) {
 again:
 wait_on(!mystruct->invalidate_count)
 mystruct_lock();
 if (mystruct->invalidate_count) {
 mystruct_unlock();
 goto again;
 }
 GUP();
 ...
 mystruct_unlock();
 }

Cheers,
Jérôme



Yes, this should work.

Thanks
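
For the wait_on() step above, one concrete shape (hypothetical wq field and
helpers, and assuming my_invalidate_range_end() does a wake_up() on that wait
queue after decrementing the counter):

static int my_access_va(struct mystruct *mystruct)
{
again:
        /* sleep until no invalidation is in flight */
        wait_event(mystruct->wq, READ_ONCE(mystruct->invalidate_count) == 0);
        mystruct_lock();
        if (mystruct->invalidate_count) {
                /* a new invalidation started meanwhile, wait again */
                mystruct_unlock();
                goto again;
        }
        /* GUP() / access the ring here, still under the lock */
        mystruct_unlock();
        return 0;
}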


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-11 Thread Jason Wang


On 2019/3/8 下午10:12, Christoph Hellwig wrote:

On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:

This series tries to access virtqueue metadata through kernel virtual
addresses instead of the copy_user() friends, since those have too much
overhead: checks, spec barriers or even hardware feature
toggling. This is done by setting up a kernel address through vmap() and
registering an MMU notifier for invalidation.

Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
obvious improvement.

How is this going to work for CPUs with virtually tagged caches?



Anything different that you worry about? I can run a test, but do you know any
archs that use virtually tagged caches?


Thanks

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization