Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-12 Thread Dongli Zhang



On 3/13/19 1:33 AM, Cornelia Huck wrote:
> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
> Dongli Zhang  wrote:
> 
>> I observed that there is one msix vector for config and one shared vector
>> for all queues in below qemu cmdline, when the num-queues for virtio-blk
>> is more than the number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"
>>
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ...
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  0  0  0 59   PCI-MSI 65537-edge  
>> virtio0-virtqueues
>> ... ...
>>
>>
>> However, when num-queues is the same as number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
>>
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ... 
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  2  0  0  0   PCI-MSI 65537-edge  
>> virtio0-req.0
>>  26:  0 35  0  0   PCI-MSI 65538-edge  
>> virtio0-req.1
>>  27:  0  0 32  0   PCI-MSI 65539-edge  
>> virtio0-req.2
>>  28:  0  0  0  0   PCI-MSI 65540-edge  
>> virtio0-req.3
>> ... ...
>>
>> In above case, there is one msix vector per queue.
> 
> Please note that this is pci-specific...
> 
>>
>>
>> This is because the max number of queues is not limited by the number of
>> possible cpus.
>>
>> By default, nvme (regardless of write_queues and poll_queues) and
>> xen-blkfront limit the number of queues with num_possible_cpus().
> 
> ...and these are probably pci-specific as well.

Not pci-specific, but per-cpu as well.

> 
>>
>>
>> Is this by design on purpose, or can we fix with below?
>>
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 4bc083b..df95ce3 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>  if (err)
>>  num_vqs = 1;
>>  
>> +num_vqs = min(num_possible_cpus(), num_vqs);
>> +
>>  vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>>  if (!vblk->vqs)
>>  return -ENOMEM;
> 
> virtio-blk, however, is not pci-specific.
> 
> If we are using the ccw transport on s390, a completely different
> interrupt mechanism is in use ('floating' interrupts, which are not
> per-cpu). A check like that should therefore not go into the generic
> driver.
> 

So far there seem to be two options.

The first option is to ask the qemu user to always set "num-queues" to the
number of vcpus when running an x86 guest with virtio-blk or virtio-scsi over
pci, so that each queue gets its own vector.

Or, is it fine for virtio folks to add a new hook to 'struct virtio_config_ops'
so that different platforms (e.g., pci or ccw) would use different ways to limit
the max number of queues in the guest, with something like below?

So far both virtio-scsi and virtio-blk would benefit from the new hook.

---
 drivers/block/virtio_blk.c |  2 ++
 drivers/virtio/virtio_pci_common.c |  6 ++
 drivers/virtio/virtio_pci_common.h |  2 ++
 drivers/virtio/virtio_pci_legacy.c |  1 +
 drivers/virtio/virtio_pci_modern.c |  2 ++
 include/linux/virtio_config.h  | 11 +++
 6 files changed, 24 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..93cfeda 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
if (err)
num_vqs = 1;

+   num_vqs = virtio_calc_num_vqs(vdev, num_vqs);
+
vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
if (!vblk->vqs)
return -ENOMEM;
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index d0584c0..ce021d1 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -409,6 +409,12 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned nvqs,
return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, ctx);
 }

+/* the config->calc_num_vqs() implementation */
+unsigned short vp_calc_num_vqs(unsigned short num_vqs)
+{
+   return min_t(unsigned short, num_possible_cpus(), num_vqs);
+}
+
 const char *vp_bus_name(struct virtio_device *vdev)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index 0227100..cc5ac80 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -134,6 +134,8 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 
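
To make the idea concrete, here is a rough sketch of how the rest of the
wiring could look. The calc_num_vqs naming follows the hunks above; the
remaining hunks of the posted diff are truncated here, so everything below is
an illustrative assumption rather than the actual patch:

/* include/linux/virtio_config.h (sketch) */
struct virtio_config_ops {
	/* ... existing ops ... */
	/* optional: let the transport clamp the number of virtqueues */
	unsigned short (*calc_num_vqs)(unsigned short num_vqs);
};

static inline unsigned short virtio_calc_num_vqs(struct virtio_device *vdev,
						 unsigned short num_vqs)
{
	/* Transports without per-cpu vectors (e.g. ccw) provide no hook
	 * and keep the driver-requested value unchanged. */
	if (vdev->config->calc_num_vqs)
		return vdev->config->calc_num_vqs(num_vqs);
	return num_vqs;
}

/* drivers/virtio/virtio_pci_{legacy,modern}.c (sketch) */
static const struct virtio_config_ops virtio_pci_config_ops = {
	/* ... existing ops ... */
	.calc_num_vqs	= vp_calc_num_vqs,
};

With something like that in place, init_vq() in virtio-blk (and virtio-scsi)
could call virtio_calc_num_vqs() as in the first hunk, and only the pci
transport would clamp num_vqs to num_possible_cpus().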

Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 03:02:54PM -0700, James Bottomley wrote:
> I'm sure there must be workarounds elsewhere in the other arch code
> otherwise things like this, which appear all over drivers/, wouldn't
> work:
> 
> drivers/scsi/isci/request.c:1430
> 
>   kaddr = kmap_atomic(page);
>   memcpy(kaddr + sg->offset, src_addr, copy_len);
>   kunmap_atomic(kaddr);
> 

Are you sure "page" is an userland page with an alias address?

sg->page_link = (unsigned long)virt_to_page(addr);

page_link seems to point to kernel memory.

On arm 32bit I found what looks like a parisc-style solution:

void __kunmap_atomic(void *kvaddr)
{
unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
int idx, type;

if (kvaddr >= (void *)FIXADDR_START) {
type = kmap_atomic_idx();
idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id();

if (cache_is_vivt())
__cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE);

However, on arm 64bit kunmap_atomic is not implemented at all, and other
32bit implementations don't do it either; sparc, for example, only seems to
do the cache flush if the kernel is built with CONFIG_DEBUG_HIGHMEM
(which makes the flushing conditional on a debug option).

The kunmap_atomic where fixmap is used is also flushing the tlb lazily, so
even on 32bit you can't be sure there was a tlb flush for each single page
you unmapped. It's therefore hard to see how the above can be safe if "page"
had been a userland page mapped with an aliased CPU cache.

> the sequence dirties the kernel virtual address but doesn't flush
> before doing kunmap.  There are hundreds of other examples which is why
> I think adding flush_kernel_dcache_page() is an already lost cause.

In lots of cases kmap is needed just to modify kernel memory, not userland
memory (for which get/put_user is more commonly used instead); there's no
cache aliasing in such a case.

> Actually copy_user_page() is unused in the main kernel.  The big
> problem is copy_user_highpage() but that's mostly highly optimised by
> the VIPT architectures (in other words you can fiddle with kmap without
> impacting it).

copy_user_page is not unused, it's called precisely by
copy_user_highpage, which is why the cache flushes are done inside
copy_user_page.

static inline void copy_user_highpage(struct page *to, struct page *from,
unsigned long vaddr, struct vm_area_struct *vma)
{
char *vfrom, *vto;

vfrom = kmap_atomic(from);
vto = kmap_atomic(to);
copy_user_page(vto, vfrom, vaddr, to);
kunmap_atomic(vto);
kunmap_atomic(vfrom);
}


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 02:19:15PM -0700, James Bottomley wrote:
> I mean in the sequence
> 
> flush_dcache_page(page);
> flush_dcache_page(page);
> 
> The first flush_dcache_page did all the work and the second is a
> tightly pipelined no-op.  That's what I mean by there not really being
> a double hit.

Ok I wasn't sure it was clear there was a double (profiling) hit on
that function.

void flush_kernel_dcache_page_addr(void *addr)
{
unsigned long flags;

flush_kernel_dcache_page_asm(addr);
purge_tlb_start(flags);
pdtlb_kernel(addr);
purge_tlb_end(flags);
}

#define purge_tlb_start(flags)  spin_lock_irqsave(&pa_tlb_lock, flags)
#define purge_tlb_end(flags)    spin_unlock_irqrestore(&pa_tlb_lock, flags)

You got a system-wide spinlock in there that won't just go away the
second time. So it's a bit more than a tightly pipelined "noop".

Your logic of adding the flush on kunmap makes sense; all I'm saying
is that it's sacrificing some performance for safety. You asked
"optimized what": I meant optimizing away all the above quoted code,
which will end up running twice for each vhost set_bit when it should
run just once, like on other archs. And it clearly paid off until now
(until now it ran just once, and it was the only safe one).

Before we can leverage your idea to flush the dcache on kunmap in
common code without having to sacrifice performance in arch code, we'd
need to change all other archs to add the cache flushes on kunmap too,
and then remove the cache flushes from the other places like copy_page,
or we'd waste CPU. Then you'd have the best of both worlds: no double
flush, and kunmap would be enough.

Thanks,
Andrea


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> I've got to say: optimize what?  What code do we ever have in the
> kernel that kmap's a page and then doesn't do anything with it? You can
> guarantee that on kunmap the page is either referenced (needs
> invalidating) or updated (needs flushing). The in-kernel use of kmap is
> always
> 
> kmap
> do something with the mapped page
> kunmap
> 
> In a very short interval.  It seems just a simplification to make
> kunmap do the flush if needed rather than try to have the users
> remember.  The thing which makes this really simple is that on most
> architectures flush and invalidate is the same operation.  If you
> really want to optimize you can use the referenced and dirty bits on
> the kmapped pte to tell you what operation to do, but if your flush is
> your invalidate, you simply assume the data needs flushing on kunmap
> without checking anything.

Except other archs like arm64 and sparc do the cache flushing on
copy_to_user_page and copy_user_page, not on kunmap.

#define copy_user_page(to,from,vaddr,pg) __cpu_copy_user_page(to, from, vaddr)
void __cpu_copy_user_page(void *kto, const void *kfrom, unsigned long vaddr)
{
struct page *page = virt_to_page(kto);
copy_page(kto, kfrom);
flush_dcache_page(page);
}
#define copy_user_page(to, from, vaddr, page)   \
do {copy_page(to, from);\
sparc_flush_page_to_ram(page);  \
} while (0)

And they do nothing on kunmap:

static inline void kunmap(struct page *page)
{
BUG_ON(in_interrupt());
if (!PageHighMem(page))
return;
kunmap_high(page);
}
void kunmap_high(struct page *page)
{
unsigned long vaddr;
unsigned long nr;
unsigned long flags;
int need_wakeup;
unsigned int color = get_pkmap_color(page);
wait_queue_head_t *pkmap_map_wait;

lock_kmap_any(flags);
vaddr = (unsigned long)page_address(page);
BUG_ON(!vaddr);
nr = PKMAP_NR(vaddr);

/*
 * A count must never go down to zero
 * without a TLB flush!
 */
need_wakeup = 0;
switch (--pkmap_count[nr]) {
case 0:
BUG();
case 1:
/*
 * Avoid an unnecessary wake_up() function call.
 * The common case is pkmap_count[] == 1, but
 * no waiters.
 * The tasks queued in the wait-queue are guarded
 * by both the lock in the wait-queue-head and by
 * the kmap_lock.  As the kmap_lock is held here,
 * no need for the wait-queue-head's lock.  Simply
 * test if the queue is empty.
 */
pkmap_map_wait = get_pkmap_wait_queue_head(color);
need_wakeup = waitqueue_active(pkmap_map_wait);
}
unlock_kmap_any(flags);

/* do wake-up, if needed, race-free outside of the spin lock */
if (need_wakeup)
wake_up(pkmap_map_wait);
}
static inline void kunmap(struct page *page)
{
}

because they already did it just above.


> > Which means after we fix vhost to add the flush_dcache_page after
> > kunmap, Parisc will get a double hit (but it also means Parisc was
> > the only one of those archs that needed explicit cache flushes, where
> > vhost worked correctly so far.. so it kind of proves your point that
> > giving up is the safe choice).
> 
> What double hit?  If there's no cache to flush then cache flush is a
> no-op.  It's also a highly piplineable no-op because the CPU has the L1
> cache within easy reach.  The only event when flush takes a large
> amount time is if we actually have dirty data to write back to main
> memory.

The double hit is in parisc copy_to_user_page:

#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
do { \
flush_cache_page(vma, vaddr, page_to_pfn(page)); \
memcpy(dst, src, len); \
flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned long)dst + len); \
} while (0)

That is executed just before kunmap:

static inline void kunmap(struct page *page)
{
flush_kernel_dcache_page_addr(page_address(page));
}

Can't argue with the fact that your "safer" kunmap is safer, but we cannot
rely on it in common code unless we remove some optimizations from the
common code abstractions and make all archs do kunmap like parisc.

Thanks,
Andrea


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 08:46:50AM -0700, James Bottomley wrote:
> On Tue, 2019-03-12 at 07:54 -0400, Michael S. Tsirkin wrote:
> > On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> > > 
> > > > On 2019/3/12 11:52 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> [...]
> > > At least for -stable, we need the flush?
> > > 
> > > 
> > > > Three atomic ops per bit is way too expensive.
> > > 
> > > 
> > > Yes.
> > > 
> > > Thanks
> > 
> > See James's reply - I stand corrected we do kunmap so no need to
> > flush.
> 
> Well, I said that's what we do on Parisc.  The cachetlb document
> definitely says if you alter the data between kmap and kunmap you are
> responsible for the flush.  It's just that flush_dcache_page() is a no-
> op on x86 so they never remember to add it and since it will crash
> parisc if you get it wrong we finally gave up trying to make them.
> 
> But that's the point: it is a no-op on your favourite architecture so
> it costs you nothing to add it.

Yes, the fact that Parisc gave up and is doing it on kunmap is a reasonable
approach for Parisc, but it doesn't move the needle as far as vhost
common code is concerned, because other archs don't flush any cache on
kunmap.

So either all other archs give up trying to optimize, or vhost still
has to call flush_dcache_page() after kunmap.
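
As a minimal sketch (not the posted fix; the exact placement is an
assumption), "call flush_dcache_page() after kunmap" in vhost's
set_bit_to_user() would look roughly like this:

static int set_bit_to_user(int nr, void __user *addr)
{
	unsigned long log = (unsigned long)addr;
	struct page *page;
	void *base;
	int bit = nr + (log % PAGE_SIZE) * 8;
	int r;

	r = get_user_pages_fast(log, 1, 1, &page);
	if (r < 0)
		return r;
	BUG_ON(r != 1);
	base = kmap_atomic(page);
	set_bit(bit, base);
	kunmap_atomic(base);
	/* Push the update out of the kernel alias so the userland alias
	 * sees it on archs with virtually tagged caches; no-op on x86. */
	flush_dcache_page(page);
	set_page_dirty_lock(page);
	put_page(page);
	return 0;
}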

Which means after we fix vhost to add the flush_dcache_page after
kunmap, Parisc will get a double hit (but it also means Parisc was the
only one of those archs that needed explicit cache flushes, where vhost
worked correctly so far.. so it kind of proves your point that giving
up is the safe choice).

Thanks,
Andrea


Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-12 Thread Cornelia Huck
On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
Dongli Zhang  wrote:

> I observed that there is one msix vector for config and one shared vector
> for all queues in below qemu cmdline, when the num-queues for virtio-blk
> is more than the number of possible cpus:
> 
> qemu: "-smp 4" while "-device 
> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"
> 
> # cat /proc/interrupts 
>CPU0   CPU1   CPU2   CPU3
> ... ...
>  24:  0  0  0  0   PCI-MSI 65536-edge  
> virtio0-config
>  25:  0  0  0 59   PCI-MSI 65537-edge  
> virtio0-virtqueues
> ... ...
> 
> 
> However, when num-queues is the same as number of possible cpus:
> 
> qemu: "-smp 4" while "-device 
> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
> 
> # cat /proc/interrupts 
>CPU0   CPU1   CPU2   CPU3
> ... ... 
>  24:  0  0  0  0   PCI-MSI 65536-edge  
> virtio0-config
>  25:  2  0  0  0   PCI-MSI 65537-edge  
> virtio0-req.0
>  26:  0 35  0  0   PCI-MSI 65538-edge  
> virtio0-req.1
>  27:  0  0 32  0   PCI-MSI 65539-edge  
> virtio0-req.2
>  28:  0  0  0  0   PCI-MSI 65540-edge  
> virtio0-req.3
> ... ...
> 
> In above case, there is one msix vector per queue.

Please note that this is pci-specific...

> 
> 
> This is because the max number of queues is not limited by the number of
> possible cpus.
> 
> By default, nvme (regardless of write_queues and poll_queues) and
> xen-blkfront limit the number of queues with num_possible_cpus().

...and these are probably pci-specific as well.

> 
> 
> Is this by design on purpose, or can we fix with below?
> 
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 4bc083b..df95ce3 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>   if (err)
>   num_vqs = 1;
>  
> + num_vqs = min(num_possible_cpus(), num_vqs);
> +
>   vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>   if (!vblk->vqs)
>   return -ENOMEM;

virtio-blk, however, is not pci-specific.

If we are using the ccw transport on s390, a completely different
interrupt mechanism is in use ('floating' interrupts, which are not
per-cpu). A check like that should therefore not go into the generic
driver.


virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-12 Thread Dongli Zhang
I observed that there is one msix vector for config and one shared vector
for all queues with the below qemu cmdline, when num-queues for virtio-blk
is more than the number of possible cpus:

qemu: "-smp 4" while "-device 
virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"

# cat /proc/interrupts 
   CPU0   CPU1   CPU2   CPU3
... ...
 24:  0  0  0  0   PCI-MSI 65536-edge  
virtio0-config
 25:  0  0  0 59   PCI-MSI 65537-edge  
virtio0-virtqueues
... ...


However, when num-queues is the same as number of possible cpus:

qemu: "-smp 4" while "-device 
virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"

# cat /proc/interrupts 
   CPU0   CPU1   CPU2   CPU3
... ... 
 24:  0  0  0  0   PCI-MSI 65536-edge  
virtio0-config
 25:  2  0  0  0   PCI-MSI 65537-edge  
virtio0-req.0
 26:  0 35  0  0   PCI-MSI 65538-edge  
virtio0-req.1
 27:  0  0 32  0   PCI-MSI 65539-edge  
virtio0-req.2
 28:  0  0  0  0   PCI-MSI 65540-edge  
virtio0-req.3
... ...

In the above case, there is one msix vector per queue.


This is because the max number of queues is not limited by the number of
possible cpus.

By default, nvme (regardless of write_queues and poll_queues) and
xen-blkfront limit the number of queues with num_possible_cpus().


Is this by design, or can we fix it with the change below?


diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..df95ce3 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
if (err)
num_vqs = 1;
 
+   num_vqs = min(num_possible_cpus(), num_vqs);
+
vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
if (!vblk->vqs)
return -ENOMEM;
--


PS: The same issue is applicable to virtio-scsi as well.

Thank you very much!

Dongli Zhang


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Michael S. Tsirkin
On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> 
> On 2019/3/12 11:52 AM, Michael S. Tsirkin wrote:
> > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> > > On 2019/3/12 2:14 AM, David Miller wrote:
> > > > From: "Michael S. Tsirkin" 
> > > > Date: Mon, 11 Mar 2019 09:59:28 -0400
> > > > 
> > > > > On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:
> > > > > > > On 2019/3/8 10:12 PM, Christoph Hellwig wrote:
> > > > > > > On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:
> > > > > > > > This series tries to access virtqueue metadata through kernel 
> > > > > > > > virtual
> > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > toggling. This is done through setup kernel address through 
> > > > > > > > vmap() and
> > > > > > > > register MMU notifier for invalidation.
> > > > > > > > 
> > > > > > > > Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't 
> > > > > > > > see
> > > > > > > > obvious improvement.
> > > > > > > How is this going to work for CPUs with virtually tagged caches?
> > > > > > Anything different that you worry?
> > > > > If caches have virtual tags then kernel and userspace view of memory
> > > > > might not be automatically in sync if they access memory
> > > > > through different virtual addresses. You need to do things like
> > > > > flush_cache_page, probably multiple times.
> > > > "flush_dcache_page()"
> > > 
> > > I get this. Then I think the current set_bit_to_user() is suspicious, we
> > > probably miss a flush_dcache_page() there:
> > > 
> > > 
> > > static int set_bit_to_user(int nr, void __user *addr)
> > > {
> > >      unsigned long log = (unsigned long)addr;
> > >      struct page *page;
> > >      void *base;
> > >      int bit = nr + (log % PAGE_SIZE) * 8;
> > >      int r;
> > > 
> > >      r = get_user_pages_fast(log, 1, 1, &page);
> > >      if (r < 0)
> > >      return r;
> > >      BUG_ON(r != 1);
> > >      base = kmap_atomic(page);
> > >      set_bit(bit, base);
> > >      kunmap_atomic(base);
> > >      set_page_dirty_lock(page);
> > >      put_page(page);
> > >      return 0;
> > > }
> > > 
> > > Thanks
> > I think you are right. The correct fix though is to re-implement
> > it using asm and handling pagefault, not gup.
> 
> 
> I agree but it needs to introduce new helpers in asm  for all archs which is
> not trivial.

We can have a generic implementation using kmap.

> At least for -stable, we need the flush?
> 
> 
> > > Three atomic ops per bit is way too expensive.
> 
> 
> Yes.
> 
> Thanks

See James's reply - I stand corrected: we do kunmap, so no need to flush.

-- 
MST

Re: [PATCH] virtio_ring: Fix potential mem leak in virtqueue_add_indirect_packed

2019-03-12 Thread Jason Wang


On 2019/3/12 3:06 PM, Yue Haibing wrote:

From: YueHaibing 

'desc' should be freed before leaving the error handling path.

Fixes: 1ce9e6055fa0 ("virtio_ring: introduce packed ring support")
Signed-off-by: YueHaibing 
---
  drivers/virtio/virtio_ring.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index a0b07c3..9d95d9c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -991,6 +991,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
 
 	if (unlikely(vq->vq.num_free < 1)) {
 		pr_debug("Can't add buf len 1 - avail = 0\n");
+		kfree(desc);
 		END_USE(vq);
 		return -ENOSPC;
 	}



Or you can move the check before the allocation.

Acked-by: Jason Wang 

Please cc stable next time.

Thanks
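
For illustration, the alternative mentioned above (checking num_free before
allocating the indirect descriptors) could look roughly like the sketch
below; this is only a sketch of the idea with the surrounding code
abbreviated, not the patch that was applied, and the elided parameters are
assumptions:

static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq, ...)
{
	struct vring_packed_desc *desc;

	/* Checking for a free slot first means there is nothing to undo
	 * on this error path, so no kfree() is needed here. */
	if (unlikely(vq->vq.num_free < 1)) {
		pr_debug("Can't add buf len 1 - avail = 0\n");
		END_USE(vq);
		return -ENOSPC;
	}

	desc = alloc_indirect_packed(total_sg, gfp);
	if (!desc)
		return -ENOMEM;

	/* ... map the scatterlist and fill the indirect descriptors ... */
}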


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Jason Wang


On 2019/3/12 3:51 PM, Jason Wang wrote:


On 2019/3/12 1:14 PM, James Bottomley wrote:

On Tue, 2019-03-12 at 10:59 +0800, Jason Wang wrote:

On 2019/3/12 2:14 AM, David Miller wrote:

From: "Michael S. Tsirkin" 
Date: Mon, 11 Mar 2019 09:59:28 -0400


On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:

On 2019/3/8 10:12 PM, Christoph Hellwig wrote:

On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:

This series tries to access virtqueue metadata through
kernel virtual
address instead of copy_user() friends since they had too
much
overheads like checks, spec barriers or even hardware
feature
toggling. This is done through setup kernel address through
vmap() and
register MMU notifier for invalidation.

Test shows about 24% improvement on TX PPS. TCP_STREAM
doesn't see
obvious improvement.

How is this going to work for CPUs with virtually tagged
caches?

Anything different that you worry?

If caches have virtual tags then kernel and userspace view of
memory
might not be automatically in sync if they access memory
through different virtual addresses. You need to do things like
flush_cache_page, probably multiple times.

"flush_dcache_page()"


I get this. Then I think the current set_bit_to_user() is suspicious,
we
probably miss a flush_dcache_page() there:


static int set_bit_to_user(int nr, void __user *addr)
{
  unsigned long log = (unsigned long)addr;
  struct page *page;
  void *base;
  int bit = nr + (log % PAGE_SIZE) * 8;
  int r;

  r = get_user_pages_fast(log, 1, 1, &page);
  if (r < 0)
  return r;
  BUG_ON(r != 1);
  base = kmap_atomic(page);
  set_bit(bit, base);
  kunmap_atomic(base);

This sequence should be OK.  get_user_pages() contains a flush which
clears the cache above the user virtual address, so on kmap, the page
is coherent at the new alias.  On parisc at least, kunmap embodies a
flush_dcache_page() which pushes any changes in the cache above the
kernel virtual address back to main memory and makes it coherent again
for the user alias to pick it up.



It would be good if kmap()/kunmap() can do this but looks like we can 
not assume this? For example, sparc's flush_dcache_page() 



Sorry, I meant kunmap_atomic().

Thanks


doesn't do flush_dcache_page(). And bio_copy_data_iter() do 
flush_dcache_page() after kunmap_atomic().


Thanks




James





Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Jason Wang


On 2019/3/12 1:14 PM, James Bottomley wrote:

On Tue, 2019-03-12 at 10:59 +0800, Jason Wang wrote:

On 2019/3/12 2:14 AM, David Miller wrote:

From: "Michael S. Tsirkin" 
Date: Mon, 11 Mar 2019 09:59:28 -0400


On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:

On 2019/3/8 10:12 PM, Christoph Hellwig wrote:

On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:

This series tries to access virtqueue metadata through
kernel virtual
address instead of copy_user() friends since they had too
much
overheads like checks, spec barriers or even hardware
feature
toggling. This is done through setup kernel address through
vmap() and
register MMU notifier for invalidation.

Test shows about 24% improvement on TX PPS. TCP_STREAM
doesn't see
obvious improvement.

How is this going to work for CPUs with virtually tagged
caches?

Anything different that you worry?

If caches have virtual tags then kernel and userspace view of
memory
might not be automatically in sync if they access memory
through different virtual addresses. You need to do things like
flush_cache_page, probably multiple times.

"flush_dcache_page()"


I get this. Then I think the current set_bit_to_user() is suspicious,
we
probably miss a flush_dcache_page() there:


static int set_bit_to_user(int nr, void __user *addr)
{
  unsigned long log = (unsigned long)addr;
  struct page *page;
  void *base;
  int bit = nr + (log % PAGE_SIZE) * 8;
  int r;

  r = get_user_pages_fast(log, 1, 1, &page);
  if (r < 0)
  return r;
  BUG_ON(r != 1);
  base = kmap_atomic(page);
  set_bit(bit, base);
  kunmap_atomic(base);

This sequence should be OK.  get_user_pages() contains a flush which
clears the cache above the user virtual address, so on kmap, the page
is coherent at the new alias.  On parisc at least, kunmap embodies a
flush_dcache_page() which pushes any changes in the cache above the
kernel virtual address back to main memory and makes it coherent again
for the user alias to pick it up.



It would be good if kmap()/kunmap() can do this but looks like we can 
not assume this? For example, sparc's flush_dcache_page() doesn't do 
flush_dcache_page(). And bio_copy_data_iter() do flush_dcache_page() 
after kunmap_atomic().


Thanks




James



Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Jason Wang


On 2019/3/12 11:52 AM, Michael S. Tsirkin wrote:

On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:

On 2019/3/12 2:14 AM, David Miller wrote:

From: "Michael S. Tsirkin" 
Date: Mon, 11 Mar 2019 09:59:28 -0400


On Mon, Mar 11, 2019 at 03:13:17PM +0800, Jason Wang wrote:

On 2019/3/8 10:12 PM, Christoph Hellwig wrote:

On Wed, Mar 06, 2019 at 02:18:07AM -0500, Jason Wang wrote:

This series tries to access virtqueue metadata through kernel virtual
address instead of copy_user() friends since they had too much
overheads like checks, spec barriers or even hardware feature
toggling. This is done through setup kernel address through vmap() and
register MMU notifier for invalidation.

Test shows about 24% improvement on TX PPS. TCP_STREAM doesn't see
obvious improvement.

How is this going to work for CPUs with virtually tagged caches?

Anything different that you worry?

If caches have virtual tags then kernel and userspace view of memory
might not be automatically in sync if they access memory
through different virtual addresses. You need to do things like
flush_cache_page, probably multiple times.

"flush_dcache_page()"


I get this. Then I think the current set_bit_to_user() is suspicious, we
probably miss a flush_dcache_page() there:


static int set_bit_to_user(int nr, void __user *addr)
{
     unsigned long log = (unsigned long)addr;
     struct page *page;
     void *base;
     int bit = nr + (log % PAGE_SIZE) * 8;
     int r;

     r = get_user_pages_fast(log, 1, 1, &page);
     if (r < 0)
     return r;
     BUG_ON(r != 1);
     base = kmap_atomic(page);
     set_bit(bit, base);
     kunmap_atomic(base);
     set_page_dirty_lock(page);
     put_page(page);
     return 0;
}

Thanks

I think you are right. The correct fix though is to re-implement
it using asm and handling pagefault, not gup.



I agree, but it needs to introduce new helpers in asm for all archs,
which is not trivial. At least for -stable, we need the flush?




Three atomic ops per bit is way too expensive.



Yes.

Thanks


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-12 Thread Jason Wang


On 2019/3/12 11:50 AM, Michael S. Tsirkin wrote:

Using direct mapping (I
guess kernel will always try hugepage for that?) should be better and we can
even use it for the data transfer not only for the metadata.

Thanks

We can't really. The big issue is get user pages. Doing that on data
path will be slower than copyXuser.

I meant if we can find a way to avoid doing gup in the datapath. E.g. vhost
maintains a range tree and adds or removes ranges through the MMU notifier.
Then in the datapath, if we find the range, we use the direct mapping,
otherwise fall back to copy_to_user().

Thanks

We can try. But I'm not sure there's any reason to think there's any
locality there.



Ok, but what kind of locality do you mean here?

Thanks
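
As a purely conceptual sketch of the range-tree idea above (all names here,
such as vhost_vmap_range and vhost_find_range, are hypothetical and not from
the posted series): ranges are vmap()ed ahead of time and kept in a tree that
the MMU notifier invalidates, and the datapath uses the direct mapping on a
hit and falls back to copy_to_user() otherwise.

struct vhost_vmap_range {
	struct rb_node node;
	unsigned long uaddr_start;	/* userspace range covered */
	unsigned long uaddr_end;
	void *kaddr;			/* vmap()ed kernel alias */
};

static int vhost_put_user(struct vhost_virtqueue *vq, void *src,
			  void __user *uaddr, size_t len)
{
	/* hypothetical lookup into the notifier-maintained tree */
	struct vhost_vmap_range *r =
		vhost_find_range(vq, (unsigned long)uaddr, len);

	if (r) {
		/* hit: no gup and no uaccess on the datapath */
		memcpy(r->kaddr + ((unsigned long)uaddr - r->uaddr_start),
		       src, len);
		return 0;
	}

	/* miss (or range invalidated by the MMU notifier): slow path */
	return copy_to_user(uaddr, src, len) ? -EFAULT : 0;
}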
