Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

2023-10-19 Thread Dongli Zhang
Hi Vitaly, Sean and David,

On 10/19/23 08:40, Sean Christopherson wrote:
> On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
>> Dongli Zhang  writes:
>>
>>> As mentioned in the linux kernel development document, "sched_clock() is
>>> used for scheduling and timestamping". While there is a default native
>>> implementation, many paravirtualizations have their own implementations.
>>>
>>> About KVM, it uses kvm_sched_clock_read() and there is no way to only
>>> disable KVM's sched_clock. The "no-kvmclock" may disable all
>>> paravirtualized kvmclock features.
> 
> ...
> 
>>> Please suggest and comment if other options are better:
>>>
>>> 1. Global param (this RFC patch).
>>>
>>> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>>>
>>> Indeed I like the 2nd method.
>>>
>>> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>>>
>>> 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
>>> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
>>> want to keep the pv sched_clock.
>>
>> Normally, it should be up to the hypervisor to tell the guest which
>> clock to use, i.e. if TSC is reliable or not. Let me put my question
>> this way: if TSC on the particular host is good for everything, why
>> does the hypervisor advertises 'kvmclock' to its guests?
> 
> I suspect there are two reasons.
> 
>   1. As is likely the case in our fleet, no one revisited the set of advertised
>  PV features when defining the VM shapes for a new generation of hardware,
>  or whoever did the reviews wasn't aware that advertising kvmclock is
>  actually suboptimal.  All the PV clock stuff in KVM is quite labyrinthian,
>  so it's not hard to imagine it getting overlooked.
> 
>   2. Legacy VMs.  If VMs have been running with a PV clock for years, forcing
>  them to switch to a new clocksource is high-risk, low-reward.
> 
>> If for some 'historical reasons' we can't revoke features we can always
>> introduce a new PV feature bit saying that TSC is preferred.
>>
>> 1) Global param doesn't sound like a good idea to me: chances are that
>> people will be setting it on their guest images to workaround problems
>> on one hypervisor (or, rather, on one public cloud which is too lazy to
>> fix their hypervisor) while simultaneously creating problems on another.
>>
>> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
>> as kvmclock, I'm not convinced it actually makes sense to separate the
>> two. Like if sched_clock is known to be bad but TSC is good, why do we
>> need to use PV clock at all? Having a parameter for debugging purposes
>> may be OK though...
>>
>> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
>> (HV_ACCESS_TSC_INVARIANT) and not the architectural
>> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the latter on
>> all hypervisors.
>>
>> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
>> 100% reliable. I can imagine cases when we can't detect that fact that
>> while synchronized across CPUs and not going backwards, it is, for
>> example, ticking with an unstable frequency and PV sched clock is
>> supposed to give the right correction (all of them are rdtsc() based
>> anyways, aren't they?).
> 
> Yeah, practically speaking, the only thing adding a knob to turn off using PV
> clocks for sched_clock will accomplish is creating an even bigger matrix of
> combinations that can cause problems, e.g. where guests end up using kvmclock
> timekeeping but not scheduling.
> 
> The explanation above and the links below fail to capture _the_ key point:
> Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the
> clocksource when the TSC is constant and nonstop (first spliced blob below).
> 
> What I suggested is that if the TSC is chosen over a PV clock as the
> clocksource, then we have the kernel also override the sched_clock selection
> (second spliced blob below).
> 
> That doesn't require the guest admin to opt-in, and doesn't create even more
> combinations to support.  It also provides for a smoother transition for when
> customers inevitably end up creating VMs on hosts that don't advertise
> kvmclock (or any PV clock).

I would prefer to always leave the option to allow the guest admin to change the
decision, especially for diagnostic/workaround reasons (although kvmclock is
always buggy when the TSC is buggy).

[PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

2023-10-18 Thread Dongli Zhang
As mentioned in the Linux kernel development documentation, "sched_clock() is
used for scheduling and timestamping". While there is a default native
implementation, many paravirtualized platforms have their own implementations.

As for KVM, it uses kvm_sched_clock_read() and there is no way to disable only
KVM's sched_clock. The "no-kvmclock" parameter disables all paravirtualized
kvmclock features.

static inline void kvm_sched_clock_init(bool stable)
{
	if (!stable)
		clear_sched_clock_stable();
	kvm_sched_clock_offset = kvm_clock_read();
	paravirt_set_sched_clock(kvm_sched_clock_read);

	pr_info("kvm-clock: using sched offset of %llu cycles",
		kvm_sched_clock_offset);

	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
		     sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}

There is known issue that kvmclock may drift during vCPU hotplug [1].
Although a temporary fix is available [2], we may need a way to disable pv
sched_clock. Nowadays, the TSC is more stable and has less performance
overhead than kvmclock.

This RFC proposes a global param to disable pv sched_clock for all
paravirtualized platforms.

Please suggest and comment if other options are better:

1. Global param (this RFC patch).

2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).

Indeed I like the 2nd method.

3. Enforce native sched_clock only when TSC is invariant (hyper-v method).

4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
want to keep the pv sched_clock.

A param may also be easier to backport to older kernel versions.
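The gating logic the patch below adds can be sketched in user space (this is not kernel code; the names mirror the RFC, and the function-pointer assignment stands in for static_call_update()):

```c
/* User-space sketch of the RFC's gating logic: paravirt_set_sched_clock()
 * refuses to install a PV callback once the no_pv_sched_clock knob is set,
 * so the native sched_clock stays in place. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

typedef unsigned long long u64;

static u64 native_sched_clock(void) { return 1; }
static u64 kvm_sched_clock_read(void) { return 2; }

static bool no_pv_sched_clock;	/* set by early_param() in the patch */
static u64 (*pv_sched_clock)(void) = native_sched_clock;

/* Returns 0 on success, -EPERM when PV sched_clock has been disabled. */
static int paravirt_set_sched_clock(u64 (*func)(void))
{
	if (no_pv_sched_clock)
		return -EPERM;
	pv_sched_clock = func;
	return 0;
}
```

With the knob clear, KVM's registration succeeds; with it set, the call returns -EPERM and kvm_sched_clock_init() keeps the native sched_clock (and skips clear_sched_clock_stable(), as in the kvmclock.c hunk below).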

References:
[1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zh...@oracle.com/
[2] https://lore.kernel.org/all/20231018195638.1898375-1-sea...@google.com/
[3] https://lore.kernel.org/all/20231002211651.ga3...@noisy.programming.kicks-ass.net/

Thank you very much for the suggestion!

Signed-off-by: Dongli Zhang 
---
 arch/x86/include/asm/paravirt.h |  2 +-
 arch/x86/kernel/kvmclock.c  | 12 +++-
 arch/x86/kernel/paravirt.c  | 18 +-
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12140ae..f36edf608b6b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
 DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
 DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void));
+int paravirt_set_sched_clock(u64 (*func)(void));
 
 static __always_inline u64 paravirt_sched_clock(void)
 {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..0b8bf5677d44 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
 
 static inline void kvm_sched_clock_init(bool stable)
 {
-   if (!stable)
-   clear_sched_clock_stable();
kvm_sched_clock_offset = kvm_clock_read();
-   paravirt_set_sched_clock(kvm_sched_clock_read);
 
-   pr_info("kvm-clock: using sched offset of %llu cycles",
-   kvm_sched_clock_offset);
+   if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
+   if (!stable)
+   clear_sched_clock_stable();
+
+   pr_info("kvm-clock: using sched offset of %llu cycles",
+   kvm_sched_clock_offset);
+   }
 
BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 97f1436c1a20..2cfef94317b0 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -118,9 +118,25 @@ static u64 native_steal_clock(int cpu)
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void))
+static bool no_pv_sched_clock;
+
+static int __init parse_no_pv_sched_clock(char *arg)
+{
+   no_pv_sched_clock = true;
+   return 0;
+}
+early_param("no_pv_sched_clock", parse_no_pv_sched_clock);
+
+int paravirt_set_sched_clock(u64 (*func)(void))
 {
+   if (no_pv_sched_clock) {
+   pr_info("sched_clock: not configurable\n");
+   return -EPERM;
+   }
+
static_call_update(pv_sched_clock, func);
+
+   return 0;
 }
 
 /* These are in entry.S */
-- 
2.34.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Question on vhost vq->signalled_used and vq->signalled_used_valid

2022-06-14 Thread Dongli Zhang
Hello,

May I have a question on vq->signalled_used and vq->signalled_used_valid?

According to comments at line 2395, "If the driver never bothers to signal in a
very long while, used index might wrap around. If that happens, invalidate
signalled_used index we stored."

(BTW, I see QEMU-7.0 uses int16_t at something like line 2399 and I am thinking
about why)

2372 static int __vhost_add_used_n(struct vhost_virtqueue *vq,
2373 struct vring_used_elem *heads,
2374 unsigned count)
2375 {
2376 vring_used_elem_t __user *used;
2377 u16 old, new;
2378 int start;
2379
2380 start = vq->last_used_idx & (vq->num - 1);
2381 used = vq->used->ring + start;
2382 if (vhost_put_used(vq, heads, start, count)) {
2383 vq_err(vq, "Failed to write used");
2384 return -EFAULT;
2385 }
2386 if (unlikely(vq->log_used)) {
2387 /* Make sure data is seen before log. */
2388 smp_wmb();
2389 /* Log used ring entry write. */
2390 log_used(vq, ((void __user *)used - (void __user *)vq->used),
2391  count * sizeof *used);
2392 }
2393 old = vq->last_used_idx;
2394 new = (vq->last_used_idx += count);
2395 /* If the driver never bothers to signal in a very long while,
2396  * used index might wrap around. If that happens, invalidate
2397  * signalled_used index we stored. TODO: make sure driver
2398  * signals at least once in 2^16 and remove this. */
2399 if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
2400 vq->signalled_used_valid = false;
2401 return 0;
2402 }

However, although the vhost signals the frontend virtio *conditionally*,
vq->signalled_used is always updated at the end of vhost_notify() at line 2465,
no matter whether line 2475 returns true or false.

I did some tests but never saw line 2399 return true.

2441 static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
2442 {
2443 __u16 old, new;
2444 __virtio16 event;
2445 bool v;
2446 /* Flush out used index updates. This is paired
2447  * with the barrier that the Guest executes when enabling
2448  * interrupts. */
2449 smp_mb();
2450
2451 if (vhost_has_feature(vq, VIRTIO_F_NOTIFY_ON_EMPTY) &&
2452 unlikely(vq->avail_idx == vq->last_avail_idx))
2453 return true;
2454
2455 if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
2456 __virtio16 flags;
2457 if (vhost_get_avail_flags(vq, &flags)) {
2458 vq_err(vq, "Failed to get flags");
2459 return true;
2460 }
2461 return !(flags & cpu_to_vhost16(vq, VRING_AVAIL_F_NO_INTERRUPT));
2462 }
2463 old = vq->signalled_used;
2464 v = vq->signalled_used_valid;
2465 new = vq->signalled_used = vq->last_used_idx;
2466 vq->signalled_used_valid = true;
2467
2468 if (unlikely(!v))
2469 return true;
2470
2471 if (vhost_get_used_event(vq, &event)) {
2472 vq_err(vq, "Failed to get used event idx");
2473 return true;
2474 }
2475 return vring_need_event(vhost16_to_cpu(vq, event), new, old);
2476 }


Therefore, would you mind helping me understand what "If the driver never
bothers to signal in a very long while" indicates?

How can the vhost driver "never bother to signal in a very long while" when
vq->signalled_used is always updated?

About the naming: why not call it "vq->added_used", since the value is always
updated after something is added to the vring?

Perhaps this is a silly question. Sorry that currently I am stuck on it :)
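For reference, the line-2399 condition can be reproduced in user space (a sketch with signalled_used passed in as a parameter): it fires only when the stored signalled_used lies inside the mod-2^16 window (old, new], i.e. last_used_idx has wrapped past it. When vhost_notify() ran after the previous add, signalled equals old on the next call, so the condition is false, which seems consistent with the observation above.

```c
/* Sketch of the check at line 2399: detect whether the stored
 * signalled_used index lies inside the window (old_idx, new_idx] in
 * mod-2^16 arithmetic, meaning last_used_idx has wrapped past it. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t u16;

static bool signalled_used_wrapped(u16 new_idx, u16 old_idx, u16 signalled)
{
	return (u16)(new_idx - signalled) < (u16)(new_idx - old_idx);
}
```

The index values may wrap freely; only the relative distances matter, which is why the subtractions are truncated to u16 before comparison.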

Thank you very much!

Dongli Zhang


Re: [PATCH RFC v1 4/7] swiotlb: to implement io_tlb_high_mem

2022-06-10 Thread Dongli Zhang
Hi Christoph,

On 6/8/22 10:05 PM, Christoph Hellwig wrote:
> All this really needs to be hidden under the hood.
> 

Since this patch file has 200+ lines, would you please help clarify what 'this'
indicates?

The idea of this patch:

1. Convert the swiotlb initialization functions into callees that accept 'true'
or 'false' to indicate whether they operate on the default or the new swiotlb
buffer, e.g., swiotlb_init_remap() into __swiotlb_init_remap().

2. Let the caller decide whether to call the callee to create the extra buffer.

Do you mean that the callee is still *too high level*, and that all the
decision/allocation/initialization should be done in *lower level functions*?

E.g., if I rework "struct io_tlb_mem" to make the 64-bit buffer the 2nd array,
as in io_tlb_mem->slots[32_or_64][index], the user will only see the default
'io_tlb_default_mem' and will not be able to tell whether an allocation came
from the 32-bit or the 64-bit buffer.

Thank you very much for the feedback.

Dongli Zhang


[PATCH RFC v1 0/7] swiotlb: extra 64-bit buffer for dev->dma_io_tlb_mem

2022-06-08 Thread Dongli Zhang
Hello,

I previously sent out a patchset on a 64-bit buffer and people thought it was
the same as Restricted DMA. However, the 64-bit buffer is still not supported.

https://lore.kernel.org/all/20210203233709.19819-1-dongli.zh...@oracle.com/

This RFC is to introduce the extra swiotlb buffer with SWIOTLB_ANY flag,
to support 64-bit swiotlb.

The core ideas are:

1. Create an extra io_tlb_mem with SWIOTLB_ANY flags.

2. The dev->dma_io_tlb_mem is set to either default or the extra io_tlb_mem,
   depending on dma mask.


Would you please comment on the questions below?

- Is it fine to create the extra io_tlb_mem?

- Which one is better: to create a separate variable for the extra
  io_tlb_mem, or make it an array of two io_tlb_mem?

- Should I set dev->dma_io_tlb_mem in each driver (e.g., the virtio driver as
  in this patchset) based on the value of
  min_not_zero(*dev->dma_mask, dev->bus_dma_limit), or at a higher level
  (e.g., post pci driver)?


This patchset is to demonstrate that the idea works. Since this is just an
RFC, I have only tested virtio-blk on qemu-7.0 with swiotlb enforced. It has
not been tested in an AMD SEV environment.

qemu-system-x86_64 -cpu host -name debug-threads=on \
-smp 8 -m 16G -machine q35,accel=kvm -vnc :5 -hda boot.img \
-kernel mainline-linux/arch/x86_64/boot/bzImage \
-append "root=/dev/sda1 init=/sbin/init text console=ttyS0 loglevel=7 swiotlb=327680,3145728,force" \
-device virtio-blk-pci,id=vblk0,num-queues=8,drive=drive0,disable-legacy=on,iommu_platform=true \
-drive file=test.raw,if=none,id=drive0,cache=none \
-net nic -net user,hostfwd=tcp::5025-:22 -serial stdio


The kernel command line "swiotlb=327680,3145728,force" is to allocate 6GB for
the extra swiotlb.
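As a sanity check on the sizes (a sketch; IO_TLB_SHIFT is 11, i.e. 2KB per slab, matching the kernel's definition):

```c
/* Each swiotlb slab is (1 << IO_TLB_SHIFT) = 2KB, so the two nslabs
 * fields of "swiotlb=327680,3145728,force" correspond to 640MB and
 * 6144MB respectively. */
#include <assert.h>

#define IO_TLB_SHIFT 11

static unsigned long long slabs_to_mb(unsigned long long nslabs)
{
	return (nslabs << IO_TLB_SHIFT) >> 20;
}
```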

[2.826676] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[2.826693] software IO TLB: default mapped [mem 0x3700-0x5f00] (640MB)
[2.826697] software IO TLB: high mapped [mem 0x0002edc8-0x00046dc8] (6144MB)

The highmem swiotlb is being used by virtio-blk.

$ cat /sys/kernel/debug/swiotlb/swiotlb-hi/io_tlb_nslabs 
3145728
$ cat /sys/kernel/debug/swiotlb/swiotlb-hi/io_tlb_used 
8960


Dongli Zhang (7):
  swiotlb: introduce the highmem swiotlb buffer
  swiotlb: change the signature of remap function
  swiotlb-xen: support highmem for xen specific code
  swiotlb: to implement io_tlb_high_mem
  swiotlb: add interface to set dev->dma_io_tlb_mem
  virtio: use io_tlb_high_mem if it is active
  swiotlb: fix the slot_addr() overflow

arch/powerpc/kernel/dma-swiotlb.c  |   8 +-
arch/x86/include/asm/xen/swiotlb-xen.h |   2 +-
arch/x86/kernel/pci-dma.c  |   5 +-
drivers/virtio/virtio.c|   8 ++
drivers/xen/swiotlb-xen.c  |  16 +++-
include/linux/swiotlb.h|  14 ++-
kernel/dma/swiotlb.c   | 136 +---
7 files changed, 145 insertions(+), 44 deletions(-)

Thank you very much for feedback and suggestion!

Dongli Zhang




[PATCH RFC v1 7/7] swiotlb: fix the slot_addr() overflow

2022-06-08 Thread Dongli Zhang
Since the type of the swiotlb slot index is a signed integer, the
"((idx) << IO_TLB_SHIFT)" may return an incorrect value. As a result,
slot_addr() returns a value which is smaller than the expected one.

E.g., the 'tlb_addr' generated in swiotlb_tbl_map_single() may be smaller than
the expected one. As a result, swiotlb_bounce() will access the wrong swiotlb
slot.
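The effect can be demonstrated in user space (a sketch comparing 32-bit and widened index arithmetic; with the 6GB highmem swiotlb the slot index exceeds 2^21, so idx << 11 no longer fits in 32 bits):

```c
/* Sketch of the overflow fixed by this patch: with 3145728 slabs (6GB),
 * the slot index can exceed 2^21, so a 32-bit "idx << IO_TLB_SHIFT"
 * wraps while the widened form does not. */
#include <assert.h>
#include <stdint.h>

#define IO_TLB_SHIFT 11

/* 32-bit arithmetic, as with an int index (wraps for large idx) */
static uint64_t slot_addr_32(uint64_t start, uint32_t idx)
{
	return start + (uint32_t)(idx << IO_TLB_SHIFT);
}

/* the patched form: widen the index before shifting */
static uint64_t slot_addr_64(uint64_t start, uint32_t idx)
{
	return start + ((uint64_t)idx << IO_TLB_SHIFT);
}
```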

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 kernel/dma/swiotlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 0dcdd25ea95d..c64e557de55c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -531,7 +531,8 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
}
 }
 
-#define slot_addr(start, idx)  ((start) + ((idx) << IO_TLB_SHIFT))
+#define slot_addr(start, idx)  ((start) + \
+   (((unsigned long)idx) << IO_TLB_SHIFT))
 
 /*
  * Carefully handle integer overflow which can occur when boundary_mask == 
~0UL.
-- 
2.17.1



[PATCH RFC v1 3/7] swiotlb-xen: support highmem for xen specific code

2022-06-08 Thread Dongli Zhang
While swiotlb-xen mostly relies on the generic swiotlb API to initialize and
use the swiotlb, this patch supports a highmem swiotlb in the swiotlb-xen
specific code.

E.g., xen_swiotlb_fixup() may request the hypervisor to provide 64-bit memory
pages as the swiotlb buffer.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 drivers/xen/swiotlb-xen.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 339f46e21053..d15321e9f9db 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -38,7 +38,8 @@
 #include 
 
 #include 
-#define MAX_DMA_BITS 32
+#define MAX_DMA32_BITS 32
+#define MAX_DMA64_BITS 64
 
 /*
  * Quick lookup value of the bus address of the IOTLB.
@@ -109,19 +110,25 @@ int xen_swiotlb_fixup(void *buf, unsigned long nslabs, bool high)
int rc;
unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
unsigned int i, dma_bits = order + PAGE_SHIFT;
+   unsigned int max_dma_bits = MAX_DMA32_BITS;
dma_addr_t dma_handle;
phys_addr_t p = virt_to_phys(buf);
 
BUILD_BUG_ON(IO_TLB_SEGSIZE & (IO_TLB_SEGSIZE - 1));
BUG_ON(nslabs % IO_TLB_SEGSIZE);
 
+   if (high) {
+   dma_bits = MAX_DMA64_BITS;
+   max_dma_bits = MAX_DMA64_BITS;
+   }
+
i = 0;
do {
do {
rc = xen_create_contiguous_region(
p + (i << IO_TLB_SHIFT), order,
dma_bits, &dma_handle);
-   } while (rc && dma_bits++ < MAX_DMA_BITS);
+   } while (rc && dma_bits++ < max_dma_bits);
if (rc)
return rc;
 
@@ -381,7 +388,8 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 static int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-   return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask;
+   return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask ||
+  xen_phys_to_dma(hwdev, io_tlb_high_mem.end - 1) <= mask;
 }
 
 const struct dma_map_ops xen_swiotlb_dma_ops = {
-- 
2.17.1



[PATCH RFC v1 1/7] swiotlb: introduce the highmem swiotlb buffer

2022-06-08 Thread Dongli Zhang
Currently, the virtio driver is not able to use 4+ GB of memory when the
swiotlb is enforced, e.g., when AMD SEV is involved.

Fortunately, the SWIOTLB_ANY flag has been introduced since
commit 8ba2ed1be90f ("swiotlb: add a SWIOTLB_ANY flag to lift the low
memory restriction") to allocate swiotlb buffer from high memory.

While the default swiotlb is 'io_tlb_default_mem', the extra 'io_tlb_high_mem'
is introduced, to be allocated with the SWIOTLB_ANY flag in future patches.
E.g., the user may configure the extra highmem swiotlb buffer via
"swiotlb=327680,4194304" to allocate 8GB of memory.

In the future, the driver will be able to decide whether to use
'io_tlb_default_mem' or 'io_tlb_high_mem'.

The highmem swiotlb is enabled by the user if io_tlb_high_mem is set. It is
actively used if swiotlb_high_active() returns true.

The kernel command line "swiotlb=32768,3145728,force" is to allocate 64MB
for default swiotlb, and 6GB for the extra highmem swiotlb.
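The parameter parsing this patch adds can be sketched in user space (parse_high_nslabs() is a hypothetical helper; the kernel does the same with simple_strtoul() and ALIGN(), rounding the second field up to IO_TLB_SEGSIZE to avoid a short tail segment):

```c
/* Sketch of parsing "swiotlb=<nslabs>,<high_nslabs>[,force]": the second
 * comma-separated field becomes high_nslabs, aligned up to
 * IO_TLB_SEGSIZE (128 slabs). */
#include <assert.h>
#include <stdlib.h>

#define IO_TLB_SEGSIZE 128
#define ALIGN(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

static unsigned long parse_high_nslabs(const char *str)
{
	char *end;

	/* skip the first field (default nslabs) */
	(void)strtoul(str, &end, 0);
	if (*end == ',')
		++end;
	return ALIGN(strtoul(end, &end, 0), IO_TLB_SEGSIZE);
}
```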

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 include/linux/swiotlb.h |  2 ++
 kernel/dma/swiotlb.c| 16 
 2 files changed, 18 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..e67e605af2dd 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -109,6 +109,7 @@ struct io_tlb_mem {
} *slots;
 };
 extern struct io_tlb_mem io_tlb_default_mem;
+extern struct io_tlb_mem io_tlb_high_mem;
 
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
@@ -164,6 +165,7 @@ static inline void swiotlb_adjust_size(unsigned long size)
 }
 #endif /* CONFIG_SWIOTLB */
 
+extern bool swiotlb_high_active(void);
 extern void swiotlb_print_info(void);
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index cb50f8d38360..569bc30e7b7a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -66,10 +66,12 @@ static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
 struct io_tlb_mem io_tlb_default_mem;
+struct io_tlb_mem io_tlb_high_mem;
 
 phys_addr_t swiotlb_unencrypted_base;
 
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
+static unsigned long high_nslabs;
 
 static int __init
 setup_io_tlb_npages(char *str)
@@ -81,6 +83,15 @@ setup_io_tlb_npages(char *str)
}
if (*str == ',')
++str;
+
+   if (isdigit(*str)) {
+   /* avoid tail segment of size < IO_TLB_SEGSIZE */
+   high_nslabs =
+   ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
+   }
+   if (*str == ',')
+   ++str;
+
if (!strcmp(str, "force"))
swiotlb_force_bounce = true;
else if (!strcmp(str, "noforce"))
@@ -90,6 +101,11 @@ setup_io_tlb_npages(char *str)
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
+bool swiotlb_high_active(void)
+{
+   return high_nslabs && io_tlb_high_mem.nslabs;
+}
+
 unsigned int swiotlb_max_segment(void)
 {
if (!io_tlb_default_mem.nslabs)
-- 
2.17.1



[PATCH RFC v1 4/7] swiotlb: to implement io_tlb_high_mem

2022-06-08 Thread Dongli Zhang
This patch implements the extra 'io_tlb_high_mem'. In the future, device
drivers may choose to use either 'io_tlb_default_mem' or 'io_tlb_high_mem' as
dev->dma_io_tlb_mem.

The highmem buffer is regarded as active if
(high_nslabs && io_tlb_high_mem.nslabs) returns true.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/powerpc/kernel/dma-swiotlb.c |   8 ++-
 arch/x86/kernel/pci-dma.c |   5 +-
 include/linux/swiotlb.h   |   2 +-
 kernel/dma/swiotlb.c  | 103 +-
 4 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
index ba256c37bcc0..f18694881264 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -20,9 +20,11 @@ void __init swiotlb_detect_4g(void)
 
 static int __init check_swiotlb_enabled(void)
 {
-   if (ppc_swiotlb_enable)
-   swiotlb_print_info();
-   else
+   if (ppc_swiotlb_enable) {
+   swiotlb_print_info(false);
+   if (swiotlb_high_active())
+   swiotlb_print_info(true);
+   } else
swiotlb_exit();
 
return 0;
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 30bbe4abb5d6..1504b349b312 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -196,7 +196,10 @@ static int __init pci_iommu_init(void)
/* An IOMMU turned us off. */
if (x86_swiotlb_enable) {
pr_info("PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n");
-   swiotlb_print_info();
+
+   swiotlb_print_info(false);
+   if (swiotlb_high_active())
+   swiotlb_print_info(true);
} else {
swiotlb_exit();
}
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e61c074c55eb..8196bf961aab 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -166,7 +166,7 @@ static inline void swiotlb_adjust_size(unsigned long size)
 #endif /* CONFIG_SWIOTLB */
 
 extern bool swiotlb_high_active(void);
-extern void swiotlb_print_info(void);
+extern void swiotlb_print_info(bool high);
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
 struct page *swiotlb_alloc(struct device *dev, size_t size);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 793ca7f9..ff82b281ce01 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -101,6 +101,21 @@ setup_io_tlb_npages(char *str)
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
+static struct io_tlb_mem *io_tlb_mem_get(bool high)
+{
+   return high ? &io_tlb_high_mem : &io_tlb_default_mem;
+}
+
+static unsigned long nslabs_get(bool high)
+{
+   return high ? high_nslabs : default_nslabs;
+}
+
+static char *swiotlb_name_get(bool high)
+{
+   return high ? "high" : "default";
+}
+
 bool swiotlb_high_active(void)
 {
return high_nslabs && io_tlb_high_mem.nslabs;
@@ -133,17 +148,18 @@ void __init swiotlb_adjust_size(unsigned long size)
pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
-void swiotlb_print_info(void)
+void swiotlb_print_info(bool high)
 {
-   struct io_tlb_mem *mem = &io_tlb_default_mem;
+   struct io_tlb_mem *mem = io_tlb_mem_get(high);
 
if (!mem->nslabs) {
pr_warn("No low mem\n");
return;
}
 
-   pr_info("mapped [mem %pa-%pa] (%luMB)\n", &mem->start, &mem->end,
-  (mem->nslabs << IO_TLB_SHIFT) >> 20);
+   pr_info("%s mapped [mem %pa-%pa] (%luMB)\n",
+   swiotlb_name_get(high), &mem->start, &mem->end,
+   (mem->nslabs << IO_TLB_SHIFT) >> 20);
 }
 
 static inline unsigned long io_tlb_offset(unsigned long val)
@@ -184,15 +200,9 @@ static void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
 }
 #endif
 
-/*
- * Early SWIOTLB allocation may be too early to allow an architecture to
- * perform the desired operations.  This function allows the architecture to
- * call SWIOTLB when the operations are possible.  It needs to be called
- * before the SWIOTLB memory is used.
- */
-void __init swiotlb_update_mem_attributes(void)
+static void __init __swiotlb_update_mem_attributes(bool high)
 {
-   struct io_tlb_mem *mem = &io_tlb_default_mem;
+   struct io_tlb_mem *mem = io_tlb_mem_get(high);
void *vaddr;
unsigned long bytes;
 
@@ -207,6 +217,19 @@ void __init swiotlb_update_mem_attributes(void)
mem->vaddr = vaddr;
 }
 
+/*
+ * Early SWIOTLB allocation may be too early to allow an architecture to
+ * perform the desired operations.  This function allows the architecture to
+ * call SWIOTLB when the operations are possible.  It needs to be called
+ * before the SWIOTLB memory is used.
+ */

[PATCH RFC v1 2/7] swiotlb: change the signature of remap function

2022-06-08 Thread Dongli Zhang
Add a new argument 'high' to the remap function, so that it will be able to
remap the swiotlb buffer based on whether the swiotlb is 32-bit or 64-bit.

Currently the only remap function is xen_swiotlb_fixup().

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/x86/include/asm/xen/swiotlb-xen.h | 2 +-
 drivers/xen/swiotlb-xen.c  | 2 +-
 include/linux/swiotlb.h| 4 ++--
 kernel/dma/swiotlb.c   | 8 
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/xen/swiotlb-xen.h b/arch/x86/include/asm/xen/swiotlb-xen.h
index 77a2d19cc990..a54eae15605e 100644
--- a/arch/x86/include/asm/xen/swiotlb-xen.h
+++ b/arch/x86/include/asm/xen/swiotlb-xen.h
@@ -8,7 +8,7 @@ extern int pci_xen_swiotlb_init_late(void);
 static inline int pci_xen_swiotlb_init_late(void) { return -ENXIO; }
 #endif
 
-int xen_swiotlb_fixup(void *buf, unsigned long nslabs);
+int xen_swiotlb_fixup(void *buf, unsigned long nslabs, bool high);
 int xen_create_contiguous_region(phys_addr_t pstart, unsigned int order,
unsigned int address_bits,
dma_addr_t *dma_handle);
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..339f46e21053 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -104,7 +104,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 }
 
 #ifdef CONFIG_X86
-int xen_swiotlb_fixup(void *buf, unsigned long nslabs)
+int xen_swiotlb_fixup(void *buf, unsigned long nslabs, bool high)
 {
int rc;
unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e67e605af2dd..e61c074c55eb 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -36,9 +36,9 @@ struct scatterlist;
 
 unsigned long swiotlb_size_or_default(void);
 void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
-   int (*remap)(void *tlb, unsigned long nslabs));
+   int (*remap)(void *tlb, unsigned long nslabs, bool high));
 int swiotlb_init_late(size_t size, gfp_t gfp_mask,
-   int (*remap)(void *tlb, unsigned long nslabs));
+   int (*remap)(void *tlb, unsigned long nslabs, bool high));
 extern void __init swiotlb_update_mem_attributes(void);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 569bc30e7b7a..793ca7f9 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -245,7 +245,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
  * structures for the software IO TLB used to implement the DMA API.
  */
 void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
-   int (*remap)(void *tlb, unsigned long nslabs))
+   int (*remap)(void *tlb, unsigned long nslabs, bool high))
 {
struct io_tlb_mem *mem = &io_tlb_default_mem;
unsigned long nslabs = default_nslabs;
@@ -274,7 +274,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
return;
}
 
-   if (remap && remap(tlb, nslabs) < 0) {
+   if (remap && remap(tlb, nslabs, false) < 0) {
memblock_free(tlb, PAGE_ALIGN(bytes));
 
nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
@@ -307,7 +307,7 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
  * This should be just like above, but with some error catching.
  */
 int swiotlb_init_late(size_t size, gfp_t gfp_mask,
-   int (*remap)(void *tlb, unsigned long nslabs))
+   int (*remap)(void *tlb, unsigned long nslabs, bool high))
 {
struct io_tlb_mem *mem = &io_tlb_default_mem;
unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
@@ -337,7 +337,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
return -ENOMEM;
 
if (remap)
-   rc = remap(vstart, nslabs);
+   rc = remap(vstart, nslabs, false);
if (rc) {
free_pages((unsigned long)vstart, order);
 
-- 
2.17.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH RFC v1 6/7] virtio: use io_tlb_high_mem if it is active

2022-06-08 Thread Dongli Zhang
When swiotlb is enforced (e.g., when AMD SEV is involved), the virtio
driver will not be able to use 4+ GB memory. Therefore, the virtio driver
uses 'io_tlb_high_mem' as its swiotlb.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 drivers/virtio/virtio.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index ef04a96942bf..d9ebe3940e2d 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 /* Unique numbering for virtio devices. */
@@ -241,6 +243,12 @@ static int virtio_dev_probe(struct device *_d)
u64 device_features;
u64 driver_features;
u64 driver_features_legacy;
+   struct device *parent = dev->dev.parent;
+   u64 dma_mask = min_not_zero(*parent->dma_mask,
+   parent->bus_dma_limit);
+
+   if (dma_mask == DMA_BIT_MASK(64))
+   swiotlb_use_high(parent);
 
/* We have a driver! */
virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER);
-- 
2.17.1



[PATCH RFC v1 5/7] swiotlb: add interface to set dev->dma_io_tlb_mem

2022-06-08 Thread Dongli Zhang
The interface conditionally re-configures dev->dma_io_tlb_mem, when
'io_tlb_high_mem' is active.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 include/linux/swiotlb.h |  6 ++
 kernel/dma/swiotlb.c| 10 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8196bf961aab..78217d8bbee2 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -131,6 +131,7 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+bool swiotlb_use_high(struct device *dev);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -163,6 +164,11 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+
+static inline bool swiotlb_use_high(struct device *dev)
+{
+   return false;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern bool swiotlb_high_active(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ff82b281ce01..0dcdd25ea95d 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -121,6 +121,16 @@ bool swiotlb_high_active(void)
return high_nslabs && io_tlb_high_mem.nslabs;
 }
 
+bool swiotlb_use_high(struct device *dev)
+{
+   if (!swiotlb_high_active())
+   return false;
+
+	dev->dma_io_tlb_mem = &io_tlb_high_mem;
+   return true;
+}
+EXPORT_SYMBOL_GPL(swiotlb_use_high);
+
 unsigned int swiotlb_max_segment(void)
 {
if (!io_tlb_default_mem.nslabs)
-- 
2.17.1



Re: [PATCH v3 1/2] virtio-blk: support polling I/O

2022-03-24 Thread Dongli Zhang
Hi Suwan,

The NVMe driver prints something like the line below from nvme_setup_io_queues()
to confirm that the configuration takes effect.

"[ 0.620458] nvme nvme0: 4/0/0 default/read/poll queues".

How about printing a similar line in virtio-blk as well?

Thank you very much!

Dongli Zhang


On 3/24/22 7:04 AM, Suwan Kim wrote:
> This patch supports polling I/O via virtio-blk driver. Polling
> feature is enabled by module parameter "num_poll_queues" and it
> sets dedicated polling queues for virtio-blk. This patch improves
> the polling I/O throughput and latency.
> 
> The virtio-blk driver does not have a poll function or a poll
> queue and it has been operating in interrupt-driven mode even if
> the polling function is called in the upper layer.
> 
> virtio-blk polling is implemented upon 'batched completion' of block
> layer. virtblk_poll() queues completed request to io_comp_batch->req_list
> and later, virtblk_complete_batch() calls unmap function and ends
> the requests in batch.
> 
> virtio-blk reads the number of poll queues from module parameter
> "num_poll_queues". If VM sets queue parameter as below,
> ("num-queues=N" [QEMU property], "num_poll_queues=M" [module parameter])
> It allocates N virtqueues to virtio_blk->vqs[N] and it uses [0..(N-M-1)]
> as default queues and [(N-M)..(N-1)] as poll queues. Unlike the default
> queues, the poll queues have no callback function.
> 
> Regarding HW-SW queue mapping, the default queue mapping uses the
> existing method that considers the MSI irq vector. But the poll queue
> doesn't have an irq, so it uses the regular blk-mq cpu mapping.
> 
> For verifying the improvement, I did Fio polling I/O performance test
> with io_uring engine with the options below.
> (io_uring, hipri, randread, direct=1, bs=512, iodepth=64 numjobs=N)
> I set 4 vcpu and 4 virtio-blk queues - 2 default queues and 2 poll
> queues for VM.
> 
> As a result, IOPS and average latency improved about 10%.
> 
> Test result:
> 
> - Fio io_uring poll without virtio-blk poll support
>   -- numjobs=1 : IOPS = 339K, avg latency = 188.33us
>   -- numjobs=2 : IOPS = 367K, avg latency = 347.33us
>   -- numjobs=4 : IOPS = 383K, avg latency = 682.06us
> 
> - Fio io_uring poll with virtio-blk poll support
>   -- numjobs=1 : IOPS = 380K, avg latency = 167.87us
>   -- numjobs=2 : IOPS = 409K, avg latency = 312.6us
>   -- numjobs=4 : IOPS = 413K, avg latency = 619.72us
> 
> Reported-by: kernel test robot 
> Signed-off-by: Suwan Kim 
> ---
>  drivers/block/virtio_blk.c | 101 +++--
>  1 file changed, 97 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 8c415be86732..3d16f8b753e7 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -37,6 +37,10 @@ MODULE_PARM_DESC(num_request_queues,
>"0 for no limit. "
>"Values > nr_cpu_ids truncated to nr_cpu_ids.");
>  
> +static unsigned int num_poll_queues;
> +module_param(num_poll_queues, uint, 0644);
> +MODULE_PARM_DESC(num_poll_queues, "The number of dedicated virtqueues for polling I/O");
> +
>  static int major;
>  static DEFINE_IDA(vd_index_ida);
>  
> @@ -81,6 +85,7 @@ struct virtio_blk {
>  
>   /* num of vqs */
>   int num_vqs;
> + int io_queues[HCTX_MAX_TYPES];
>   struct virtio_blk_vq *vqs;
>  };
>  
> @@ -548,6 +553,7 @@ static int init_vq(struct virtio_blk *vblk)
>   const char **names;
>   struct virtqueue **vqs;
>   unsigned short num_vqs;
> + unsigned int num_poll_vqs;
>   struct virtio_device *vdev = vblk->vdev;
>   struct irq_affinity desc = { 0, };
>  
> @@ -556,6 +562,7 @@ static int init_vq(struct virtio_blk *vblk)
>  &num_vqs);
>   if (err)
>   num_vqs = 1;
> +
>   if (!err && !num_vqs) {
> 		dev_err(&vdev->dev, "MQ advertised but zero queues reported\n");
>   return -EINVAL;
> @@ -565,6 +572,13 @@ static int init_vq(struct virtio_blk *vblk)
>   min_not_zero(num_request_queues, nr_cpu_ids),
>   num_vqs);
>  
> + num_poll_vqs = min_t(unsigned int, num_poll_queues, num_vqs - 1);
> +
> + memset(vblk->io_queues, 0, sizeof(int) * HCTX_MAX_TYPES);
> + vblk->io_queues[HCTX_TYPE_DEFAULT] = num_vqs - num_poll_vqs;
> + vblk->io_queues[HCTX_TYPE_READ] = 0;
> + vblk->io_queues[HCTX_TYPE_POLL] = num_poll_vqs;
> +
>   vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>   if (!vblk->vqs

Re: [PATCH 1/3] virtio: cache indirect desc for split

2021-10-27 Thread Dongli Zhang



On 10/26/21 11:19 PM, Xuan Zhuo wrote:
> In the case of using indirect, indirect desc must be allocated and
> released each time, which increases a lot of cpu overhead.
> 
> Here, a cache is added for indirect. If the number of indirect desc to be
> applied for is less than VIRT_QUEUE_CACHE_DESC_NUM, the desc array with
> the size of VIRT_QUEUE_CACHE_DESC_NUM is fixed and cached for reuse.
> 
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio.c  |  6 
>  drivers/virtio/virtio_ring.c | 63 ++--
>  include/linux/virtio.h   | 10 ++
>  3 files changed, 70 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index 0a5b54034d4b..04bcb74e5b9a 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -431,6 +431,12 @@ bool is_virtio_device(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(is_virtio_device);
>  
> +void virtio_use_desc_cache(struct virtio_device *dev, bool val)
> +{
> + dev->desc_cache = val;
> +}
> +EXPORT_SYMBOL_GPL(virtio_use_desc_cache);
> +
>  void unregister_virtio_device(struct virtio_device *dev)
>  {
>   int index = dev->index; /* save for after device release */
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index dd95dfd85e98..0b9a8544b0e8 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -117,6 +117,10 @@ struct vring_virtqueue {
>   /* Hint for event idx: already triggered no need to disable. */
>   bool event_triggered;
>  
> + /* Is indirect cache used? */
> + bool use_desc_cache;
> + void *desc_cache_chain;
> +
>   union {
>   /* Available for split ring */
>   struct {
> @@ -423,12 +427,47 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
>   return extra[i].next;
>  }
>  
> -static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> +#define VIRT_QUEUE_CACHE_DESC_NUM 4
> +
> +static void desc_cache_chain_free_split(void *chain)
> +{
> + struct vring_desc *desc;
> +
> + while (chain) {
> + desc = chain;
> + chain = (void *)desc->addr;
> + kfree(desc);
> + }
> +}
> +
> +static void desc_cache_put_split(struct vring_virtqueue *vq,
> +  struct vring_desc *desc, int n)
> +{
> + if (vq->use_desc_cache && n <= VIRT_QUEUE_CACHE_DESC_NUM) {
> + desc->addr = (u64)vq->desc_cache_chain;
> + vq->desc_cache_chain = desc;
> + } else {
> + kfree(desc);
> + }
> +}
> +
> +static struct vring_desc *alloc_indirect_split(struct vring_virtqueue *vq,
>  unsigned int total_sg,
>  gfp_t gfp)
>  {
>   struct vring_desc *desc;
> - unsigned int i;
> + unsigned int i, n;
> +
> + if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> + if (vq->desc_cache_chain) {
> + desc = vq->desc_cache_chain;
> +     vq->desc_cache_chain = (void *)desc->addr;
> + goto got;
> + }
> + n = VIRT_QUEUE_CACHE_DESC_NUM;

How about making VIRT_QUEUE_CACHE_DESC_NUM configurable (at least during
driver probing), unless there is a reason the default value is 4?

Thank you very much!

Dongli Zhang



> + } else {
> + n = total_sg;
> + }
>  
>   /*
>* We require lowmem mappings for the descriptors because
> @@ -437,12 +476,13 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
>*/
>   gfp &= ~__GFP_HIGHMEM;
>  
> - desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
> + desc = kmalloc_array(n, sizeof(struct vring_desc), gfp);
>   if (!desc)
>   return NULL;
>  
> +got:
>   for (i = 0; i < total_sg; i++)
> - desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> + desc[i].next = cpu_to_virtio16(vq->vq.vdev, i + 1);
>   return desc;
>  }
>  
> @@ -508,7 +548,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>   head = vq->free_head;
>  
>   if (virtqueue_use_indirect(_vq, total_sg))
> - desc = alloc_indirect_split(_vq, total_sg, gfp);
> + desc = alloc_indirect_split(vq, total_sg, gfp);
>   else {
>   desc = NULL;
>   WARN_ON_ONCE(total_sg > vq->split.vring.num

Re: [PATCH V2 06/12] virtio_pci: harden MSI-X interrupts

2021-10-19 Thread Dongli Zhang



On 10/18/21 6:33 PM, Jason Wang wrote:
> On Sat, Oct 16, 2021 at 1:27 AM Michael S. Tsirkin  wrote:
>>
>> On Fri, Oct 15, 2021 at 05:09:38AM -0700, Dongli Zhang wrote:
>>> Hi Jason,
>>>
>>> On 10/11/21 11:52 PM, Jason Wang wrote:
>>>> We used to synchronize pending MSI-X irq handlers via
>>>> synchronize_irq(), this may not work for the untrusted device which
>>>> may keep sending interrupts after reset which may lead unexpected
>>>> results. Similarly, we should not enable MSI-X interrupt until the
>>>
>>> About "unexpected results", while you mentioned below in v1 ...
>>>
>>> "My understanding is that e.g in the case of SEV/TDX we don't trust the
>>> hypervisor. So the hypervisor can keep sending interrupts even if the
>>> device is reset. The guest can only trust its own software interrupt
>>> management logic to avoid call virtio callback in this case."
>>>
>>> .. and you also mentioned to avoid the panic (due to untrusted device) in as
>>> many scenarios as possible.
>>>
>>>
>>> Here is my understanding.
>>>
>>> The reason we do not trust hypervisor is to protect (1) data/code privacy 
>>> for
>>> most of times and sometimes (2) program execution integrity.
>>>
>>> The bad thing is: the hypervisor is able to panic/destroy the VM in the 
>>> worst case.
>>>
>>> It is reasonable to re-configure/recover if we assume there is BUG at
>>> hypervisor/device side. That is, the hypervisor/device is buggy, but not 
>>> malicious.
>>>
>>> However, how about to just detect and report the hypervisor/device is 
>>> malicious
>>> and shutdown/panic the VM immediately? If we have detected the malicious
>>> hypervisor, we should avoid running VMs on the malicious hypervisor 
>>> further. At
>>> least how about to print an error message to remind users that the 
>>> hypervisor is
>>> malicious?
> 
> I understand your point, but several questions needs to be answered:
> 
> 1) how can we easily differentiate "malicious" from "buggy"?
> 2) If we want to detect malicious hypervisor, virtio is not the only
> place that we want to do this
> 3) Is there a way that "malicious" hypervisor can prevent guest from
> shutting down itself?
> 
>>>
>>>
>>> About "unexpected results", it should not be hang/panic. I suggest ...
>>>
> 
> It's just the phenomenon not the logic behind that. It could be e.g
> OOB which may lead the unexpected kernel codes to be executed in
> unexpected ways (e.g mark the page as shared or go with IOTLB etc).
> Sometimes we can see panic finally but it's not always.

This is what I was trying to explain.

The objective is to protect "data privacy" (or code execution integrity in some
cases), but not to prevent DoS attacks. That is, the 'malicious' hypervisor should
not be able to derive the VM's private data.

Suppose the hypervisor did something buggy/malicious when SEV/TDX is used by VM,
the "unexpected results" should never reveal secure/private data to the 
hypervisor.

In my own opinion, the threat model is:

Attacker: 'malicious' hypervisor

Victim: VM with SEV/TDX/SGX

The attacker should not be able to steal secure/private data from VM, when the
hypervisor's action is unexpected. DoS is out of the scope.

My concern is: it is very hard to clearly explain in the patchset how the
hypervisor is able to steal VM's data, by setting queue=0 or injecting unwanted
interrupts to VM.

Thank you very much!

Dongli Zhang

> 
>>> Assuming SEV/TDX is involved, the hypervisor should never be able to derive 
>>> the
>>> VM privacy (at least secure memory data) by providing malicious 
>>> configuration,
>>> e.g., num_queues=0.
> 
> Yes.
> 
>>> If we detect hypervisor is malicious, the VM is
>>> panic/shutdown immediately.
> 
> This seems to enforce the policy into the kernel, we need to leave
> this to userspace to decide.
> 
>>> At least how about to print error message to
>>> remind users.
> 
> This is fine.
> 
>>>
>>>
>>> BTW, while I always do care about the loss of interrupt issue, the malicious
>>> device is able to hang a VM by dropping a single interrupt on purpose for
>>> virtio-scsi :)
>>>
> 
> Right.
> 
>>>
>>> Thank you very much!
>>
>>
>> Can't say I understand what it's about. TDX does not protect against
>> hypervisor DoS att

Re: [PATCH V2 06/12] virtio_pci: harden MSI-X interrupts

2021-10-15 Thread Dongli Zhang
Hi Jason,

On 10/11/21 11:52 PM, Jason Wang wrote:
> We used to synchronize pending MSI-X irq handlers via
> synchronize_irq(), this may not work for the untrusted device which
> may keep sending interrupts after reset which may lead unexpected
> results. Similarly, we should not enable MSI-X interrupt until the

About "unexpected results", while you mentioned below in v1 ...

"My understanding is that e.g in the case of SEV/TDX we don't trust the
hypervisor. So the hypervisor can keep sending interrupts even if the
device is reset. The guest can only trust its own software interrupt
management logic to avoid call virtio callback in this case."

.. and you also mentioned to avoid the panic (due to untrusted device) in as
many scenarios as possible.


Here is my understanding.

The reason we do not trust hypervisor is to protect (1) data/code privacy for
most of times and sometimes (2) program execution integrity.

The bad thing is: the hypervisor is able to panic/destroy the VM in the worst 
case.

It is reasonable to re-configure/recover if we assume there is BUG at
hypervisor/device side. That is, the hypervisor/device is buggy, but not 
malicious.

However, how about just detecting and reporting that the hypervisor/device is
malicious and shutting down/panicking the VM immediately? If we have detected a
malicious hypervisor, we should avoid running VMs on it further. At least, how
about printing an error message to remind users that the hypervisor is
malicious?


About "unexpected results": they should not be a hang/panic. I suggest ...

Assuming SEV/TDX is involved, the hypervisor should never be able to derive the
VM privacy (at least secure memory data) by providing malicious configuration,
e.g., num_queues=0. If we detect that the hypervisor is malicious, the VM is
panicked/shut down immediately. At least, how about printing an error message to
remind users.


BTW, while I always do care about the loss of interrupt issue, the malicious
device is able to hang a VM by dropping a single interrupt on purpose for
virtio-scsi :)


Thank you very much!

Dongli Zhang

> device is ready. So this patch fixes those two issues by:
> 
> 1) switching to use disable_irq() to prevent the virtio interrupt
>handlers to be called after the device is reset.
> 2) using IRQF_NO_AUTOEN and enable the MSI-X irq during .ready()
> 
> This can make sure the virtio interrupt handler won't be called before
> virtio_device_ready() and after reset.
> 
> Cc: Thomas Gleixner 
> Cc: Peter Zijlstra 
> Cc: Paul E. McKenney 
> Signed-off-by: Jason Wang 
> ---
>  drivers/virtio/virtio_pci_common.c | 27 +--
>  drivers/virtio/virtio_pci_common.h |  6 --
>  drivers/virtio/virtio_pci_legacy.c |  5 +++--
>  drivers/virtio/virtio_pci_modern.c |  6 --
>  4 files changed, 32 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> index b35bb2d57f62..0b9523e6dd39 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -24,8 +24,8 @@ MODULE_PARM_DESC(force_legacy,
>"Force legacy mode for transitional virtio 1 devices");
>  #endif
>  
> -/* wait for pending irq handlers */
> -void vp_synchronize_vectors(struct virtio_device *vdev)
> +/* disable irq handlers */
> +void vp_disable_vectors(struct virtio_device *vdev)
>  {
>   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>   int i;
> @@ -34,7 +34,20 @@ void vp_synchronize_vectors(struct virtio_device *vdev)
>   synchronize_irq(vp_dev->pci_dev->irq);
>  
>   for (i = 0; i < vp_dev->msix_vectors; ++i)
> - synchronize_irq(pci_irq_vector(vp_dev->pci_dev, i));
> + disable_irq(pci_irq_vector(vp_dev->pci_dev, i));
> +}
> +
> +/* enable irq handlers */
> +void vp_enable_vectors(struct virtio_device *vdev)
> +{
> + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> + int i;
> +
> + if (vp_dev->intx_enabled)
> + return;
> +
> + for (i = 0; i < vp_dev->msix_vectors; ++i)
> + enable_irq(pci_irq_vector(vp_dev->pci_dev, i));
>  }
>  
>  /* the notify function used when creating a virt queue */
> @@ -141,7 +154,8 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
>   snprintf(vp_dev->msix_names[v], sizeof *vp_dev->msix_names,
>"%s-config", name);
>   err = request_irq(pci_irq_vector(vp_dev->pci_dev, v),
> -   vp_config_changed, 0, vp_dev->msix_names[v],
> +   vp_config_changed, IRQF_NO_AUTOEN,
> +   vp_dev->msix_names[v],
> 

Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler

2021-05-25 Thread Dongli Zhang



On 5/25/21 10:24 AM, Hannes Reinecke wrote:
> On 5/25/21 6:47 PM, Stefan Hajnoczi wrote:
>> On Mon, May 24, 2021 at 11:33:33PM -0700, Dongli Zhang wrote:
>>> On 5/24/21 6:24 AM, Stefan Hajnoczi wrote:
>>>> On Sun, May 23, 2021 at 09:39:51AM +0200, Hannes Reinecke wrote:
>>>>> On 5/23/21 8:38 AM, Dongli Zhang wrote:
>>>>>> This RFC is to trigger the discussion about to poll and kick the
>>>>>> virtqueue on purpose in virtio-scsi timeout handler.
>>>>>>
>>>>>> The virtio-scsi relies on the virtio vring shared between VM and host.
>>>>>> The VM side produces requests to vring and kicks the virtqueue, while the
>>>>>> host side produces responses to vring and interrupts the VM side.
>>>>>>
>>>>>> By default the virtio-scsi handler depends on the host timeout handler
>>>>>> by BLK_EH_RESET_TIMER to give host a chance to perform EH.
>>>>>>
>>>>>> However, this is not helpful for the case that the responses are 
>>>>>> available
>>>>>> on vring but the notification from host to VM is lost.
>>>>>>
>>>>> How can this happen?
>>>>> If responses are lost the communication between VM and host is broken, and
>>>>> we should rather reset the virtio rings themselves.
>>>>
>>>> I agree. In principle it's fine to poll the virtqueue at any time, but I
>>>> don't understand the failure scenario here. It's not clear to me why the
>>>> device-to-driver vq notification could be lost.
>>>>
>>>
>>> One example is the CPU hotplug issue before the commit bf0beec0607d 
>>> ("blk-mq:
>>> drain I/O when all CPUs in a hctx are offline") was available. The issue is
>>> equivalent to loss of interrupt. Without the CPU hotplug fix, while NVMe 
>>> driver
>>> relies on the timeout handler to complete inflight IO requests, the PV
>>> virtio-scsi may hang permanently.
>>>
>>> In addition, as the virtio/vhost/QEMU are complex software, we are not able 
>>> to
>>> guarantee there is no further lost of interrupt/kick issue in the future. 
>>> It is
>>> really painful if we encounter such issue in production environment.
>>
>> Any number of hardware or software bugs might exist that we don't know
>> about, yet we don't pre-emptively add workarounds for them because where
>> do you draw the line?
>>
>> I checked other SCSI/block drivers and found it's rare to poll in the
>> timeout function so there does not seem to be a consensus that it's
>> useful to do this.
>>
> Not only this; it's downright dangerous attempting to do that in SCSI.
> In SCSI we don't have fixed lifetime guarantees that NVMe has, so there will 
> be
> a race condition between timeout and command completion.

Thank you very much for the explanation. Yes, we cannot do that due to the race.

Dongli Zhang


> Plus there is no interface in SCSI allowing to 'poll' for completions in a
> meaningful manner.
> 
>> That said, it's technically fine to do it, the virtqueue APIs are there
>> and can be used like this. So if you and others think this is necessary,
>> then it's a pretty small change and I'm not against merging a patch like
>> this.
>>
> I would rather _not_ put more functionality into the virtio_scsi timeout
> handler; this only serves to assume that the timeout handler has some
> functionality in virtio.
> Which it patently hasn't, as the prime reason for a timeout handler is to
> _abort_ a command, which we can't on virtio.
> Well, we can on virtio, but qemu as the main user will re-route the I/O from
> virtio into doing async-I/O, and there is no way how we can abort outstanding
> asynchronous I/O.
> Or any other ioctl, for that matter.
> 
> Cheers,
> 
> Hannes


Re: [RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler

2021-05-25 Thread Dongli Zhang
Hi Stefan and Hannes,

On 5/24/21 6:24 AM, Stefan Hajnoczi wrote:
> On Sun, May 23, 2021 at 09:39:51AM +0200, Hannes Reinecke wrote:
>> On 5/23/21 8:38 AM, Dongli Zhang wrote:
>>> This RFC is to trigger the discussion about to poll and kick the
>>> virtqueue on purpose in virtio-scsi timeout handler.
>>>
>>> The virtio-scsi relies on the virtio vring shared between VM and host.
>>> The VM side produces requests to vring and kicks the virtqueue, while the
>>> host side produces responses to vring and interrupts the VM side.
>>>
>>> By default the virtio-scsi handler depends on the host timeout handler
>>> by BLK_EH_RESET_TIMER to give host a chance to perform EH.
>>>
>>> However, this is not helpful for the case that the responses are available
>>> on vring but the notification from host to VM is lost.
>>>
>> How can this happen?
>> If responses are lost the communication between VM and host is broken, and
>> we should rather reset the virtio rings themselves.
> 
> I agree. In principle it's fine to poll the virtqueue at any time, but I
> don't understand the failure scenario here. It's not clear to me why the
> device-to-driver vq notification could be lost.
> 

One example is the CPU hotplug issue before the commit bf0beec0607d ("blk-mq:
drain I/O when all CPUs in a hctx are offline") was available. The issue is
equivalent to loss of interrupt. Without the CPU hotplug fix, while NVMe driver
relies on the timeout handler to complete inflight IO requests, the PV
virtio-scsi may hang permanently.

In addition, as virtio/vhost/QEMU are complex software, we are not able to
guarantee there is no further loss-of-interrupt/kick issue in the future. It is
really painful if we encounter such an issue in a production environment.


About resetting the vring: if this is just due to a lost interrupt, I do not
think it is necessary to reset the entire vring. Polling the vring should be
enough. The NVMe PCI driver does the same, assuming there may be a lost interrupt.

Once there is a request timeout, the NVMe PCI driver first polls the ring buffer
and confirms whether the request is completed, instead of resetting/aborting
immediately.


1254 static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
... ...
1280 /*
1281  * Did we miss an interrupt?
1282  */
1283 if (test_bit(NVMEQ_POLLED, &nvmeq->flags))
1284 nvme_poll(req->mq_hctx);
1285 else
1286 nvme_poll_irqdisable(nvmeq);
1287
1288 if (blk_mq_request_completed(req)) {
1289 dev_warn(dev->ctrl.device,
1290  "I/O %d QID %d timeout, completion polled\n",
1291  req->tag, nvmeq->qid);
1292 return BLK_EH_DONE;
1293 }


Thank you very much!

Dongli Zhang


[RFC] virtio_scsi: to poll and kick the virtqueue in timeout handler

2021-05-23 Thread Dongli Zhang
This RFC is to trigger the discussion about to poll and kick the
virtqueue on purpose in virtio-scsi timeout handler.

The virtio-scsi relies on the virtio vring shared between VM and host.
The VM side produces requests to vring and kicks the virtqueue, while the
host side produces responses to vring and interrupts the VM side.

By default the virtio-scsi handler depends on the host timeout handler
by BLK_EH_RESET_TIMER to give host a chance to perform EH.

However, this is not helpful for the case that the responses are available
on vring but the notification from host to VM is lost.


Would you mind sharing your feedback on this idea of polling the vring on
purpose in the timeout handler, to give the VM a chance to recover from stalled
IO? In addition, how about kicking an extra time, in case the
stalled IO is due to the loss of a VM-to-host notification?


I have tested the IO can be recovered after interrupt is lost on purpose
with below debug patch.

https://github.com/finallyjustice/patchset/blob/master/scsi-virtio_scsi-to-lose-an-interrupt-on-purpose.patch


In addition, the virtio-blk may implement the timeout handler as well,
with the helper function in below patchset from Stefan.

https://lore.kernel.org/linux-block/20210520141305.355961-1-stefa...@redhat.com/T/#t


Thank you very much!

Dongli Zhang



Considering there might be a lost interrupt or kick, the timeout
handler may poll or kick the virtqueue on purpose, in order to recover the
IO.

If the response is already available on vring, it indicates the host side
has already processed the request and it is not helpful by giving host a
chance to perform EH.

There will be syslog like below by the timeout handler:

[  135.830746] virtio_scsi: Virtio SCSI HBA 0: I/O 196 QID 3 timeout, completion polled

Signed-off-by: Dongli Zhang 
---
 drivers/scsi/virtio_scsi.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index b9c86a7e3b97..633950b6336a 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -36,6 +36,9 @@
 #define VIRTIO_SCSI_EVENT_LEN 8
 #define VIRTIO_SCSI_VQ_BASE 2
 
+static bool __read_mostly timed_out_enabled;
+module_param(timed_out_enabled, bool, 0644);
+
 /* Command queue element */
 struct virtio_scsi_cmd {
struct scsi_cmnd *sc;
@@ -732,9 +735,38 @@ static void virtscsi_commit_rqs(struct Scsi_Host *shost, u16 hwq)
  * The host guarantees to respond to each command, although I/O
  * latencies might be higher than on bare metal.  Reset the timer
  * unconditionally to give the host a chance to perform EH.
+ *
+ * If 'timed_out_enabled' is set, the timeout handler polls or kicks the
+ * virtqueue on purpose, in order to recover the IO, in case there is a
+ * lost interrupt or kick with virtio.
  */
 static enum blk_eh_timer_return virtscsi_eh_timed_out(struct scsi_cmnd *scmnd)
 {
+   struct Scsi_Host *shost;
+   struct virtio_scsi *vscsi;
+   struct virtio_scsi_vq *req_vq;
+
+   if (!timed_out_enabled)
+   return BLK_EH_RESET_TIMER;
+
+   shost = scmnd->device->host;
+   vscsi = shost_priv(shost);
+   req_vq = virtscsi_pick_vq_mq(vscsi, scmnd);
+
+   virtscsi_vq_done(vscsi, req_vq, virtscsi_complete_cmd);
+
+   if (test_bit(SCMD_STATE_COMPLETE, &scmnd->state)) {
+   pr_warn("Virtio SCSI HBA %u: I/O %u QID %d timeout, completion polled\n",
+   shost->host_no, scmnd->tag, req_vq->vq->index);
+   return BLK_EH_DONE;
+   }
+
+   /*
+* To kick the virtqueue in case the timeout is due to the loss of
+* one prior notification.
+*/
+   virtqueue_notify(req_vq->vq);
+
return BLK_EH_RESET_TIMER;
 }
 
-- 
2.17.1



Re: [vdpa_sim_net] 79991caf52: net/ipv4/ipmr.c:#RCU-list_traversed_in_non-reader_section

2021-02-07 Thread Dongli Zhang
Is it possible that the issue is not due to this change?

This change just calls a different API to allocate memory, which is
equivalent to kzalloc()+vzalloc().

Before the change:

try kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);

... and then below if the former is failed.

vzalloc(sizeof(*vs));


After the change:

try kmalloc_node(size, GFP_KERNEL|GFP_ZERO|__GFP_NOWARN|__GFP_NORETRY, node);

... and then below if the former is failed

__vmalloc_node(size, 1, GFP_KERNEL|GFP_ZERO, node, __builtin_return_address(0));


Below is the first WARNING in the uploaded dmesg. I assume it was triggered
before /dev/vhost-scsi was opened.

Will this test try to open /dev/vhost-scsi?

[5.095515] =============================
[5.095515] WARNING: suspicious RCU usage
[5.095515] 5.11.0-rc4-8-g79991caf5202 #1 Not tainted
[5.095534] -----------------------------
[5.096041] security/smack/smack_lsm.c:351 RCU-list traversed in non-reader section!!
[5.096982]
[5.096982] other info that might help us debug this:
[5.096982]
[5.097953]
[5.097953] rcu_scheduler_active = 1, debug_locks = 1
[5.098739] no locks held by kthreadd/2.
[5.099237]
[5.099237] stack backtrace:
[5.099537] CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.11.0-rc4-8-g79991caf5202 #1
[5.100470] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[5.101442] Call Trace:
[5.101807]  dump_stack+0x15f/0x1bf
[5.102298]  smack_cred_prepare+0x400/0x420
[5.102840]  ? security_prepare_creds+0xd4/0x120
[5.103441]  security_prepare_creds+0x84/0x120
[5.103515]  prepare_creds+0x3f1/0x580
[5.103515]  copy_creds+0x65/0x480
[5.103515]  copy_process+0x7b4/0x3600
[5.103515]  ? check_prev_add+0xa40/0xa40
[5.103515]  ? lockdep_enabled+0xd/0x60
[5.103515]  ? lock_is_held_type+0x1a/0x100
[5.103515]  ? __cleanup_sighand+0xc0/0xc0
[5.103515]  ? lockdep_unlock+0x39/0x160
[5.103515]  kernel_clone+0x165/0xd20
[5.103515]  ? copy_init_mm+0x20/0x20
[5.103515]  ? pvclock_clocksource_read+0xd9/0x1a0
[5.103515]  ? sched_clock_local+0x99/0xc0
[5.103515]  ? kthread_insert_work_sanity_check+0xc0/0xc0
[5.103515]  kernel_thread+0xba/0x100
[5.103515]  ? __ia32_sys_clone3+0x40/0x40
[5.103515]  ? kthread_insert_work_sanity_check+0xc0/0xc0
[5.103515]  ? do_raw_spin_unlock+0xa9/0x160
[5.103515]  kthreadd+0x68f/0x7a0
[5.103515]  ? kthread_create_on_cpu+0x160/0x160
[5.103515]  ? lockdep_hardirqs_on+0x77/0x100
[5.103515]  ? _raw_spin_unlock_irq+0x24/0x60
[5.103515]  ? kthread_create_on_cpu+0x160/0x160
[5.103515]  ret_from_fork+0x22/0x30

Thank you very much!

Dongli Zhang


On 2/6/21 7:03 PM, kernel test robot wrote:
> 
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: 79991caf5202c7989928be534727805f8f68bb8d ("vdpa_sim_net: Add support for user supported devices")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git
> Dongli-Zhang/vhost-scsi-alloc-vhost_scsi-with-kvzalloc-to-avoid-delay/20210129-191605
> 
> 
> in testcase: trinity
> version: trinity-static-x86_64-x86_64-f93256fb_2019-08-28
> with following parameters:
> 
>   runtime: 300s
> 
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> +--------------------------------------------------------------------------+------------+------------+
> |                                                                          | 39502d042a | 79991caf52 |
> +--------------------------------------------------------------------------+------------+------------+
> | boot_successes                                                           | 0          | 0          |
> | boot_failures                                                            | 62         | 57         |
> | WARNING:suspicious_RCU_usage                                             | 62         | 57         |
> | security/smack/smack_lsm.c:#RCU-list_traversed_in_non-reader_section     | 62         | 57         |
> | security/smack/smack_access.c:#RCU-list_traversed_in_non-reader_section  | 62         | 57         |
> | BUG:workqueue_lockup-pool                                                | 33         | 40         |
> | BUG:kernel_hang_in_boot_stage

Re: [PATCH v2 1/1] vhost scsi: alloc vhost_scsi with kvzalloc() to avoid delay

2021-01-23 Thread Dongli Zhang
According to my "git send-email" history, I have CCed jasow...@redhat.com. Not
sure why Jason is not on the list.

CCed Jason. Thank you very much!

Dongli Zhang

On 1/23/21 12:08 AM, Dongli Zhang wrote:
> The size of 'struct vhost_scsi' is order-10 (~2.3MB). It may take long time
> delay by kzalloc() to compact memory pages by retrying multiple times when
> there is a lack of high-order pages. As a result, there is latency to
> create a VM (with vhost-scsi) or to hotadd vhost-scsi-based storage.
> 
> The prior commit 595cb754983d ("vhost/scsi: use vmalloc for order-10
> allocation") prefers to fallback only when really needed, while this patch
> allocates with kvzalloc() with __GFP_NORETRY implicitly set to avoid
> retrying memory pages compact for multiple times.
> 
> The __GFP_NORETRY is implicitly set if the size to allocate is more than
> PAGE_SIZE and when __GFP_RETRY_MAYFAIL is not explicitly set.
> 
> Cc: Aruna Ramakrishna 
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---
> Changed since v1:
>   - To combine kzalloc() and vzalloc() as kvzalloc()
> (suggested by Jason Wang)
> 
>  drivers/vhost/scsi.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4ce9f00ae10e..5de21ad4bd05 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1814,12 +1814,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>   struct vhost_virtqueue **vqs;
>   int r = -ENOMEM, i;
>  
> - vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
> - if (!vs) {
> - vs = vzalloc(sizeof(*vs));
> - if (!vs)
> - goto err_vs;
> - }
> + vs = kvzalloc(sizeof(*vs), GFP_KERNEL);
> + if (!vs)
> + goto err_vs;
>  
>   vqs = kmalloc_array(VHOST_SCSI_MAX_VQ, sizeof(*vqs), GFP_KERNEL);
>   if (!vqs)
> 


[PATCH v2 1/1] vhost scsi: alloc vhost_scsi with kvzalloc() to avoid delay

2021-01-23 Thread Dongli Zhang
The size of 'struct vhost_scsi' is order-10 (~2.3MB). kzalloc() may take a
long time to compact memory pages, retrying multiple times, when there is a
lack of high-order pages. As a result, there is latency when creating a VM
(with vhost-scsi) or hotadding vhost-scsi-based storage.

The prior commit 595cb754983d ("vhost/scsi: use vmalloc for order-10
allocation") prefers to fallback only when really needed, while this patch
allocates with kvzalloc() with __GFP_NORETRY implicitly set to avoid
retrying memory pages compact for multiple times.

The __GFP_NORETRY is implicitly set if the size to allocate is more than
PAGE_SIZE and when __GFP_RETRY_MAYFAIL is not explicitly set.

Cc: Aruna Ramakrishna 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - To combine kzalloc() and vzalloc() as kvzalloc()
(suggested by Jason Wang)

 drivers/vhost/scsi.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4ce9f00ae10e..5de21ad4bd05 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1814,12 +1814,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
struct vhost_virtqueue **vqs;
int r = -ENOMEM, i;
 
-   vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
-   if (!vs) {
-   vs = vzalloc(sizeof(*vs));
-   if (!vs)
-   goto err_vs;
-   }
+   vs = kvzalloc(sizeof(*vs), GFP_KERNEL);
+   if (!vs)
+   goto err_vs;
 
vqs = kmalloc_array(VHOST_SCSI_MAX_VQ, sizeof(*vqs), GFP_KERNEL);
if (!vqs)
-- 
2.17.1



Re: [PATCH 1/1] vhost scsi: allocate vhost_scsi with GFP_NOWAIT to avoid delay

2021-01-23 Thread Dongli Zhang


On 1/21/21 1:00 AM, Jason Wang wrote:
> 
> On 2021/1/21 13:03, Dongli Zhang wrote:
>> The size of 'struct vhost_scsi' is order-10 (~2.3MB). It may take long time
>> delay by kzalloc() to compact memory pages when there is a lack of
>> high-order pages. As a result, there is latency to create a VM (with
>> vhost-scsi) or to hotadd vhost-scsi-based storage.
>>
>> The prior commit 595cb754983d ("vhost/scsi: use vmalloc for order-10
>> allocation") prefers to fallback only when really needed, while this patch
>> changes allocation to GFP_NOWAIT in order to avoid the delay caused by
>> memory page compact.
>>
>> Cc: Aruna Ramakrishna 
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
>> ---
>> Another option is to rework by reducing the size of 'struct vhost_scsi',
>> e.g., by replacing inline vhost_scsi.vqs with just memory pointers while
>> each vhost_scsi.vqs[i] should be allocated separately. Please let me
>> know if that option is better.
>>
>>   drivers/vhost/scsi.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
>> index 4ce9f00ae10e..85eaa4e883f4 100644
>> --- a/drivers/vhost/scsi.c
>> +++ b/drivers/vhost/scsi.c
>> @@ -1814,7 +1814,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>>   struct vhost_virtqueue **vqs;
>>   int r = -ENOMEM, i;
>> -    vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
>> +    vs = kzalloc(sizeof(*vs), GFP_NOWAIT | __GFP_NOWARN);
>>   if (!vs) {
>>   vs = vzalloc(sizeof(*vs));
>>   if (!vs)
> 
> 
> Can we use kvzalloc?
> 
Thank you very much for the suggestion.

Using 'GFP_NOWAIT' avoids any direct compaction in __alloc_pages_slowpath(),
while kvzalloc() just avoids retrying direct compaction multiple times.

Although the latter still does direct compaction (without retrying), I think
it is better than the former using GFP_NOWAIT.

I will send v2 with kvzalloc().
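For reference, the difference between the two flag choices can be sketched as a userspace decision table. This is illustrative only — not the real page-allocator slowpath — and just captures the distinction argued above: GFP_NOWAIT never compacts, __GFP_NORETRY compacts once without retrying, and a plain sleeping allocation may retry.

```c
#include <assert.h>
#include <stdbool.h>

struct reclaim_behavior {
    bool direct_compact; /* does the slowpath attempt direct compaction? */
    bool retries;        /* may compaction/reclaim be retried? */
};

/*
 * Illustrative model of how the discussed GFP flags shape the slowpath
 * for a costly high-order allocation.
 */
struct reclaim_behavior slowpath_model(bool gfp_nowait, bool gfp_noretry)
{
    struct reclaim_behavior b = { false, false };

    if (gfp_nowait)
        return b; /* no direct compaction at all */

    b.direct_compact = true;
    b.retries = !gfp_noretry; /* __GFP_NORETRY: compact once, give up */
    return b;
}
```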

Thank you very much!

Dongli Zhang

[PATCH 1/1] vhost scsi: allocate vhost_scsi with GFP_NOWAIT to avoid delay

2021-01-20 Thread Dongli Zhang
The size of 'struct vhost_scsi' is order-10 (~2.3MB). kzalloc() may take a
long time to compact memory pages when there is a lack of high-order pages.
As a result, there is latency when creating a VM (with vhost-scsi) or
hotadding vhost-scsi-based storage.

The prior commit 595cb754983d ("vhost/scsi: use vmalloc for order-10
allocation") prefers to fallback only when really needed, while this patch
changes allocation to GFP_NOWAIT in order to avoid the delay caused by
memory page compact.

Cc: Aruna Ramakrishna 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Another option is to rework by reducing the size of 'struct vhost_scsi',
e.g., by replacing inline vhost_scsi.vqs with just memory pointers while
each vhost_scsi.vqs[i] should be allocated separately. Please let me
know if that option is better.

 drivers/vhost/scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4ce9f00ae10e..85eaa4e883f4 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1814,7 +1814,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
struct vhost_virtqueue **vqs;
int r = -ENOMEM, i;
 
-   vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
+   vs = kzalloc(sizeof(*vs), GFP_NOWAIT | __GFP_NOWARN);
if (!vs) {
vs = vzalloc(sizeof(*vs));
if (!vs)
-- 
2.17.1



Re: [PATCH 1/2] virtio-blk: fix hw_queue stopped on arbitrary error

2020-02-14 Thread dongli . zhang
Hi Halil,

When swiotlb is full for virtio_blk, the below warning is printed once (the
warning is not caused by this patch set). Is this expected, or just a false
positive?
[   54.767257] virtio-pci :00:04.0: swiotlb buffer is full (sz: 16 bytes), total 32768 (slots), used 258 (slots)
[   54.767260] virtio-pci :00:04.0: overflow 0x75770110+16 of DMA mask  bus limit 0
[   54.769192] [ cut here ]
[   54.769200] WARNING: CPU: 3 PID: 102 at kernel/dma/direct.c:35
report_addr+0x71/0x77
[   54.769200] Modules linked in:
[   54.769203] CPU: 3 PID: 102 Comm: kworker/u8:2 Not tainted 5.6.0-rc1+ #2
[   54.769204] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[   54.769208] Workqueue: writeback wb_workfn (flush-253:0)
[   54.769211] RIP: 0010:report_addr+0x71/0x77
... ...
[   54.769226] Call Trace:
[   54.769241]  dma_direct_map_page+0xc9/0xe0
[   54.769245]  virtqueue_add+0x172/0xaa0
[   54.769248]  virtqueue_add_sgs+0x85/0xa0
[   54.769251]  virtio_queue_rq+0x292/0x480
[   54.769255]  __blk_mq_try_issue_directly+0x13e/0x1f0
[   54.769257]  blk_mq_request_issue_directly+0x48/0xa0
[   54.769259]  blk_mq_try_issue_list_directly+0x3c/0xb0
[   54.769261]  blk_mq_sched_insert_requests+0xb6/0x100
[   54.769263]  blk_mq_flush_plug_list+0x146/0x210
[   54.769264]  blk_flush_plug_list+0xba/0xe0
[   54.769266]  blk_mq_make_request+0x331/0x5b0
[   54.769268]  generic_make_request+0x10d/0x2e0
[   54.769270]  submit_bio+0x69/0x130
[   54.769273]  ext4_io_submit+0x44/0x50
[   54.769275]  ext4_writepages+0x56f/0xd30
[   54.769278]  ? cpumask_next_and+0x19/0x20
[   54.769280]  ? find_busiest_group+0x11a/0xa40
[   54.769283]  do_writepages+0x15/0x70
[   54.769288]  __writeback_single_inode+0x38/0x330
[   54.769290]  writeback_sb_inodes+0x219/0x4c0
[   54.769292]  __writeback_inodes_wb+0x82/0xb0
[   54.769293]  wb_writeback+0x267/0x300
[   54.769295]  wb_workfn+0x1aa/0x430
[   54.769298]  process_one_work+0x156/0x360
[   54.769299]  worker_thread+0x41/0x3b0
[   54.769300]  kthread+0xf3/0x130
[   54.769302]  ? process_one_work+0x360/0x360
[   54.769303]  ? kthread_bind+0x10/0x10
[   54.769305]  ret_from_fork+0x35/0x40
[   54.769307] ---[ end trace 923a87a9ce0e777a ]---

Thank you very much!

Dongli Zhang

On 2/13/20 4:37 AM, Halil Pasic wrote:
> Since nobody else is going to restart our hw_queue for us, the
> blk_mq_start_stopped_hw_queues() in virtblk_done() is not necessarily
> sufficient to ensure that the queue will get started again.
> In case of global resource outage (-ENOMEM because mapping failure,
> because of swiotlb full) our virtqueue may be empty and we can get
> stuck with a stopped hw_queue.
> 
> Let us not stop the queue on arbitrary errors, but only on -ENOSPC, which
> indicates a full virtqueue, where the hw_queue is guaranteed to get
> started again by virtblk_done() before it makes sense to carry on
> submitting requests. Let us also remove a stale comment.
> 
> Signed-off-by: Halil Pasic 
> Cc: Jens Axboe 
> Fixes: f7728002c1c7 ("virtio_ring: fix return code on DMA mapping fails")
> ---
> 
> I'm in doubt with regards to the Fixes tag. The thing is, virtio-blk
> probably made an assumption on virtqueue_add: the failure is either
> because the virtqueue is full, or the failure is fatal. In both cases it
> seems acceptable to stop the queue, although the fatal case is arguable.
> Since commit f7728002c1c7, a failed dma mapping returns -ENOMEM
> and not -EIO, and thus we have a recoverable failure that ain't
> virtqueue full. So I lean towards a fixes tag that references that
> commit, although it ain't broken. Alternatively one could say 'Fixes:
> e467cde23818 ("Block driver using virtio.")', if the aforementioned
> assumption shouldn't have been made in the first place.
> ---
>  drivers/block/virtio_blk.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 54158766334b..adfe43f5ffe4 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -245,10 +245,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>   err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
>   if (err) {
>   virtqueue_kick(vblk->vqs[qid].vq);
> - blk_mq_stop_hw_queue(hctx);
> + /* Don't stop the queue if -ENOMEM: we may have failed to
> +  * bounce the buffer due to global resource outage.
> +  */
> + if (err == -ENOSPC)
> + blk_mq_stop_hw_queue(hctx);
>   spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
> - /* Out of mem doesn't actually happen, since we fall ba

[PATCH 1/1] scsi: virtio_scsi: remove unused 'affinity_hint_set'

2019-06-19 Thread Dongli Zhang
The 'affinity_hint_set' is not used any longer since
commit 0d9f0a52c8b9 ("virtio_scsi: use virtio IRQ affinity").

Signed-off-by: Dongli Zhang 
---
 drivers/scsi/virtio_scsi.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 13f1b3b..1705398 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -74,9 +74,6 @@ struct virtio_scsi {
 
u32 num_queues;
 
-   /* If the affinity hint is set for virtqueues */
-   bool affinity_hint_set;
-
struct hlist_node node;
 
/* Protected by event_vq lock */
-- 
2.7.4



Re: [PATCH 1/1] virtio_blk: replace 0 by HCTX_TYPE_DEFAULT to index blk_mq_tag_set->map

2019-04-08 Thread Dongli Zhang
ping?

A similar patchset has already been queued in linux-scsi 5.2/scsi-queue.

virtio-blk is the last place where 0 is hardcoded instead of HCTX_TYPE_DEFAULT.

It would be more friendly to cscope if the hardcoding were avoided.

Thank you very much!

Dongli Zhang

On 03/12/2019 09:31 AM, Dongli Zhang wrote:
> Use HCTX_TYPE_DEFAULT instead of 0 to avoid hardcoding.
> 
> Signed-off-by: Dongli Zhang 
> ---
> This is the only place to cleanup under drivers/block/*. Another patch set
> is sent to scsi and then all of below are cleaned up:
> - block/*
> - drivers/*
> 
>  drivers/block/virtio_blk.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 4bc083b..bed6035 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -691,7 +691,8 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set)
>  {
>   struct virtio_blk *vblk = set->driver_data;
>  
> - return blk_mq_virtio_map_queues(&set->map[0], vblk->vdev, 0);
> + return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT],
> + vblk->vdev, 0);
>  }
>  
>  #ifdef CONFIG_VIRTIO_BLK_SCSI
> 


Re: [PATCH 0/2] Limit number of hw queues by nr_cpu_ids for virtio-blk and virtio-scsi

2019-04-08 Thread Dongli Zhang
ping?

Thank you very much!

Dongli Zhang

On 03/27/2019 06:36 PM, Dongli Zhang wrote:
> When tag_set->nr_maps is 1, the block layer limits the number of hw queues
> by nr_cpu_ids. No matter how many hw queues are used by
> virtio-blk/virtio-scsi, as they both have (tag_set->nr_maps == 1), they
> can use at most nr_cpu_ids hw queues.
> 
> In addition, specifically for pci scenario, when the 'num-queues' specified
> by qemu is more than maxcpus, virtio-blk/virtio-scsi would not be able to
> allocate more than maxcpus vectors in order to have a vector for each
> queue. As a result, they fall back into MSI-X with one vector for config
> and one shared for queues.
> 
> Considering above reasons, this patch set limits the number of hw queues
> used by nr_cpu_ids for both virtio-blk and virtio-scsi.
> 
> -
> 
> Here is test result of virtio-scsi:
> 
> qemu cmdline:
> 
> -smp 2,maxcpus=4, \
> -device virtio-scsi-pci,id=scsi0,num_queues=8, \
> -device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0, \
> -drive file=test.img,if=none,id=drive0
> 
> Although maxcpus=4 and num_queues=8, 4 queues are used while 2 interrupts
> are allocated.
> 
> # cat /proc/interrupts
> ... ...
>  24:  0  0   PCI-MSI 65536-edge  virtio0-config
>  25:  0369   PCI-MSI 65537-edge  virtio0-virtqueues
> ... ...
> 
> # ls /sys/block/sda/mq/
> 0  1  2  3   --> 4 queues although qemu sets num_queues=8
> 
> 
> With the patch set, there is per-queue interrupt.
> 
> # cat /proc/interrupts
>  24:  0  0   PCI-MSI 65536-edge  virtio0-config
>  25:  0  0   PCI-MSI 65537-edge  virtio0-control
>  26:  0  0   PCI-MSI 65538-edge  virtio0-event
>  27:296  0   PCI-MSI 65539-edge  virtio0-request
>  28:  0139   PCI-MSI 65540-edge  virtio0-request
>  29:  0  0   PCI-MSI 65541-edge  virtio0-request
>  30:  0  0   PCI-MSI 65542-edge  virtio0-request
> 
> # ls /sys/block/sda/mq
> 0  1  2  3
> 
> -
> 
> Here is test result of virtio-blk:
> 
> qemu cmdline:
> 
> -smp 2,maxcpus=4,
> -device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-disk0,num-queues=8
> -drive test.img,format=raw,if=none,id=drive-virtio-disk0
> 
> Although maxcpus=4 and num-queues=8, 4 queues are used while 2 interrupts
> are allocated.
> 
> # cat /proc/interrupts
> ... ...
>  24:  0  0   PCI-MSI 65536-edge  virtio0-config
>  25:  0 65   PCI-MSI 65537-edge  virtio0-virtqueues
> ... ...
> 
> # ls /sys/block/vda/mq
> 0  1  2  3---> 4 queues although qemu sets num_queues=8
> 
> 
> With the patch set, there is per-queue interrupt.
> 
> # cat /proc/interrupts
>  24:  0  0   PCI-MSI 65536-edge  virtio0-config
>  25: 64  0   PCI-MSI 65537-edge  virtio0-req.0
>  26:  0  10290   PCI-MSI 65538-edge  virtio0-req.1
>  27:  0  0   PCI-MSI 65539-edge  virtio0-req.2
>  28:  0  0   PCI-MSI 65540-edge  virtio0-req.3
> 
> # ls /sys/block/vda/mq/
> 0  1  2  3
> 
> 
> Reference: 
> https://lore.kernel.org/lkml/e4afe4c5-0262-4500-aeec-60f30734b4fc@default/
> 
> Thank you very much!
> 
> Dongli Zhang
> 


[PATCH 2/2] scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids

2019-03-27 Thread Dongli Zhang
When tag_set->nr_maps is 1, the block layer limits the number of hw queues
by nr_cpu_ids. No matter how many hw queues are used by virtio-scsi, as it
has (tag_set->nr_maps == 1), it can use at most nr_cpu_ids hw queues.

In addition, specifically for pci scenario, when the 'num_queues' specified
by qemu is more than maxcpus, virtio-scsi would not be able to allocate
more than maxcpus vectors in order to have a vector for each queue. As a
result, it falls back into MSI-X with one vector for config and one shared
for queues.

Considering above reasons, this patch limits the number of hw queues used
by virtio-scsi by nr_cpu_ids.

Signed-off-by: Dongli Zhang 
---
 drivers/scsi/virtio_scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 8af0177..9c4a3e1 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -793,6 +793,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
 
/* We need to know how many queues before we allocate. */
num_queues = virtscsi_config_get(vdev, num_queues) ? : 1;
+   num_queues = min_t(unsigned int, nr_cpu_ids, num_queues);
 
num_targets = virtscsi_config_get(vdev, max_target) + 1;
 
-- 
2.7.4



[PATCH 1/2] virtio-blk: limit number of hw queues by nr_cpu_ids

2019-03-27 Thread Dongli Zhang
When tag_set->nr_maps is 1, the block layer limits the number of hw queues
by nr_cpu_ids. No matter how many hw queues are used by virtio-blk, as it
has (tag_set->nr_maps == 1), it can use at most nr_cpu_ids hw queues.

In addition, specifically for pci scenario, when the 'num-queues' specified
by qemu is more than maxcpus, virtio-blk would not be able to allocate more
than maxcpus vectors in order to have a vector for each queue. As a result,
it falls back into MSI-X with one vector for config and one shared for
queues.

Considering above reasons, this patch limits the number of hw queues used
by virtio-blk by nr_cpu_ids.

Signed-off-by: Dongli Zhang 
---
 drivers/block/virtio_blk.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..b83cb45 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
if (err)
num_vqs = 1;
 
+   num_vqs = min_t(unsigned int, nr_cpu_ids, num_vqs);
+
vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
if (!vblk->vqs)
return -ENOMEM;
-- 
2.7.4



[PATCH 0/2] Limit number of hw queues by nr_cpu_ids for virtio-blk and virtio-scsi

2019-03-27 Thread Dongli Zhang
When tag_set->nr_maps is 1, the block layer limits the number of hw queues
by nr_cpu_ids. No matter how many hw queues are used by
virtio-blk/virtio-scsi, as they both have (tag_set->nr_maps == 1), they
can use at most nr_cpu_ids hw queues.

In addition, specifically for pci scenario, when the 'num-queues' specified
by qemu is more than maxcpus, virtio-blk/virtio-scsi would not be able to
allocate more than maxcpus vectors in order to have a vector for each
queue. As a result, they fall back into MSI-X with one vector for config
and one shared for queues.

Considering above reasons, this patch set limits the number of hw queues
used by nr_cpu_ids for both virtio-blk and virtio-scsi.
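The clamp applied by both patches is a one-liner (min_t() in the kernel). A
userspace sketch, with illustrative names, of what the series enforces:

```c
#include <assert.h>

/*
 * Sketch of the clamping in this series: never use more hardware
 * queues than there are CPU ids, whatever the device advertises via
 * its num-queues / num_queues property.
 */
unsigned int limit_hw_queues(unsigned int nr_cpu_ids,
                             unsigned int num_queues)
{
    return num_queues < nr_cpu_ids ? num_queues : nr_cpu_ids;
}
```

With the qemu command lines below (maxcpus=4, num_queues=8), this yields 4
usable queues, matching the /sys/block/*/mq listings shown.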

-

Here is test result of virtio-scsi:

qemu cmdline:

-smp 2,maxcpus=4, \
-device virtio-scsi-pci,id=scsi0,num_queues=8, \
-device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0, \
-drive file=test.img,if=none,id=drive0

Although maxcpus=4 and num_queues=8, 4 queues are used while 2 interrupts
are allocated.

# cat /proc/interrupts
... ...
 24:  0  0   PCI-MSI 65536-edge  virtio0-config
 25:  0369   PCI-MSI 65537-edge  virtio0-virtqueues
... ...

# ls /sys/block/sda/mq/
0  1  2  3   --> 4 queues although qemu sets num_queues=8


With the patch set, there is per-queue interrupt.

# cat /proc/interrupts
 24:  0  0   PCI-MSI 65536-edge  virtio0-config
 25:  0  0   PCI-MSI 65537-edge  virtio0-control
 26:  0  0   PCI-MSI 65538-edge  virtio0-event
 27:296  0   PCI-MSI 65539-edge  virtio0-request
 28:  0139   PCI-MSI 65540-edge  virtio0-request
 29:  0  0   PCI-MSI 65541-edge  virtio0-request
 30:  0  0   PCI-MSI 65542-edge  virtio0-request

# ls /sys/block/sda/mq
0  1  2  3

-

Here is test result of virtio-blk:

qemu cmdline:

-smp 2,maxcpus=4,
-device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-disk0,num-queues=8
-drive test.img,format=raw,if=none,id=drive-virtio-disk0

Although maxcpus=4 and num-queues=8, 4 queues are used while 2 interrupts
are allocated.

# cat /proc/interrupts
... ...
 24:  0  0   PCI-MSI 65536-edge  virtio0-config
 25:  0 65   PCI-MSI 65537-edge  virtio0-virtqueues
... ...

# ls /sys/block/vda/mq
0  1  2  3---> 4 queues although qemu sets num_queues=8


With the patch set, there is per-queue interrupt.

# cat /proc/interrupts
 24:  0  0   PCI-MSI 65536-edge  virtio0-config
 25: 64  0   PCI-MSI 65537-edge  virtio0-req.0
 26:  0  10290   PCI-MSI 65538-edge  virtio0-req.1
 27:  0  0   PCI-MSI 65539-edge  virtio0-req.2
 28:  0  0   PCI-MSI 65540-edge  virtio0-req.3

# ls /sys/block/vda/mq/
0  1  2  3


Reference: 
https://lore.kernel.org/lkml/e4afe4c5-0262-4500-aeec-60f30734b4fc@default/

Thank you very much!

Dongli Zhang



Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-20 Thread Dongli Zhang


On 3/20/19 8:53 PM, Jason Wang wrote:
> 
> On 2019/3/19 10:22 AM, Dongli Zhang wrote:
>> Hi Jason,
>>
>> On 3/18/19 3:47 PM, Jason Wang wrote:
>>> On 2019/3/15 下午8:41, Cornelia Huck wrote:
>>>> On Fri, 15 Mar 2019 12:50:11 +0800
>>>> Jason Wang  wrote:
>>>>
>>>>> Or something like I proposed several years ago?
>>>>> https://lkml.org/lkml/2014/12/25/169
>>>>>
>>>>> Btw, for virtio-net, I think we actually want to go for having a maximum
>>>>> number of supported queues like what hardware did. This would be useful
>>>>> for e.g cpu hotplug or XDP (requires per cpu TX queue). But the current
>>>>> vector allocation doesn't support this which will results all virtqueues
>>>>> to share a single vector. We may indeed need more flexible policy here.
>>>> I think it should be possible for the driver to give the transport
>>>> hints how to set up their queues/interrupt structures. (The driver
>>>> probably knows best about its requirements.) Perhaps whether a queue is
>>>> high or low frequency, or whether it should be low latency, or even
>>>> whether two queues could share a notification mechanism without
>>>> drawbacks. It's up to the transport to make use of that information, if
>>>> possible.
>>>
>>> Exactly and it was what the above series tried to do by providing hints of 
>>> e.g
>>> which queues want to share a notification.
>>>
>> I read about your patch set on providing more flexibility of queue-to-vector
>> mapping.
>>
>> One use case of the patch set is we would be able to enable more queues when
>> there is limited number of vectors.
>>
>> Another use case we may classify queues as high priority or low priority as
>> mentioned by Cornelia.
>>
>> For virtio-blk, we may extend virtio-blk based on this patch set to enable
>> something similar to write_queues/poll_queues in nvme, when (set->nr_maps != 
>> 1).
>>
>>
>> Yet, the question I am asking in this email thread is for a different scenario.
>>
>> The issue is not we are not having enough vectors (although this is why only 
>> 1
>> vector is allocated for all virtio-blk queues). As so far virtio-blk has
>> (set->nr_maps == 1), block layer would limit the number of hw queues by
>> nr_cpu_ids, we indeed do not need more than nr_cpu_ids hw queues in 
>> virtio-blk.
>>
>> That's why I ask why not change the flow as below options when the number of
>> supported hw queues is more than nr_cpu_ids (and set->nr_maps == 1. 
>> virtio-blk
>> does not set nr_maps and block layer would set it to 1 when the driver does 
>> not
>> specify with a value):
>>
>> option 1:
>> As what nvme and xen-netfront do, limit the hw queue number by nr_cpu_ids.
> 
> 
> How do they limit the hw queue number? A command?

The max #queue is also limited by other factors, e.g., kernel param
configuration, xen dom0 configuration or nvme hardware support.

Here we would ignore those factors for simplicity and only talk about the
relation between #queue and #cpu.


About nvme pci:

Regardless of the new write_queues and poll_queues, the default number of
queues is limited by num_possible_cpus(), as at lines 2120 and 252.

2113 static int nvme_setup_io_queues(struct nvme_dev *dev)
2114 {
2115 struct nvme_queue *adminq = &dev->queues[0];
2116 struct pci_dev *pdev = to_pci_dev(dev->dev);
2117 int result, nr_io_queues;
2118 unsigned long size;
2119
2120 nr_io_queues = max_io_queues();
2121 result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);

 250 static unsigned int max_io_queues(void)
 251 {
 252 return num_possible_cpus() + write_queues + poll_queues;
 253 }

The cons of this is that there might be many unused hw queues and vectors
when num_possible_cpus() is very large while only a small number of CPUs are
online. I am looking for a way to improve this.
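The computation quoted from nvme above can be restated as a trivial userspace model (names are illustrative; in the kernel these are module parameters and num_possible_cpus()):

```c
#include <assert.h>

/*
 * Model of nvme's max_io_queues(): the default queue count scales
 * with possible CPUs, plus the explicitly requested write and poll
 * queues — which is why a large possible-CPU count can leave many
 * queues and vectors unused when few CPUs are online.
 */
unsigned int max_io_queues_model(unsigned int num_possible_cpus,
                                 unsigned int write_queues,
                                 unsigned int poll_queues)
{
    return num_possible_cpus + write_queues + poll_queues;
}
```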



About xen-blkfront:

Indeed, the max #queue is limited by num_online_cpus() when the xen-blkfront
module is loaded, as at lines 2733 and 2736.

2707 static int __init xlblk_init(void)
... ...
2710 int nr_cpus = num_online_cpus();
... ...
2733 if (xen_blkif_max_queues > nr_cpus) {
2734 pr_info("Invalid max_queues (%d), will use default max: %d.\n",
2735 xen_blkif_max_queues, nr_cpus);
2736 xen_blkif_max_queues = nr_cpus;
2737     }

The cons of this is that the number of hw-queue/hctx is limited and cannot
increase after cpu hotplug. I am looking if there is way to i

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-18 Thread Dongli Zhang
Hi Jason,

On 3/18/19 3:47 PM, Jason Wang wrote:
> 
> On 2019/3/15 下午8:41, Cornelia Huck wrote:
>> On Fri, 15 Mar 2019 12:50:11 +0800
>> Jason Wang  wrote:
>>
>>> Or something like I proposed several years ago?
>>> https://lkml.org/lkml/2014/12/25/169
>>>
>>> Btw, for virtio-net, I think we actually want to go for having a maximum
>>> number of supported queues like what hardware did. This would be useful
>>> for e.g cpu hotplug or XDP (requires per cpu TX queue). But the current
>>> vector allocation doesn't support this which will results all virtqueues
>>> to share a single vector. We may indeed need more flexible policy here.
>> I think it should be possible for the driver to give the transport
>> hints how to set up their queues/interrupt structures. (The driver
>> probably knows best about its requirements.) Perhaps whether a queue is
>> high or low frequency, or whether it should be low latency, or even
>> whether two queues could share a notification mechanism without
>> drawbacks. It's up to the transport to make use of that information, if
>> possible.
> 
> 
> Exactly and it was what the above series tried to do by providing hints of e.g
> which queues want to share a notification.
> 

I read about your patch set on providing more flexibility of queue-to-vector
mapping.

One use case of the patch set is that we would be able to enable more queues
when there is a limited number of vectors.

Another use case is that we may classify queues as high priority or low
priority, as mentioned by Cornelia.

For virtio-blk, we may extend virtio-blk based on this patch set to enable
something similar to write_queues/poll_queues in nvme, when (set->nr_maps != 1).


Yet, the question I am asking in this email thread is for a different scenario.

The issue is not that we do not have enough vectors (although that is why only 1
vector is allocated for all virtio-blk queues). Since virtio-blk so far has
(set->nr_maps == 1), the block layer would limit the number of hw queues by
nr_cpu_ids; we indeed do not need more than nr_cpu_ids hw queues in virtio-blk.

That's why I ask why not change the flow as in the options below when the number
of supported hw queues is more than nr_cpu_ids (and set->nr_maps == 1;
virtio-blk does not set nr_maps, and the block layer sets it to 1 when the
driver does not specify a value):

option 1:
As what nvme and xen-netfront do, limit the hw queue number by nr_cpu_ids.

option 2:
If there are not enough vectors, use the max number of vectors (indeed
nr_cpu_ids) as the number of hw queues.

option 3:
We should allow more vectors even though the block layer would support at most
nr_cpu_ids queues.


I understand a new policy for queue-to-vector mapping would be very helpful. I
am just asking the question from the block layer's point of view.

Thank you very much!

Dongli Zhang
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Dongli Zhang



On 03/14/2019 08:13 PM, Cornelia Huck wrote:
> On Thu, 14 Mar 2019 14:12:32 +0800
> Dongli Zhang  wrote:
> 
>> On 3/13/19 5:39 PM, Cornelia Huck wrote:
>>> On Wed, 13 Mar 2019 11:26:04 +0800
>>> Dongli Zhang  wrote:
>>>   
>>>> On 3/13/19 1:33 AM, Cornelia Huck wrote:  
>>>>> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
>>>>> Dongli Zhang  wrote:
> 
>>>>>> Is this by design on purpose, or can we fix with below?
>>>>>>
>>>>>>
>>>>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>>>>>> index 4bc083b..df95ce3 100644
>>>>>> --- a/drivers/block/virtio_blk.c
>>>>>> +++ b/drivers/block/virtio_blk.c
>>>>>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>>>>>  if (err)
>>>>>>  num_vqs = 1;
>>>>>>  
>>>>>> +num_vqs = min(num_possible_cpus(), num_vqs);
>>>>>> +
>>>>>>  vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), 
>>>>>> GFP_KERNEL);
>>>>>>  if (!vblk->vqs)
>>>>>>  return -ENOMEM;
>>>>>
>>>>> virtio-blk, however, is not pci-specific.
>>>>>
>>>>> If we are using the ccw transport on s390, a completely different
>>>>> interrupt mechanism is in use ('floating' interrupts, which are not
>>>>> per-cpu). A check like that should therefore not go into the generic
>>>>> driver.
>>>>> 
>>>>
>>>> So far there seems two options.
>>>>
>>>> The 1st option is to ask the qemu user to always specify "-num-queues" 
>>>> with the
>>>> same number of vcpus when running x86 guest with pci for virtio-blk or
>>>> virtio-scsi, in order to assign a vector for each queue.  
>>>
>>> That does seem like an extra burden for the user: IIUC, things work
>>> even if you have too many queues, it's just not optimal. It sounds like
>>> something that can be done by a management layer (e.g. libvirt), though.
>>>   
>>>> Or, is it fine for virtio folks to add a new hook to 'struct 
>>>> virtio_config_ops'
>>>> so that different platforms (e.g., pci or ccw) would use different ways to 
>>>> limit
>>>> the max number of queues in guest, with something like below?  
>>>
>>> That sounds better, as both transports and drivers can opt-in here.
>>>
>>> However, maybe it would be even better to try to come up with a better
>>> strategy of allocating msix vectors in virtio-pci. More vectors in the
>>> num_queues > num_cpus case, even if they still need to be shared?
>>> Individual vectors for n-1 cpus and then a shared one for the remaining
>>> queues?
>>>
>>> It might even be device-specific: Have some low-traffic status queues
>>> share a vector, and provide an individual vector for high-traffic
>>> queues. Would need some device<->transport interface, obviously.
>>>   
>>
>> This sounds a little bit similar to multiple hctx maps?
>>
>> So far, as virtio-blk only supports set->nr_maps = 1, no matter how many hw
>> queues are assigned for virtio-blk, blk_mq_alloc_tag_set() would use at most
>> nr_cpu_ids hw queues.
>>
>> 2981 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>> ... ...
>> 3021 /*
>> 3022  * There is no use for more h/w queues than cpus if we just have
>> 3023  * a single map
>> 3024  */
>> 3025 if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
>> 3026 set->nr_hw_queues = nr_cpu_ids;
>>
>> Even the block layer would limit the number of hw queues by nr_cpu_ids when
>> (set->nr_maps == 1).
> 
> Correct me if I'm wrong, but there seem to be two kinds of limitations
> involved here:
> - Allocation of msix vectors by the virtio-pci transport. We end up
>   with shared vectors if we have more virtqueues than vcpus. Other
>   transports may or may not have similar issues, but essentially, this
>   is something that applies to all kind of virtio devices attached via
>   the virtio-pci transport.

It depends.

For virtio-net, we need to specify the number of available vectors on the qemu
side, e.g.:

-device virtio-net-pci,netdev=tapnet,mq=true,vectors=16

This parameter is specif

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Dongli Zhang



On 03/14/2019 08:32 PM, Michael S. Tsirkin wrote:
> On Tue, Mar 12, 2019 at 10:22:46AM -0700, Dongli Zhang wrote:
>> I observed that there is one msix vector for config and one shared vector
>> for all queues in below qemu cmdline, when the num-queues for virtio-blk
>> is more than the number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"
> 
> So why do this?

I observed this when I was testing virtio-blk and block layer.

I just assign a very large 'num-queues' to virtio-blk and keep changing the
number of vcpus in order to study blk-mq.

The num-queues for nvme (qemu) is by default 64, while it is 1 for virtio-blk.

> 
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ...
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  0  0  0 59   PCI-MSI 65537-edge  
>> virtio0-virtqueues
>> ... ...
>>
>>
>> However, when num-queues is the same as number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
>>
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ... 
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  2  0  0  0   PCI-MSI 65537-edge  
>> virtio0-req.0
>>  26:  0 35  0  0   PCI-MSI 65538-edge  
>> virtio0-req.1
>>  27:  0  0 32  0   PCI-MSI 65539-edge  
>> virtio0-req.2
>>  28:  0  0  0  0   PCI-MSI 65540-edge  
>> virtio0-req.3
>> ... ...
>>
>> In above case, there is one msix vector per queue.
>>
>>
>> This is because the max number of queues is not limited by the number of
>> possible cpus.
>>
>> By default, nvme (regardless about write_queues and poll_queues) and
>> xen-blkfront limit the number of queues with num_possible_cpus().
>>
>>
>> Is this by design on purpose, or can we fix with below?
>>
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 4bc083b..df95ce3 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>  if (err)
>>  num_vqs = 1;
>>  
>> +num_vqs = min(num_possible_cpus(), num_vqs);
>> +
>>  vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>>  if (!vblk->vqs)
>>  return -ENOMEM;
>> --
>>
>>
>> PS: The same issue is applicable to virtio-scsi as well.
>>
>> Thank you very much!
>>
>> Dongli Zhang
> 
> I don't think this will address the issue if there's vcpu hotplug though.
> Because it's not about num_possible_cpus it's about the # of active VCPUs,
> right? Does block handle CPU hotplug generally?
> We could maybe address that by switching vq to msi vector mapping in
> a cpu hotplug notifier...
> 

It looks like it is about num_possible_cpus/nr_cpu_ids for cpu hotplug.


For instance, below is when only 2 out of 6 cpus are initialized while
virtio-blk has 6 queues.

"-smp 2,maxcpus=6" and "-device
virtio-blk-pci,drive=drive0,id=disk0,num-queues=6,iothread=io1"

# cat /sys/devices/system/cpu/present
0-1
# cat /sys/devices/system/cpu/possible
0-5
# cat /proc/interrupts  | grep virtio
 24:  0  0   PCI-MSI 65536-edge  virtio0-config
 25:   1864  0   PCI-MSI 65537-edge  virtio0-req.0
 26:  0   1069   PCI-MSI 65538-edge  virtio0-req.1
 27:  0  0   PCI-MSI 65539-edge  virtio0-req.2
 28:  0  0   PCI-MSI 65540-edge  virtio0-req.3
 29:  0  0   PCI-MSI 65541-edge  virtio0-req.4
 30:  0  0   PCI-MSI 65542-edge  virtio0-req.5

6 + 1 irqs are assigned even though 4 out of 6 cpus are still offline.


Below is about the nvme emulated by qemu. While 2 out of 6 cpus are initially
assigned, nvme has 64 queues by default.

"-smp 2,maxcpus=6" and "-device nvme,drive=drive1,serial=deadbeaf1"

# cat /sys/devices/system/cpu/present
0-1
# cat /sys/devices/system/cpu/possible
0-5
# cat /proc/interrupts | grep nvme
 31:  0 16   PCI-MSI 81920-edge  nvme0q0
 32: 35  0   PCI-MSI 81921-edge  nvme0q1
 33:  0 42   PCI-MSI 81922-edge  nvme0q2
 34:   

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Dongli Zhang



On 3/13/19 5:39 PM, Cornelia Huck wrote:
> On Wed, 13 Mar 2019 11:26:04 +0800
> Dongli Zhang  wrote:
> 
>> On 3/13/19 1:33 AM, Cornelia Huck wrote:
>>> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
>>> Dongli Zhang  wrote:
>>>   
>>>> I observed that there is one msix vector for config and one shared vector
>>>> for all queues in below qemu cmdline, when the num-queues for virtio-blk
>>>> is more than the number of possible cpus:
>>>>
>>>> qemu: "-smp 4" while "-device 
>>>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"
>>>>
>>>> # cat /proc/interrupts 
>>>>CPU0   CPU1   CPU2   CPU3
>>>> ... ...
>>>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>>>> virtio0-config
>>>>  25:  0  0  0 59   PCI-MSI 65537-edge  
>>>> virtio0-virtqueues
>>>> ... ...
>>>>
>>>>
>>>> However, when num-queues is the same as number of possible cpus:
>>>>
>>>> qemu: "-smp 4" while "-device 
>>>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
>>>>
>>>> # cat /proc/interrupts 
>>>>CPU0   CPU1   CPU2   CPU3
>>>> ... ... 
>>>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>>>> virtio0-config
>>>>  25:  2  0  0  0   PCI-MSI 65537-edge  
>>>> virtio0-req.0
>>>>  26:  0 35  0  0   PCI-MSI 65538-edge  
>>>> virtio0-req.1
>>>>  27:  0  0 32  0   PCI-MSI 65539-edge  
>>>> virtio0-req.2
>>>>  28:  0  0  0  0   PCI-MSI 65540-edge  
>>>> virtio0-req.3
>>>> ... ...
>>>>
>>>> In above case, there is one msix vector per queue.  
>>>
>>> Please note that this is pci-specific...
>>>   
>>>>
>>>>
>>>> This is because the max number of queues is not limited by the number of
>>>> possible cpus.
>>>>
>>>> By default, nvme (regardless about write_queues and poll_queues) and
>>>> xen-blkfront limit the number of queues with num_possible_cpus().  
>>>
>>> ...and these are probably pci-specific as well.  
>>
>> Not pci-specific, but per-cpu as well.
> 
> Ah, I meant that those are pci devices.
> 
>>
>>>   
>>>>
>>>>
>>>> Is this by design on purpose, or can we fix with below?
>>>>
>>>>
>>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>>>> index 4bc083b..df95ce3 100644
>>>> --- a/drivers/block/virtio_blk.c
>>>> +++ b/drivers/block/virtio_blk.c
>>>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>>>if (err)
>>>>num_vqs = 1;
>>>>  
>>>> +  num_vqs = min(num_possible_cpus(), num_vqs);
>>>> +
>>>>vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>>>>if (!vblk->vqs)
>>>>return -ENOMEM;  
>>>
>>> virtio-blk, however, is not pci-specific.
>>>
>>> If we are using the ccw transport on s390, a completely different
>>> interrupt mechanism is in use ('floating' interrupts, which are not
>>> per-cpu). A check like that should therefore not go into the generic
>>> driver.
>>>   
>>
>> So far there seems two options.
>>
>> The 1st option is to ask the qemu user to always specify "-num-queues" with 
>> the
>> same number of vcpus when running x86 guest with pci for virtio-blk or
>> virtio-scsi, in order to assign a vector for each queue.
> 
> That does seem like an extra burden for the user: IIUC, things work
> even if you have too many queues, it's just not optimal. It sounds like
> something that can be done by a management layer (e.g. libvirt), though.
> 
>> Or, is it fine for virtio folks to add a new hook to 'struct 
>> virtio_config_ops'
>> so that different platforms (e.g., pci or ccw) would use different ways to 
>> limit
>> the max number of queues in guest, with something like below?
> 
> That sounds better, as both transports and drivers can opt-in here.
> 
> However, maybe it wo

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-12 Thread Dongli Zhang



On 3/13/19 1:33 AM, Cornelia Huck wrote:
> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
> Dongli Zhang  wrote:
> 
>> I observed that there is one msix vector for config and one shared vector
>> for all queues in below qemu cmdline, when the num-queues for virtio-blk
>> is more than the number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"
>>
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ...
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  0  0  0 59   PCI-MSI 65537-edge  
>> virtio0-virtqueues
>> ... ...
>>
>>
>> However, when num-queues is the same as number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
>>
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ... 
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  2  0  0  0   PCI-MSI 65537-edge  
>> virtio0-req.0
>>  26:  0 35  0  0   PCI-MSI 65538-edge  
>> virtio0-req.1
>>  27:  0  0 32  0   PCI-MSI 65539-edge  
>> virtio0-req.2
>>  28:  0  0  0  0   PCI-MSI 65540-edge  
>> virtio0-req.3
>> ... ...
>>
>> In above case, there is one msix vector per queue.
> 
> Please note that this is pci-specific...
> 
>>
>>
>> This is because the max number of queues is not limited by the number of
>> possible cpus.
>>
>> By default, nvme (regardless about write_queues and poll_queues) and
>> xen-blkfront limit the number of queues with num_possible_cpus().
> 
> ...and these are probably pci-specific as well.

Not pci-specific, but per-cpu as well.

> 
>>
>>
>> Is this by design on purpose, or can we fix with below?
>>
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 4bc083b..df95ce3 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>  if (err)
>>  num_vqs = 1;
>>  
>> +num_vqs = min(num_possible_cpus(), num_vqs);
>> +
>>  vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>>  if (!vblk->vqs)
>>  return -ENOMEM;
> 
> virtio-blk, however, is not pci-specific.
> 
> If we are using the ccw transport on s390, a completely different
> interrupt mechanism is in use ('floating' interrupts, which are not
> per-cpu). A check like that should therefore not go into the generic
> driver.
> 

So far there seem to be two options.

The 1st option is to ask the qemu user to always specify "-num-queues" with the
same number of vcpus when running an x86 guest with pci for virtio-blk or
virtio-scsi, in order to assign a vector for each queue.

Or, is it fine for virtio folks to add a new hook to 'struct virtio_config_ops'
so that different platforms (e.g., pci or ccw) would use different ways to limit
the max number of queues in guest, with something like below?

So far both virtio-scsi and virtio-blk would benefit from the new hook.

---
 drivers/block/virtio_blk.c |  2 ++
 drivers/virtio/virtio_pci_common.c |  6 ++
 drivers/virtio/virtio_pci_common.h |  2 ++
 drivers/virtio/virtio_pci_legacy.c |  1 +
 drivers/virtio/virtio_pci_modern.c |  2 ++
 include/linux/virtio_config.h  | 11 +++
 6 files changed, 24 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..93cfeda 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
if (err)
num_vqs = 1;

+   num_vqs = virtio_calc_num_vqs(vdev, num_vqs);
+
vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
if (!vblk->vqs)
return -ENOMEM;
diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index d0584c0..ce021d1 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -409,6 +409,12 @@ int vp_find_vqs(struct virtio_device *vdev, unsigned nvqs,
return vp_find_vqs_intx(vdev, nvqs, vqs, callbacks, names, ctx);
 }

+/* the config->calc_num_vqs() implementation */
+unsigned short vp_calc_nu

virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-12 Thread Dongli Zhang
I observed that there is one msix vector for config and one shared vector
for all queues with the qemu cmdline below, when the num-queues for virtio-blk
is more than the number of possible cpus:

qemu: "-smp 4" while "-device 
virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"

# cat /proc/interrupts 
   CPU0   CPU1   CPU2   CPU3
... ...
 24:  0  0  0  0   PCI-MSI 65536-edge  
virtio0-config
 25:  0  0  0 59   PCI-MSI 65537-edge  
virtio0-virtqueues
... ...


However, when num-queues is the same as number of possible cpus:

qemu: "-smp 4" while "-device 
virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"

# cat /proc/interrupts 
   CPU0   CPU1   CPU2   CPU3
... ... 
 24:  0  0  0  0   PCI-MSI 65536-edge  
virtio0-config
 25:  2  0  0  0   PCI-MSI 65537-edge  
virtio0-req.0
 26:  0 35  0  0   PCI-MSI 65538-edge  
virtio0-req.1
 27:  0  0 32  0   PCI-MSI 65539-edge  
virtio0-req.2
 28:  0  0  0  0   PCI-MSI 65540-edge  
virtio0-req.3
... ...

In the above case, there is one msix vector per queue.


This is because the max number of queues is not limited by the number of
possible cpus.

By default, nvme (regardless of write_queues and poll_queues) and
xen-blkfront limit the number of queues with num_possible_cpus().


Is this by design on purpose, or can we fix with below?


diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..df95ce3 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
if (err)
num_vqs = 1;
 
+   num_vqs = min(num_possible_cpus(), num_vqs);
+
vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
if (!vblk->vqs)
return -ENOMEM;
--


PS: The same issue is applicable to virtio-scsi as well.

Thank you very much!

Dongli Zhang


[PATCH 1/1] virtio_blk: replace 0 by HCTX_TYPE_DEFAULT to index blk_mq_tag_set->map

2019-03-11 Thread Dongli Zhang
Use HCTX_TYPE_DEFAULT instead of 0 to avoid hardcoding.

Signed-off-by: Dongli Zhang 
---
This is the only place to clean up under drivers/block/*. Another patch set
was sent to scsi, after which all of the below are cleaned up:
- block/*
- drivers/*

 drivers/block/virtio_blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..bed6035 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -691,7 +691,8 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set)
 {
struct virtio_blk *vblk = set->driver_data;
 
-   return blk_mq_virtio_map_queues(&set->map[0], vblk->vdev, 0);
+   return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT],
+   vblk->vdev, 0);
 }
 
 #ifdef CONFIG_VIRTIO_BLK_SCSI
-- 
2.7.4



[PATCH v2 1/1] virtio: remove deprecated VIRTIO_PCI_CONFIG()

2018-12-09 Thread Dongli Zhang
VIRTIO_PCI_CONFIG() is deprecated. Use VIRTIO_PCI_CONFIG_OFF() instead.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  * leave the definition of VIRTIO_PCI_CONFIG() in header file to make 
userspace build happy

 drivers/virtio/virtio_pci_legacy.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index de062fb..eff9ddc 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -52,7 +52,8 @@ static void vp_get(struct virtio_device *vdev, unsigned 
offset,
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
void __iomem *ioaddr = vp_dev->ioaddr +
-   VIRTIO_PCI_CONFIG(vp_dev) + offset;
+   VIRTIO_PCI_CONFIG_OFF(vp_dev->msix_enabled) +
+   offset;
u8 *ptr = buf;
int i;
 
@@ -67,7 +68,8 @@ static void vp_set(struct virtio_device *vdev, unsigned 
offset,
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
void __iomem *ioaddr = vp_dev->ioaddr +
-   VIRTIO_PCI_CONFIG(vp_dev) + offset;
+   VIRTIO_PCI_CONFIG_OFF(vp_dev->msix_enabled) +
+   offset;
const u8 *ptr = buf;
int i;
 
-- 
2.7.4



Re: [PATCH 1/1] virtio: remove deprecated VIRTIO_PCI_CONFIG()

2018-12-07 Thread Dongli Zhang



On 12/08/2018 02:01 AM, Michael S. Tsirkin wrote:
> On Fri, Dec 07, 2018 at 03:34:41PM +0800, Dongli Zhang wrote:
>> VIRTIO_PCI_CONFIG() is deprecated. Use VIRTIO_PCI_CONFIG_OFF() instead.
>>
>> Signed-off-by: Dongli Zhang 
>> ---
>>  drivers/virtio/virtio_pci_legacy.c | 6 --
>>  include/uapi/linux/virtio_pci.h| 2 --
>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_pci_legacy.c 
>> b/drivers/virtio/virtio_pci_legacy.c
>> index de062fb..eff9ddc 100644
>> --- a/drivers/virtio/virtio_pci_legacy.c
>> +++ b/drivers/virtio/virtio_pci_legacy.c
>> @@ -52,7 +52,8 @@ static void vp_get(struct virtio_device *vdev, unsigned 
>> offset,
>>  {
>>  struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>  void __iomem *ioaddr = vp_dev->ioaddr +
>> -VIRTIO_PCI_CONFIG(vp_dev) + offset;
>> +VIRTIO_PCI_CONFIG_OFF(vp_dev->msix_enabled) +
>> +offset;
>>  u8 *ptr = buf;
>>  int i;
>>  
>> @@ -67,7 +68,8 @@ static void vp_set(struct virtio_device *vdev, unsigned 
>> offset,
>>  {
>>  struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>  void __iomem *ioaddr = vp_dev->ioaddr +
>> -VIRTIO_PCI_CONFIG(vp_dev) + offset;
>> +VIRTIO_PCI_CONFIG_OFF(vp_dev->msix_enabled) +
>> +offset;
>>  const u8 *ptr = buf;
>>  int i;
>>
> 
> I agree that VIRTIO_PCI_CONFIG_OFF is a better interface. So above looks
> fine.
>   
>> diff --git a/include/uapi/linux/virtio_pci.h 
>> b/include/uapi/linux/virtio_pci.h
>> index 90007a1..2070232 100644
>> --- a/include/uapi/linux/virtio_pci.h
>> +++ b/include/uapi/linux/virtio_pci.h
>> @@ -78,8 +78,6 @@
>>  /* The remaining space is defined by each driver as the per-driver
>>   * configuration space */
>>  #define VIRTIO_PCI_CONFIG_OFF(msix_enabled) ((msix_enabled) ? 24 : 20)
>> -/* Deprecated: please use VIRTIO_PCI_CONFIG_OFF instead */
>> -#define VIRTIO_PCI_CONFIG(dev)  
>> VIRTIO_PCI_CONFIG_OFF((dev)->msix_enabled)
>>  
>>  /* Virtio ABI version, this must match exactly */
>>  #define VIRTIO_PCI_ABI_VERSION  0
> 
> This might break some userspace builds, I don't see why we should bother
> removing it. Any reason?

Apologies.

I thought about compatibility issues for building third-party kernel modules
from userspace, but did not realize it would break other userspace software builds.

I will keep the definition of VIRTIO_PCI_CONFIG() and resend again.

Thank you very much!

Dongli Zhang

> 
> 
>> -- 
>> 2.7.4


[PATCH 1/1] virtio: remove deprecated VIRTIO_PCI_CONFIG()

2018-12-06 Thread Dongli Zhang
VIRTIO_PCI_CONFIG() is deprecated. Use VIRTIO_PCI_CONFIG_OFF() instead.

Signed-off-by: Dongli Zhang 
---
 drivers/virtio/virtio_pci_legacy.c | 6 --
 include/uapi/linux/virtio_pci.h| 2 --
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index de062fb..eff9ddc 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -52,7 +52,8 @@ static void vp_get(struct virtio_device *vdev, unsigned 
offset,
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
void __iomem *ioaddr = vp_dev->ioaddr +
-   VIRTIO_PCI_CONFIG(vp_dev) + offset;
+   VIRTIO_PCI_CONFIG_OFF(vp_dev->msix_enabled) +
+   offset;
u8 *ptr = buf;
int i;
 
@@ -67,7 +68,8 @@ static void vp_set(struct virtio_device *vdev, unsigned 
offset,
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
void __iomem *ioaddr = vp_dev->ioaddr +
-   VIRTIO_PCI_CONFIG(vp_dev) + offset;
+   VIRTIO_PCI_CONFIG_OFF(vp_dev->msix_enabled) +
+   offset;
const u8 *ptr = buf;
int i;
 
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1..2070232 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -78,8 +78,6 @@
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
 #define VIRTIO_PCI_CONFIG_OFF(msix_enabled)((msix_enabled) ? 24 : 20)
-/* Deprecated: please use VIRTIO_PCI_CONFIG_OFF instead */
-#define VIRTIO_PCI_CONFIG(dev) VIRTIO_PCI_CONFIG_OFF((dev)->msix_enabled)
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION 0
-- 
2.7.4



Re: [PATCH v9] virtio_blk: add discard and write zeroes support

2018-11-02 Thread Dongli Zhang
Hi Daniel,

Other than crosvm, is there any version of qemu (e.g., work-in-progress
repositories on github) where I can try this feature?

Thank you very much!

Dongli Zhang

On 11/02/2018 06:40 AM, Daniel Verkamp wrote:
> From: Changpeng Liu 
> 
> In commit 88c85538, "virtio-blk: add discard and write zeroes features
> to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
> block specification has been extended to add VIRTIO_BLK_T_DISCARD and
> VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
> discard and write zeroes in the virtio-blk driver when the device
> advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
> VIRTIO_BLK_F_WRITE_ZEROES.
> 
> Signed-off-by: Changpeng Liu 
> Signed-off-by: Daniel Verkamp 
> ---
> dverkamp: I've picked up this patch and made a few minor changes (as
> listed below); most notably, I changed the kmalloc back to GFP_ATOMIC,
> since it can be called from a context where sleeping is not allowed.
> To prevent large allocations, I've also clamped the maximum number of
> discard segments to 256; this results in a 4K allocation and should be
> plenty of descriptors for most use cases.
> 
> I also removed most of the description from the commit message, since it
> was duplicating the comments from virtio_blk.h and quoting parts of the
> spec without adding any extra information.  I have tested this iteration
> of the patch using crosvm with modifications to enable the new features:
> https://chromium.googlesource.com/chromiumos/platform/crosvm/
> 
> v9 fixes a number of review issues; I didn't attempt to optimize the
> single-element write zeroes case, so it still does an allocation per
> request (I did not see any easy place to put the payload that would
> avoid the allocation).
> 
> CHANGELOG:
> v9: [dverkamp] fix LE types in discard struct; cleanups from Ming Lei
> v8: [dverkamp] replace shifts by 9 with SECTOR_SHIFT constant
> v7: [dverkamp] use GFP_ATOMIC for allocation that may not sleep; clarify
> descriptor flags field; comment wording cleanups.
> v6: don't set T_OUT bit to discard and write zeroes commands.
> v5: use new block layer API: blk_queue_flag_set.
> v4: several optimizations based on MST's comments, remove bit field
> usage for command descriptor.
> v3: define the virtio-blk protocol to add discard and write zeroes
> support, first version implementation based on proposed specification.
> v2: add write zeroes command support.
> v1: initial proposal implementation for discard command.
> ---
>  drivers/block/virtio_blk.c  | 83 -
>  include/uapi/linux/virtio_blk.h | 54 +
>  2 files changed, 135 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 086c6bb12baa..0f39efb4b3aa 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -18,6 +18,7 @@
>  
>  #define PART_BITS 4
>  #define VQ_NAME_LEN 16
> +#define MAX_DISCARD_SEGMENTS 256u
>  
>  static int major;
>  static DEFINE_IDA(vd_index_ida);
> @@ -172,10 +173,48 @@ static int virtblk_add_req(struct virtqueue *vq, struct 
> virtblk_req *vbr,
>   return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
>  }
>  
> +static int virtblk_setup_discard_write_zeroes(struct request *req, bool 
> unmap)
> +{
> + unsigned short segments = blk_rq_nr_discard_segments(req);
> + unsigned short n = 0;
> + struct virtio_blk_discard_write_zeroes *range;
> + struct bio *bio;
> + u32 flags = 0;
> +
> + if (unmap)
> + flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
> +
> + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
> + if (!range)
> + return -ENOMEM;
> +
> + __rq_for_each_bio(bio, req) {
> + u64 sector = bio->bi_iter.bi_sector;
> + u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> +
> + range[n].flags = cpu_to_le32(flags);
> + range[n].num_sectors = cpu_to_le32(num_sectors);
> + range[n].sector = cpu_to_le64(sector);
> + n++;
> + }
> +
> + req->special_vec.bv_page = virt_to_page(range);
> + req->special_vec.bv_offset = offset_in_page(range);
> + req->special_vec.bv_len = sizeof(*range) * segments;
> + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> + return 0;
> +}
> +
>  static inline void virtblk_request_done(struct request *req)
>  {
>   struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
>  
> + if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
> + kfree(page_address(req->special_vec.