Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

2023-10-19 Thread Dongli Zhang
Hi Vitaly, Sean and David,

On 10/19/23 08:40, Sean Christopherson wrote:
> On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
>> Dongli Zhang  writes:
>>
>>> As mentioned in the linux kernel development document, "sched_clock() is
>>> used for scheduling and timestamping". While there is a default native
>>> implementation, many paravirtualizations have their own implementations.
>>>
>>> As for KVM, it uses kvm_sched_clock_read() and there is no way to
>>> disable only KVM's sched_clock. The "no-kvmclock" parameter disables all
>>> paravirtualized kvmclock features.
> 
> ...
> 
>>> Please suggest and comment on whether other options are better:
>>>
>>> 1. Global param (this RFC patch).
>>>
>>> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>>>
>>> Indeed I like the 2nd method.
>>>
>>> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>>>
>>> 4. Remove and clean up pv sched_clock, and always use pv_sched_clock() for
>>> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
>>> want to keep the pv sched_clock.
>>
>> Normally, it should be up to the hypervisor to tell the guest which
>> clock to use, i.e. if TSC is reliable or not. Let me put my question
>> this way: if TSC on the particular host is good for everything, why
>> does the hypervisor advertise 'kvmclock' to its guests?
> 
> I suspect there are two reasons.
> 
>   1. As is likely the case in our fleet, no one revisited the set of advertised
>  PV features when defining the VM shapes for a new generation of hardware,
>  or whoever did the reviews wasn't aware that advertising kvmclock is
>  actually suboptimal.  All the PV clock stuff in KVM is quite labyrinthian,
>  so it's not hard to imagine it getting overlooked.
> 
>   2. Legacy VMs.  If VMs have been running with a PV clock for years, forcing
>  them to switch to a new clocksource is high-risk, low-reward.
> 
>> If for some 'historical reasons' we can't revoke features we can always
>> introduce a new PV feature bit saying that TSC is preferred.
>>
>> 1) Global param doesn't sound like a good idea to me: chances are that
>> people will be setting it on their guest images to workaround problems
>> on one hypervisor (or, rather, on one public cloud which is too lazy to
>> fix their hypervisor) while simultaneously creating problems on another.
>>
>> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
>> as kvmclock, I'm not convinced it actually makes sense to separate the
>> two. Like if sched_clock is known to be bad but TSC is good, why do we
>> need to use PV clock at all? Having a parameter for debugging purposes
>> may be OK though...
>>
>> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
>> (HV_ACCESS_TSC_INVARIANT) and not the architectural
>> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the latter on
>> all hypervisors.
>>
>> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
>> 100% reliable. I can imagine cases when we can't detect that fact that
>> while synchronized across CPUs and not going backwards, it is, for
>> example, ticking with an unstable frequency and PV sched clock is
>> supposed to give the right correction (all of them are rdtsc() based
>> anyways, aren't they?).
> 
> Yeah, practically speaking, the only thing adding a knob to turn off using PV
> clocks for sched_clock will accomplish is creating an even bigger matrix of
> combinations that can cause problems, e.g. where guests end up using kvmclock
> timekeeping but not scheduling.
> 
> The explanation above and the links below fail to capture _the_ key point:
> Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the
> clocksource when the TSC is constant and nonstop (first spliced blob below).
> 
> What I suggested is that if the TSC is chosen over a PV clock as the
> clocksource, then we have the kernel also override the sched_clock selection
> (second spliced blob below).
> 
> That doesn't require the guest admin to opt-in, and doesn't create even more
> combinations to support.  It also provides for a smoother transition for when
> customers inevitably end up creating VMs on hosts that don't advertise
> kvmclock (or any PV clock).

I would prefer to always leave the guest admin the option to change the
decision, especially for diagnostic/workaround reasons (although the kvmclock
is always buggy when the tsc is buggy).

[PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

2023-10-18 Thread Dongli Zhang
As mentioned in the linux kernel development document, "sched_clock() is
used for scheduling and timestamping". While there is a default native
implementation, many paravirtualizations have their own implementations.

As for KVM, it uses kvm_sched_clock_read() and there is no way to
disable only KVM's sched_clock. The "no-kvmclock" parameter disables all
paravirtualized kvmclock features.

 94 static inline void kvm_sched_clock_init(bool stable)
 95 {
 96 if (!stable)
 97 clear_sched_clock_stable();
 98 kvm_sched_clock_offset = kvm_clock_read();
 99 paravirt_set_sched_clock(kvm_sched_clock_read);
100
101 pr_info("kvm-clock: using sched offset of %llu cycles",
102 kvm_sched_clock_offset);
103
 104 BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
 105 sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
106 }

There is a known issue that kvmclock may drift during vCPU hotplug [1].
Although a temporary fix is available [2], we may need a way to disable pv
sched_clock. Nowadays, the TSC is more stable and has less performance
overhead than kvmclock.
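
For context, the guest's clocksource (as opposed to its sched_clock) can
already be inspected and overridden via sysfs or the "clocksource=" boot
parameter; typical output on a KVM guest looks like the below. There is no
equivalent knob for the PV sched_clock, which is what this RFC adds.

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc acpi_pm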

This RFC proposes to introduce a global param to disable pv sched_clock
for all paravirtualizations.
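
With this patch applied, a hypothetical guest boot that disables the PV
sched_clock would pass the new parameter on the kernel command line, e.g.:

qemu-system-x86_64 ... \
-append "root=/dev/sda1 console=ttyS0 no_pv_sched_clock"

after which the pv_sched_clock static call keeps its default,
native_sched_clock() (TSC-based).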

Please suggest and comment on whether other options are better:

1. Global param (this RFC patch).

2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).

Indeed I like the 2nd method.

3. Enforce native sched_clock only when TSC is invariant (hyper-v method).

4. Remove and clean up pv sched_clock, and always use pv_sched_clock() for
all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
want to keep the pv sched_clock.

Introducing a param may also be easier to backport to old kernel versions.

References:
[1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zh...@oracle.com/
[2] https://lore.kernel.org/all/20231018195638.1898375-1-sea...@google.com/
[3] https://lore.kernel.org/all/20231002211651.ga3...@noisy.programming.kicks-ass.net/

Thank you very much for the suggestion!

Signed-off-by: Dongli Zhang 
---
 arch/x86/include/asm/paravirt.h |  2 +-
 arch/x86/kernel/kvmclock.c  | 12 +++-
 arch/x86/kernel/paravirt.c  | 18 +-
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12140ae..f36edf608b6b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
 DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
 DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void));
+int paravirt_set_sched_clock(u64 (*func)(void));
 
 static __always_inline u64 paravirt_sched_clock(void)
 {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..0b8bf5677d44 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
 
 static inline void kvm_sched_clock_init(bool stable)
 {
-   if (!stable)
-   clear_sched_clock_stable();
kvm_sched_clock_offset = kvm_clock_read();
-   paravirt_set_sched_clock(kvm_sched_clock_read);
 
-   pr_info("kvm-clock: using sched offset of %llu cycles",
-   kvm_sched_clock_offset);
+   if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
+   if (!stable)
+   clear_sched_clock_stable();
+
+   pr_info("kvm-clock: using sched offset of %llu cycles",
+   kvm_sched_clock_offset);
+   }
 
BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 97f1436c1a20..2cfef94317b0 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -118,9 +118,25 @@ static u64 native_steal_clock(int cpu)
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void))
+static bool no_pv_sched_clock;
+
+static int __init parse_no_pv_sched_clock(char *arg)
+{
+   no_pv_sched_clock = true;
+   return 0;
+}
+early_param("no_pv_sched_clock", parse_no_pv_sched_clock);
+
+int paravirt_set_sched_clock(u64 (*func)(void))
 {
+   if (no_pv_sched_clock) {
+   pr_info("sched_clock: not configurable\n");
+   return -EPERM;
+   }
+
static_call_update(pv_sched_clock, func);
+
+   return 0;
 }
 
 /* These are in entry.S */
-- 
2.34.1




[PATCH RFC v2 1/1] swiotlb: split buffer into 32-bit default and 64-bit extra zones

2022-08-20 Thread Dongli Zhang
Hello,

I used to send out RFC v1 to introduce an extra io_tlb_mem (created with
SWIOTLB_ANY) in addition to the default io_tlb_mem (32-bit).  The
dev->dma_io_tlb_mem is set to either default or the extra io_tlb_mem,
depending on the dma mask. However, that does not allow setting
dev->dma_io_tlb_mem transparently at the swiotlb layer, as suggested by
Christoph Hellwig.

https://lore.kernel.org/all/20220609005553.30954-1-dongli.zh...@oracle.com/

Therefore, this is another RFC v2 implementation following a different
direction. The core ideas are:

1. The swiotlb is split into two zones, io_tlb_mem->zone[0] (32-bit) and
io_tlb_mem->zone[1] (64-bit).

struct io_tlb_mem {
struct io_tlb_zone zone[SWIOTLB_NR];
struct dentry *debugfs;
bool late_alloc;
bool force_bounce;
bool for_alloc;
bool has_extra;
};

struct io_tlb_zone {
phys_addr_t start;
phys_addr_t end;
void *vaddr;
unsigned long nslabs;
unsigned long used;
unsigned int nareas;
unsigned int area_nslabs;
struct io_tlb_area *areas;
struct io_tlb_slot *slots;
};

2. By default, only io_tlb_mem->zone[0] is available. The
io_tlb_mem->zone[1] is allocated conditionally if:

- the "swiotlb=" is configured to allocate extra buffer, and
- the SWIOTLB_EXTRA is set in the flag (this is to make sure arch(s) other
  than x86/sev/xen will not enable it until it is fully tested by each
  arch, e.g., mips/powerpc). Currently it is enabled for x86 and xen.

3. During swiotlb map, whether zone[0] (32-bit) or zone[1] (64-bit
SWIOTLB_ANY) is used depends on min_not_zero(*dev->dma_mask, dev->bus_dma_limit).
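
A minimal sketch of that zone selection, for illustration only (the helper
name io_tlb_zone_get() and the exact has_extra test are assumptions; the
struct fields are the ones shown above):

static struct io_tlb_zone *io_tlb_zone_get(struct io_tlb_mem *mem,
					   struct device *dev)
{
	u64 dma_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);

	/* Use the extra 64-bit zone only when it was allocated and the
	 * device can address all of memory; otherwise bounce via the
	 * default 32-bit zone.
	 */
	if (mem->has_extra && dma_limit == DMA_BIT_MASK(64))
		return &mem->zone[1];

	return &mem->zone[0];
}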

To test the RFC v2, here is the QEMU command line.

qemu-system-x86_64 -smp 8 -m 32G -enable-kvm -vnc :5 -hda disk.img \
-kernel path-to-linux/arch/x86_64/boot/bzImage \
-append "root=/dev/sda1 init=/sbin/init text console=ttyS0 loglevel=7 
swiotlb=32768,4194304,force" \
-net nic -net user,hostfwd=tcp::5025-:22 \
-device nvme,drive=nvme01,serial=helloworld -drive file=test.qcow2,if=none,id=nvme01 \
-serial stdio

There is below in syslog. The extra 8GB buffer is allocated.

[0.152251] software IO TLB: area num 8.
... ...
[3.706088] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[3.707334] software IO TLB: mapped default [mem 0xbbfd7000-0xbffd7000] (64MB)
[3.708585] software IO TLB: mapped extra [mem 0x000000061cc00000-0x000000081cc00000] (8192MB)

After FIO traffic is triggered over NVMe, the 64-bit buffer is used.

$ cat /sys/kernel/debug/swiotlb/io_tlb_nslabs_extra
4194304
$ cat /sys/kernel/debug/swiotlb/io_tlb_used_extra
327552

Would you mind advising whether this is the right direction to go?

Thank you very much!

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/arm/xen/mm.c  |   2 +-
 arch/mips/pci/pci-octeon.c |   5 +-
 arch/x86/include/asm/xen/swiotlb-xen.h |   2 +-
 arch/x86/kernel/pci-dma.c  |   6 +-
 drivers/xen/swiotlb-xen.c  |  18 +-
 include/linux/swiotlb.h|  73 +++--
 kernel/dma/swiotlb.c   | 499 +
 7 files changed, 388 insertions(+), 217 deletions(-)

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index 3d826c0..4edfa42 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -125,7 +125,7 @@ static int __init xen_mm_init(void)
return 0;
 
/* we can work with the default swiotlb */
-   if (!io_tlb_default_mem.nslabs) {
+   if (!io_tlb_default_mem.zone[SWIOTLB_DF].nslabs) {
rc = swiotlb_init_late(swiotlb_size_or_default(),
   xen_swiotlb_gfp(), NULL);
if (rc < 0)
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index e457a18..0bf0859 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -654,6 +654,9 @@ static int __init octeon_pci_setup(void)
octeon_pci_mem_resource.end =
octeon_pci_mem_resource.start + (1ul << 30);
} else {
+   struct io_tlb_mem *mem = &io_tlb_default_mem;
+   struct io_tlb_zone *zone = &mem->zone[SWIOTLB_DF];
+
/* Remap the Octeon BAR 0 to map 128MB-(128MB+4KB) */
octeon_npi_write32(CVMX_NPI_PCI_CFG04, 128ul << 20);
octeon_npi_write32(CVMX_NPI_PCI_CFG05, 0);
@@ -664,7 +667,7 @@ static int __init octeon_pci_setup(void)
 
/* BAR1 movable regions contiguous to cover the swiotlb */
octeon_bar1_pci_phys =
-   io_tlb_default_mem.start & ~((1ull << 22) - 1);
+   zone->start & ~((1ull << 22) - 1);
 
for (index = 0; index < 32; index++) {
union cvmx_pci_bar1_indexx bar1_index;
diff --git a/arch/x86/includ

Re: [PATCH RFC v1 4/7] swiotlb: to implement io_tlb_high_mem

2022-06-10 Thread Dongli Zhang
Hi Christoph,

On 6/8/22 10:05 PM, Christoph Hellwig wrote:
> All this really needs to be hidden under the hood.
> 

Since this patch has 200+ lines, would you please help clarify what
'this' indicates?

The idea of this patch:

1. Convert the functions that initialize the swiotlb into callees. Each callee
accepts 'true' or 'false' as an argument to indicate whether it is for the
default or the new swiotlb buffer, e.g., swiotlb_init_remap() into
__swiotlb_init_remap().

2. At the caller side, decide whether to call the callee to create the
extra buffer.

Do you mean the callee is still *too high level* and all the
decision/allocation/initialization should be done in *lower-level functions*?

E.g., if I re-work the "struct io_tlb_mem" to make the 64-bit buffer the 2nd
array, as in io_tlb_mem->slots[32_or_64][index], the user will only see the
default 'io_tlb_default_mem', and will not be able to tell whether an
allocation came from the 32-bit or the 64-bit buffer.
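
A minimal sketch of that alternative layout, for illustration only (every
name below is an assumption, not code from this patchset):

/* One io_tlb_mem whose slot bookkeeping is split per zone. */
enum io_tlb_zone_idx {
	IO_TLB_ZONE_32 = 0,	/* default 32-bit buffer */
	IO_TLB_ZONE_64 = 1,	/* extra SWIOTLB_ANY buffer */
	IO_TLB_ZONE_NR
};

struct io_tlb_mem {
	phys_addr_t start[IO_TLB_ZONE_NR];
	phys_addr_t end[IO_TLB_ZONE_NR];
	unsigned long nslabs[IO_TLB_ZONE_NR];
	struct io_tlb_slot *slots[IO_TLB_ZONE_NR];
};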

Thank you very much for the feedback.

Dongli Zhang



[PATCH RFC v1 0/7] swiotlb: extra 64-bit buffer for dev->dma_io_tlb_mem

2022-06-08 Thread Dongli Zhang
Hello,

I used to send out a patchset on 64-bit buffer and people thought it was
the same as Restricted DMA. However, the 64-bit buffer is still not supported.

https://lore.kernel.org/all/20210203233709.19819-1-dongli.zh...@oracle.com/

This RFC is to introduce the extra swiotlb buffer with SWIOTLB_ANY flag,
to support 64-bit swiotlb.

The core ideas are:

1. Create an extra io_tlb_mem with SWIOTLB_ANY flags.

2. The dev->dma_io_tlb_mem is set to either default or the extra io_tlb_mem,
   depending on dma mask.


Would you please help with the below questions about the RFC?

- Is it fine to create the extra io_tlb_mem?

- Which one is better: to create a separate variable for the extra
  io_tlb_mem, or make it an array of two io_tlb_mem?

- Should I set dev->dma_io_tlb_mem in each driver (e.g., the virtio driver
  as in this patchset) based on the value of
  min_not_zero(*dev->dma_mask, dev->bus_dma_limit), or at a higher level
  (e.g., post pci driver)?


This patchset is to demonstrate that the idea works. Since this is just an
RFC, I have only tested virtio-blk on qemu-7.0 by enforcing swiotlb. It is
not tested in an AMD SEV environment.

qemu-system-x86_64 -cpu host -name debug-threads=on \
-smp 8 -m 16G -machine q35,accel=kvm -vnc :5 -hda boot.img \
-kernel mainline-linux/arch/x86_64/boot/bzImage \
-append "root=/dev/sda1 init=/sbin/init text console=ttyS0 loglevel=7 
swiotlb=327680,3145728,force" \
-device virtio-blk-pci,id=vblk0,num-queues=8,drive=drive0,disable-legacy=on,iommu_platform=true \
-drive file=test.raw,if=none,id=drive0,cache=none \
-net nic -net user,hostfwd=tcp::5025-:22 -serial stdio


The kernel command line "swiotlb=327680,3145728,force" is to allocate 6GB for
the extra swiotlb.

[2.826676] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[2.826693] software IO TLB: default mapped [mem 0x37000000-0x5f000000] (640MB)
[2.826697] software IO TLB: high mapped [mem 0x00000002edc80000-0x000000046dc80000] (6144MB)

The highmem swiotlb is being used by virtio-blk.

$ cat /sys/kernel/debug/swiotlb/swiotlb-hi/io_tlb_nslabs 
3145728
$ cat /sys/kernel/debug/swiotlb/swiotlb-hi/io_tlb_used 
8960


Dongli Zhang (7):
  swiotlb: introduce the highmem swiotlb buffer
  swiotlb: change the signature of remap function
  swiotlb-xen: support highmem for xen specific code
  swiotlb: to implement io_tlb_high_mem
  swiotlb: add interface to set dev->dma_io_tlb_mem
  virtio: use io_tlb_high_mem if it is active
  swiotlb: fix the slot_addr() overflow

arch/powerpc/kernel/dma-swiotlb.c  |   8 +-
arch/x86/include/asm/xen/swiotlb-xen.h |   2 +-
arch/x86/kernel/pci-dma.c  |   5 +-
drivers/virtio/virtio.c|   8 ++
drivers/xen/swiotlb-xen.c  |  16 +++-
include/linux/swiotlb.h|  14 ++-
kernel/dma/swiotlb.c   | 136 +---
7 files changed, 145 insertions(+), 44 deletions(-)

Thank you very much for feedback and suggestion!

Dongli Zhang





[PATCH RFC v1 1/7] swiotlb: introduce the highmem swiotlb buffer

2022-06-08 Thread Dongli Zhang
Currently, the virtio driver is not able to use 4+ GB memory when the
swiotlb is enforced, e.g., when AMD SEV is involved.

Fortunately, the SWIOTLB_ANY flag has been introduced since
commit 8ba2ed1be90f ("swiotlb: add a SWIOTLB_ANY flag to lift the low
memory restriction") to allocate swiotlb buffer from high memory.

While the default swiotlb is 'io_tlb_default_mem', the extra
'io_tlb_high_mem' is introduced to allocate with SWIOTLB_ANY flag in the
future patches. E.g., the user may configure the extra highmem swiotlb
buffer via "swiotlb=327680,4194304" to allocate 8GB memory.

In the future, the driver will be able to decide whether to use
'io_tlb_default_mem' or 'io_tlb_high_mem'.

The highmem swiotlb is enabled by the user via the kernel command line. It
is actively used if swiotlb_high_active() returns true.

The kernel command line "swiotlb=32768,3145728,force" is to allocate 64MB
for default swiotlb, and 6GB for the extra highmem swiotlb.
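
(Each swiotlb slab is 1 << IO_TLB_SHIFT = 2KB, so 32768 slabs = 64MB,
3145728 slabs = 6GB, and the 4194304 slabs above = 8GB.)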

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 include/linux/swiotlb.h |  2 ++
 kernel/dma/swiotlb.c| 16 
 2 files changed, 18 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..e67e605af2dd 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -109,6 +109,7 @@ struct io_tlb_mem {
} *slots;
 };
 extern struct io_tlb_mem io_tlb_default_mem;
+extern struct io_tlb_mem io_tlb_high_mem;
 
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
@@ -164,6 +165,7 @@ static inline void swiotlb_adjust_size(unsigned long size)
 }
 #endif /* CONFIG_SWIOTLB */
 
+extern bool swiotlb_high_active(void);
 extern void swiotlb_print_info(void);
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index cb50f8d38360..569bc30e7b7a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -66,10 +66,12 @@ static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
 struct io_tlb_mem io_tlb_default_mem;
+struct io_tlb_mem io_tlb_high_mem;
 
 phys_addr_t swiotlb_unencrypted_base;
 
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
+static unsigned long high_nslabs;
 
 static int __init
 setup_io_tlb_npages(char *str)
@@ -81,6 +83,15 @@ setup_io_tlb_npages(char *str)
}
if (*str == ',')
++str;
+
+   if (isdigit(*str)) {
+   /* avoid tail segment of size < IO_TLB_SEGSIZE */
+   high_nslabs =
+   ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
+   }
+   if (*str == ',')
+   ++str;
+
if (!strcmp(str, "force"))
swiotlb_force_bounce = true;
else if (!strcmp(str, "noforce"))
@@ -90,6 +101,11 @@ setup_io_tlb_npages(char *str)
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
+bool swiotlb_high_active(void)
+{
+   return high_nslabs && io_tlb_high_mem.nslabs;
+}
+
 unsigned int swiotlb_max_segment(void)
 {
if (!io_tlb_default_mem.nslabs)
-- 
2.17.1




[PATCH RFC v1 7/7] swiotlb: fix the slot_addr() overflow

2022-06-08 Thread Dongli Zhang
Since the type of the swiotlb slot index is a signed integer, the
"((idx) << IO_TLB_SHIFT)" expression can return an incorrect value. As a
result, slot_addr() returns a value which is smaller than the expected one.

E.g., the 'tlb_addr' generated in swiotlb_tbl_map_single() may be smaller
than the expected one. As a result, swiotlb_bounce() will access a wrong
swiotlb slot.
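
A worked example of the overflow (IO_TLB_SHIFT = 11, i.e. 2KB slabs, is the
in-tree value): for any idx >= 1 << 20, the 32-bit signed result of
(idx << 11) exceeds INT_MAX, and the overflowed, sign-extended offset added
to 'start' yields a tlb_addr below the intended slot. The 6GB highmem
buffer above has 3145728 (3 << 20) slabs, so such indexes do occur.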

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 kernel/dma/swiotlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 0dcdd25ea95d..c64e557de55c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -531,7 +531,8 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
}
 }
 
-#define slot_addr(start, idx)  ((start) + ((idx) << IO_TLB_SHIFT))
+#define slot_addr(start, idx)  ((start) + \
+   (((unsigned long)idx) << IO_TLB_SHIFT))
 
 /*
 * Carefully handle integer overflow which can occur when boundary_mask == ~0UL.
-- 
2.17.1




[PATCH RFC v1 3/7] swiotlb-xen: support highmem for xen specific code

2022-06-08 Thread Dongli Zhang
While most of the time swiotlb-xen relies on the generic swiotlb API
to initialize and use the swiotlb, this patch supports highmem swiotlb
in the swiotlb-xen specific code.

E.g., the xen_swiotlb_fixup() may request the hypervisor to provide
64-bit memory pages as swiotlb buffer.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 drivers/xen/swiotlb-xen.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 339f46e21053..d15321e9f9db 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -38,7 +38,8 @@
 #include 
 
 #include 
-#define MAX_DMA_BITS 32
+#define MAX_DMA32_BITS 32
+#define MAX_DMA64_BITS 64
 
 /*
  * Quick lookup value of the bus address of the IOTLB.
@@ -109,19 +110,25 @@ int xen_swiotlb_fixup(void *buf, unsigned long nslabs, bool high)
int rc;
unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
unsigned int i, dma_bits = order + PAGE_SHIFT;
+   unsigned int max_dma_bits = MAX_DMA32_BITS;
dma_addr_t dma_handle;
phys_addr_t p = virt_to_phys(buf);
 
BUILD_BUG_ON(IO_TLB_SEGSIZE & (IO_TLB_SEGSIZE - 1));
BUG_ON(nslabs % IO_TLB_SEGSIZE);
 
+   if (high) {
+   dma_bits = MAX_DMA64_BITS;
+   max_dma_bits = MAX_DMA64_BITS;
+   }
+
i = 0;
do {
do {
rc = xen_create_contiguous_region(
p + (i << IO_TLB_SHIFT), order,
dma_bits, &dma_handle);
-   } while (rc && dma_bits++ < MAX_DMA_BITS);
+   } while (rc && dma_bits++ < max_dma_bits);
if (rc)
return rc;
 
@@ -381,7 +388,8 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 static int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-   return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask;
+   return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask ||
+  xen_phys_to_dma(hwdev, io_tlb_high_mem.end - 1) <= mask;
 }
 
 const struct dma_map_ops xen_swiotlb_dma_ops = {
-- 
2.17.1




[PATCH RFC v1 4/7] swiotlb: to implement io_tlb_high_mem

2022-06-08 Thread Dongli Zhang
This patch is to implement the extra 'io_tlb_high_mem'. In the future, the
device drivers may choose to use either 'io_tlb_default_mem' or
'io_tlb_high_mem' as dev->dma_io_tlb_mem.

The highmem buffer is regarded as active if
(high_nslabs && io_tlb_high_mem.nslabs) returns true.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/powerpc/kernel/dma-swiotlb.c |   8 ++-
 arch/x86/kernel/pci-dma.c |   5 +-
 include/linux/swiotlb.h   |   2 +-
 kernel/dma/swiotlb.c  | 103 +-
 4 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
index ba256c37bcc0..f18694881264 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -20,9 +20,11 @@ void __init swiotlb_detect_4g(void)
 
 static int __init check_swiotlb_enabled(void)
 {
-   if (ppc_swiotlb_enable)
-   swiotlb_print_info();
-   else
+   if (ppc_swiotlb_enable) {
+   swiotlb_print_info(false);
+   if (swiotlb_high_active())
+   swiotlb_print_info(true);
+   } else
swiotlb_exit();
 
return 0;
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 30bbe4abb5d6..1504b349b312 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -196,7 +196,10 @@ static int __init pci_iommu_init(void)
/* An IOMMU turned us off. */
if (x86_swiotlb_enable) {
pr_info("PCI-DMA: Using software bounce buffering for IO 
(SWIOTLB)\n");
-   swiotlb_print_info();
+
+   swiotlb_print_info(false);
+   if (swiotlb_high_active())
+   swiotlb_print_info(true);
} else {
swiotlb_exit();
}
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e61c074c55eb..8196bf961aab 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -166,7 +166,7 @@ static inline void swiotlb_adjust_size(unsigned long size)
 #endif /* CONFIG_SWIOTLB */
 
 extern bool swiotlb_high_active(void);
-extern void swiotlb_print_info(void);
+extern void swiotlb_print_info(bool high);
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
 struct page *swiotlb_alloc(struct device *dev, size_t size);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 793ca7f9..ff82b281ce01 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -101,6 +101,21 @@ setup_io_tlb_npages(char *str)
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
+static struct io_tlb_mem *io_tlb_mem_get(bool high)
+{
+   return high ? &io_tlb_high_mem : &io_tlb_default_mem;
+}
+
+static unsigned long nslabs_get(bool high)
+{
+   return high ? high_nslabs : default_nslabs;
+}
+
+static char *swiotlb_name_get(bool high)
+{
+   return high ? "high" : "default";
+}
+
 bool swiotlb_high_active(void)
 {
return high_nslabs && io_tlb_high_mem.nslabs;
@@ -133,17 +148,18 @@ void __init swiotlb_adjust_size(unsigned long size)
pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
-void swiotlb_print_info(void)
+void swiotlb_print_info(bool high)
 {
-   struct io_tlb_mem *mem = &io_tlb_default_mem;
+   struct io_tlb_mem *mem = io_tlb_mem_get(high);
 
if (!mem->nslabs) {
pr_warn("No low mem\n");
return;
}
 
-   pr_info("mapped [mem %pa-%pa] (%luMB)\n", >start, >end,
-  (mem->nslabs << IO_TLB_SHIFT) >> 20);
+   pr_info("%s mapped [mem %pa-%pa] (%luMB)\n",
+   swiotlb_name_get(high), &mem->start, &mem->end,
+   (mem->nslabs << IO_TLB_SHIFT) >> 20);
 }
 
 static inline unsigned long io_tlb_offset(unsigned long val)
@@ -184,15 +200,9 @@ static void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
 }
 #endif
 
-/*
- * Early SWIOTLB allocation may be too early to allow an architecture to
- * perform the desired operations.  This function allows the architecture to
- * call SWIOTLB when the operations are possible.  It needs to be called
- * before the SWIOTLB memory is used.
- */
-void __init swiotlb_update_mem_attributes(void)
+static void __init __swiotlb_update_mem_attributes(bool high)
 {
-   struct io_tlb_mem *mem = &io_tlb_default_mem;
+   struct io_tlb_mem *mem = io_tlb_mem_get(high);
void *vaddr;
unsigned long bytes;
 
@@ -207,6 +217,19 @@ void __init swiotlb_update_mem_attributes(void)
mem->vaddr = vaddr;
 }
 
+/*
+ * Early SWIOTLB allocation may be too early to allow an architecture to
+ * perform the desired operations.  This function allows the architecture to
+ * call SWIOTLB when the operations are possible.  It needs to be called
+ * before the SWIOTLB memory is used.

[PATCH RFC v1 5/7] swiotlb: add interface to set dev->dma_io_tlb_mem

2022-06-08 Thread Dongli Zhang
The interface re-configures dev->dma_io_tlb_mem conditionally, if the
'io_tlb_high_mem' is active.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 include/linux/swiotlb.h |  6 ++
 kernel/dma/swiotlb.c| 10 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8196bf961aab..78217d8bbee2 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -131,6 +131,7 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+bool swiotlb_use_high(struct device *dev);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -163,6 +164,11 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+
+static inline bool swiotlb_use_high(struct device *dev)
+{
+   return false;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern bool swiotlb_high_active(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ff82b281ce01..0dcdd25ea95d 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -121,6 +121,16 @@ bool swiotlb_high_active(void)
return high_nslabs && io_tlb_high_mem.nslabs;
 }
 
+bool swiotlb_use_high(struct device *dev)
+{
+   if (!swiotlb_high_active())
+   return false;
+
+   dev->dma_io_tlb_mem = &io_tlb_high_mem;
+   return true;
+}
+EXPORT_SYMBOL_GPL(swiotlb_use_high);
+
 unsigned int swiotlb_max_segment(void)
 {
if (!io_tlb_default_mem.nslabs)
-- 
2.17.1




[PATCH RFC v1 2/7] swiotlb: change the signature of remap function

2022-06-08 Thread Dongli Zhang
Add a new argument 'high' to the remap function, so that it will be able
to remap the swiotlb buffer based on whether the swiotlb is 32-bit or
64-bit.

Currently the only remap function is xen_swiotlb_fixup().

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/x86/include/asm/xen/swiotlb-xen.h | 2 +-
 drivers/xen/swiotlb-xen.c  | 2 +-
 include/linux/swiotlb.h| 4 ++--
 kernel/dma/swiotlb.c   | 8 
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/xen/swiotlb-xen.h b/arch/x86/include/asm/xen/swiotlb-xen.h
index 77a2d19cc990..a54eae15605e 100644
--- a/arch/x86/include/asm/xen/swiotlb-xen.h
+++ b/arch/x86/include/asm/xen/swiotlb-xen.h
@@ -8,7 +8,7 @@ extern int pci_xen_swiotlb_init_late(void);
 static inline int pci_xen_swiotlb_init_late(void) { return -ENXIO; }
 #endif
 
-int xen_swiotlb_fixup(void *buf, unsigned long nslabs);
+int xen_swiotlb_fixup(void *buf, unsigned long nslabs, bool high);
 int xen_create_contiguous_region(phys_addr_t pstart, unsigned int order,
unsigned int address_bits,
dma_addr_t *dma_handle);
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..339f46e21053 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -104,7 +104,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 }
 
 #ifdef CONFIG_X86
-int xen_swiotlb_fixup(void *buf, unsigned long nslabs)
+int xen_swiotlb_fixup(void *buf, unsigned long nslabs, bool high)
 {
int rc;
unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e67e605af2dd..e61c074c55eb 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -36,9 +36,9 @@ struct scatterlist;
 
 unsigned long swiotlb_size_or_default(void);
 void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
-   int (*remap)(void *tlb, unsigned long nslabs));
+   int (*remap)(void *tlb, unsigned long nslabs, bool high));
 int swiotlb_init_late(size_t size, gfp_t gfp_mask,
-   int (*remap)(void *tlb, unsigned long nslabs));
+   int (*remap)(void *tlb, unsigned long nslabs, bool high));
 extern void __init swiotlb_update_mem_attributes(void);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 569bc30e7b7a..793ca7f9 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -245,7 +245,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
  * structures for the software IO TLB used to implement the DMA API.
  */
 void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
-   int (*remap)(void *tlb, unsigned long nslabs))
+   int (*remap)(void *tlb, unsigned long nslabs, bool high))
 {
struct io_tlb_mem *mem = &io_tlb_default_mem;
unsigned long nslabs = default_nslabs;
@@ -274,7 +274,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
return;
}
 
-   if (remap && remap(tlb, nslabs) < 0) {
+   if (remap && remap(tlb, nslabs, false) < 0) {
memblock_free(tlb, PAGE_ALIGN(bytes));
 
nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
@@ -307,7 +307,7 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
  * This should be just like above, but with some error catching.
  */
 int swiotlb_init_late(size_t size, gfp_t gfp_mask,
-   int (*remap)(void *tlb, unsigned long nslabs))
+   int (*remap)(void *tlb, unsigned long nslabs, bool high))
 {
struct io_tlb_mem *mem = &io_tlb_default_mem;
unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
@@ -337,7 +337,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
return -ENOMEM;
 
if (remap)
-   rc = remap(vstart, nslabs);
+   rc = remap(vstart, nslabs, false);
if (rc) {
free_pages((unsigned long)vstart, order);
 
-- 
2.17.1




[PATCH RFC v1 6/7] virtio: use io_tlb_high_mem if it is active

2022-06-08 Thread Dongli Zhang
When the swiotlb is enforced (e.g., when AMD SEV is involved), the virtio
driver will not be able to use 4+ GB memory. Therefore, the virtio driver
uses 'io_tlb_high_mem' as its swiotlb.

Cc: Konrad Wilk 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 drivers/virtio/virtio.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index ef04a96942bf..d9ebe3940e2d 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 /* Unique numbering for virtio devices. */
@@ -241,6 +243,12 @@ static int virtio_dev_probe(struct device *_d)
u64 device_features;
u64 driver_features;
u64 driver_features_legacy;
+   struct device *parent = dev->dev.parent;
+   u64 dma_mask = min_not_zero(*parent->dma_mask,
+   parent->bus_dma_limit);
+
+   if (dma_mask == DMA_BIT_MASK(64))
+   swiotlb_use_high(parent);
 
/* We have a driver! */
virtio_add_status(dev, VIRTIO_CONFIG_S_DRIVER);
-- 
2.17.1




Re: [PATCH 12/15] swiotlb: provide swiotlb_init variants that remap the buffer

2022-04-04 Thread Dongli Zhang
> int swiotlb_init_late(size_t size, gfp_t gfp_mask, int (*remap)(void *tlb, unsigned long nslabs))
>  {
>   unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
>   unsigned long bytes;
> @@ -303,6 +322,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask)
>   if (swiotlb_force_disable)
>   return 0;
>  
> +retry:
>   order = get_order(nslabs << IO_TLB_SHIFT);
>   nslabs = SLABS_PER_PAGE << order;
>   bytes = nslabs << IO_TLB_SHIFT;
> @@ -323,6 +343,16 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask)
>   (PAGE_SIZE << order) >> 20);
>   nslabs = SLABS_PER_PAGE << order;
>   }
> + if (remap)
> + rc = remap(vstart, nslabs);
> + if (rc) {
> + free_pages((unsigned long)vstart, order);
> + 

"warning: 1 line adds whitespace errors." above when I was applying the patch
for test.

Dongli Zhang

> + nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
> + if (nslabs < IO_TLB_MIN_SLABS)
> + return rc;
> + goto retry;
> + }
>   rc = swiotlb_late_init_with_tbl(vstart, nslabs);
>   if (rc)
>   free_pages((unsigned long)vstart, order);
> 



[PATCH 1/1] xen/blkfront: fix comment for need_copy

2022-03-17 Thread Dongli Zhang
The 'need_copy' is set when rq_data_dir(req) returns WRITE, in order to
copy the written data to the persistent page.

".need_copy = rq_data_dir(req) && info->feature_persistent,"

Signed-off-by: Dongli Zhang 
---
 drivers/block/xen-blkfront.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 03b5fb341e58..dbc32d0a4b1a 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -576,7 +576,7 @@ struct setup_rw_req {
struct blkif_request *ring_req;
grant_ref_t gref_head;
unsigned int id;
-   /* Only used when persistent grant is used and it's a read request */
+   /* Only used when persistent grant is used and it's a write request */
bool need_copy;
unsigned int bvec_off;
char *bvec_data;
-- 
2.17.1




Re: [PATCH 10/12] swiotlb: add a SWIOTLB_ANY flag to lift the low memory restriction

2022-03-04 Thread Dongli Zhang
Hi Michael,

On 3/4/22 10:12 AM, Michael Kelley (LINUX) wrote:
> From: Christoph Hellwig  Sent: Tuesday, March 1, 2022 2:53 AM
>>
>> Power SVM wants to allocate a swiotlb buffer that is not restricted to low
>> memory for the trusted hypervisor scheme.  Consolidate the support for this
>> into the swiotlb_init interface by adding a new flag.
> 
> Hyper-V Isolated VMs want to do the same thing of not restricting the swiotlb
> buffer to low memory.  That's what Tianyu Lan's patch set[1] is proposing.
> Hyper-V synthetic devices have no DMA addressing limitations, and the
> likelihood of using a PCI pass-thru device with addressing limitations in an
> Isolated VM seems vanishingly small.
> 
> So could use of the SWIOTLB_ANY flag be generalized?  Let Hyper-V init
> code set the flag before swiotlb_init() is called.  Or provide a CONFIG
> variable that Hyper-V Isolated VMs could set.

I used to send out a 64-bit swiotlb patchset, but at that time people thought
it was the same as the Restricted DMA patchset.

https://lore.kernel.org/all/20210203233709.19819-1-dongli.zh...@oracle.com/

However, I do not think the Restricted DMA patchset is going to support
64-bit (or high memory) DMA. Is this what you are looking for?

Dongli Zhang

> 
> Michael
> 
> [1] https://urldefense.com/v3/__https://lore.kernel.org/lkml/20220209122302.213882-1-ltyker...@gmail.com/__;!!ACWV5N9M2RV99hQ!fUx4fMgdQIrqJDDy-pbv9xMeyHX0rC6iN8176LWjylI2_lsjy03gysm0-lAbV1Yb7_g$
>  
> 
>>
>> Signed-off-by: Christoph Hellwig 
>> ---
>>  arch/powerpc/include/asm/svm.h   |  4 
>>  arch/powerpc/include/asm/swiotlb.h   |  1 +
>>  arch/powerpc/kernel/dma-swiotlb.c|  1 +
>>  arch/powerpc/mm/mem.c|  5 +
>>  arch/powerpc/platforms/pseries/svm.c | 26 +-
>>  include/linux/swiotlb.h  |  1 +
>>  kernel/dma/swiotlb.c |  9 +++--
>>  7 files changed, 12 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
>> index 7546402d796af..85580b30aba48 100644
>> --- a/arch/powerpc/include/asm/svm.h
>> +++ b/arch/powerpc/include/asm/svm.h
>> @@ -15,8 +15,6 @@ static inline bool is_secure_guest(void)
>>  return mfmsr() & MSR_S;
>>  }
>>
>> -void __init svm_swiotlb_init(void);
>> -
>>  void dtl_cache_ctor(void *addr);
>>  #define get_dtl_cache_ctor()(is_secure_guest() ? dtl_cache_ctor : NULL)
>>
>> @@ -27,8 +25,6 @@ static inline bool is_secure_guest(void)
>>  return false;
>>  }
>>
>> -static inline void svm_swiotlb_init(void) {}
>> -
>>  #define get_dtl_cache_ctor() NULL
>>
>>  #endif /* CONFIG_PPC_SVM */
>> diff --git a/arch/powerpc/include/asm/swiotlb.h b/arch/powerpc/include/asm/swiotlb.h
>> index 3c1a1cd161286..4203b5e0a88ed 100644
>> --- a/arch/powerpc/include/asm/swiotlb.h
>> +++ b/arch/powerpc/include/asm/swiotlb.h
>> @@ -9,6 +9,7 @@
>>  #include 
>>
>>  extern unsigned int ppc_swiotlb_enable;
>> +extern unsigned int ppc_swiotlb_flags;
>>
>>  #ifdef CONFIG_SWIOTLB
>>  void swiotlb_detect_4g(void);
>> diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
>> index fc7816126a401..ba256c37bcc0f 100644
>> --- a/arch/powerpc/kernel/dma-swiotlb.c
>> +++ b/arch/powerpc/kernel/dma-swiotlb.c
>> @@ -10,6 +10,7 @@
>>  #include 
>>
>>  unsigned int ppc_swiotlb_enable;
>> +unsigned int ppc_swiotlb_flags;
>>
>>  void __init swiotlb_detect_4g(void)
>>  {
>> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index e1519e2edc656..a4d65418c30a9 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -249,10 +249,7 @@ void __init mem_init(void)
>>   * back to to-down.
>>   */
>>  memblock_set_bottom_up(true);
>> -if (is_secure_guest())
>> -svm_swiotlb_init();
>> -else
>> -swiotlb_init(ppc_swiotlb_enable, 0);
>> +swiotlb_init(ppc_swiotlb_enable, ppc_swiotlb_flags);
>>  #endif
>>
>>  high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
>> diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
>> index c5228f4969eb2..3b4045d508ec8 100644
>> --- a/arch/powerpc/platforms/pseries/svm.c
>> +++ b/arch/powerpc/platforms/pseries/svm.c
>> @@ -28,7 +28,7 @@ static int __init init_svm(void)
>>   * need to use the SWIOTLB buffer for DMA even if dma_capable() says
>>   * otherwise.
>>

Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

2022-03-02 Thread Dongli Zhang
Hi Boris,

On 3/2/22 6:11 PM, Boris Ostrovsky wrote:
> 
> On 3/2/22 7:31 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
>>> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>>>>    void __init xen_hvm_init_time_ops(void)
>>>>    {
>>>> +    static bool hvm_time_initialized;
>>>> +
>>>> +    if (hvm_time_initialized)
>>>> +    return;
>>>> +
>>>>    /*
>>>>     * vector callback is needed otherwise we cannot receive interrupts
>>>>     * on cpu > 0 and at this point we don't know how many cpus are
>>>>     * available.
>>>>     */
>>>>    if (!xen_have_vector_callback)
>>>> -    return;
>>>> +    goto exit;
>>>
>>> Why not just return? Do we expect the value of xen_have_vector_callback to
>>> change?
>> I just want to keep above sync with 
>>
>>>
>>> -boris
>>>
>>>
>>>>      if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>>>    pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>>>> +    goto exit;
>>>> +    }
>> ... here.
>>
>> That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
>> once. Both of above two if statements will "go to exit".
> 
> 
> I didn't notice this actually.
> 
> 
> I think both of them should return early, there is no reason to set
> hvm_time_initialized to true when, in fact, we have not initialized anything.
> And to avoid printing the warning twice we can just replace it with 
> pr_info_once().
> 
> 
> I can fix it up when committing so no need to resend. So unless you disagree

Thank you very much for fixing it during committing.

Dongli Zhang



Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

2022-03-02 Thread Dongli Zhang
Hi Boris,

On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
> 
> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>>   void __init xen_hvm_init_time_ops(void)
>>   {
>> +    static bool hvm_time_initialized;
>> +
>> +    if (hvm_time_initialized)
>> +    return;
>> +
>>   /*
>>    * vector callback is needed otherwise we cannot receive interrupts
>>    * on cpu > 0 and at this point we don't know how many cpus are
>>    * available.
>>    */
>>   if (!xen_have_vector_callback)
>> -    return;
>> +    goto exit;
> 
> 
> Why not just return? Do we expect the value of xen_have_vector_callback to 
> change?

I just want to keep above sync with 

> 
> 
> -boris
> 
> 
>>     if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>   pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>> +    goto exit;
>> +    }

... here.

That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
once. Both of above two if statements will "go to exit".

Thank you very much!

Dongli Zhang

>> +
>> +    /*
>> + * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
>> + * The __this_cpu_read(xen_vcpu) is still NULL when the Xen HVM guest
>> + * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec). Accessing
>> + * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
>> + *
>> + * The xen_hvm_init_time_ops() should be called again later after
>> + * __this_cpu_read(xen_vcpu) is available.
>> + */
>> +    if (!__this_cpu_read(xen_vcpu)) {
>> +    pr_info("Delay xen_init_time_common() as kernel is running on
>> vcpu=%d\n",
>> +    xen_vcpu_nr(0));
>>   return;
>>   }
>>   @@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
>>   x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>>     x86_platform.set_wallclock = xen_set_wallclock;
>> +
>> +exit:
>> +    hvm_time_initialized = true;
>>   }
>>   #endif
>>   



[PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

2022-03-02 Thread Dongli Zhang
The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). In addition, with commit
38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
kernel in Xen HVM guest may panic at very early stage when accessing
&__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
 -> x86_init.hyper.init_platform = xen_hvm_guest_init()
 -> xen_hvm_init_time_ops()
 -> xen_clocksource_read()
 -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during the early stage, until xen_vcpu_setup()
is used to allocate/relocate the 'vcpu_info' for the boot cpu at an
arbitrary address.

However, when a Xen HVM guest panics on a vcpu >= 32, xen_vcpu_info_reset(0)
sets per_cpu(xen_vcpu, 0) = NULL, so xen_clocksource_read() in the kdump
kernel panics.
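
For reference, this is the relevant logic, paraphrased (and simplified) from
xen_vcpu_info_reset() in arch/x86/xen/enlighten.c:

void xen_vcpu_info_reset(int cpu)
{
	if (xen_vcpu_nr(cpu) < MAX_VIRT_CPUS) {
		per_cpu(xen_vcpu, cpu) =
			&HYPERVISOR_shared_info->vcpu_info[xen_vcpu_nr(cpu)];
	} else {
		/* Set to NULL so that any access triggers a clear error. */
		per_cpu(xen_vcpu, cpu) = NULL;
	}
}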

This patch calls xen_hvm_init_time_ops() again later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

This issue can be reproduced on purpose via the below command at the guest
side when kdump/kexec is enabled (the crashing vcpu becomes the boot vcpu
of the kdump kernel, and vcpu 33 >= MAX_VIRT_CPUS):

"taskset -c 33 echo c > /proc/sysrq-trigger"

The bugfix for PVM is not implemented due to the lack of testing
environment.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
  - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
  - Add commit message that why PVM is not supported by this patch
  - Test if kexec/kdump works with mainline xen (HVM and PVM)
Changed since v3:
  - Re-use v2 but move the logic into xen_hvm_init_time_ops() (Suggested
by Boris Ostrovsky)

 arch/x86/xen/smp_hvm.c |  6 ++
 arch/x86/xen/time.c| 25 -
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..b70afdff419c 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,12 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
 */
xen_vcpu_setup(0);
 
+   /*
+* Called again in case the kernel boots on vcpu >= MAX_VIRT_CPUS.
+* Refer to comments in xen_hvm_init_time_ops().
+*/
+   xen_hvm_init_time_ops();
+
/*
 * The alternative logic (which patches the unlock/lock) runs before
 * the smp bootup up code is activated. Hence we need to set this up
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 55b3407358a9..dcf292cc859e 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -558,16 +558,36 @@ static void xen_hvm_setup_cpu_clockevents(void)
 
 void __init xen_hvm_init_time_ops(void)
 {
+   static bool hvm_time_initialized;
+
+   if (hvm_time_initialized)
+   return;
+
/*
 * vector callback is needed otherwise we cannot receive interrupts
 * on cpu > 0 and at this point we don't know how many cpus are
 * available.
 */
if (!xen_have_vector_callback)
-   return;
+   goto exit;
 
if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
+   goto exit;
+   }
+
+   /*
+* Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
+* The __this_cpu_read(xen_vcpu) is still NULL when the Xen HVM guest
+* boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec). Accessing
+* __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
+*
+* The xen_hvm_init_time_ops() should be called again later after
+* __this_cpu_read(xen_vcpu) is available.
+*/
+   if (!__this_cpu_read(xen_vcpu)) {
+   pr_info("Delay xen_init_time_common() as kernel is running on 
vcpu=%d\n",
+   xen_vcpu_nr(0));
return;
}
 
@@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
 
x86_platform.set_wallclock = xen_set_wallclock;
+
+exit:
+   hvm_time_initialized = true;
 }
 #endif
 
-- 
2.17.1




[PATCH v4 1/2] x86/xen/time: fix indentation issue

2022-03-02 Thread Dongli Zhang
No functional change.

Signed-off-by: Dongli Zhang 
---
 arch/x86/xen/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d9c945ee1100..55b3407358a9 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -45,7 +45,7 @@ static unsigned long xen_tsc_khz(void)
 
 static u64 xen_clocksource_read(void)
 {
-struct pvclock_vcpu_time_info *src;
+   struct pvclock_vcpu_time_info *src;
u64 ret;
 
preempt_disable_notrace();
-- 
2.17.1




[PATCH v4 0/2] xen: fix HVM kexec kernel panic

2022-03-02 Thread Dongli Zhang
This is v4 of the patch to fix the xen kexec kernel panic issue when
kexec is triggered on VCPU >= 32.

PANIC: early exception 0x0e IP 10:a96679b6 error 0 cr2 0x20
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
... ...
[0.00] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
... ...
[0.00] RSP: :aae03e10 EFLAGS: 00010082 ORIG_RAX: 

[0.00] RAX:  RBX: 0001 RCX: 0002
[0.00] RDX: 0003 RSI: aac37515 RDI: 0020
[0.00] RBP: 00011000 R08:  R09: 0001
[0.00] R10: aae03df8 R11: aae03c68 R12: 4004
[0.00] R13: aae03e50 R14:  R15: 
[0.00] FS:  () GS:ab588000() knlGS:
[0.00] CS:  0010 DS:  ES:  CR0: 80050033
[0.00] CR2: 0020 CR3: ea41 CR4: 000406a0
[0.00] DR0:  DR1:  DR2: 
[0.00] DR3:  DR6: fffe0ff0 DR7: 0400
[0.00] Call Trace:
[0.00]  
[0.00]  ? xen_clocksource_read+0x24/0x40
[0.00]  ? xen_init_time_common+0x5/0x49
[0.00]  ? xen_hvm_init_time_ops+0x23/0x45
[0.00]  ? xen_hvm_guest_init+0x221/0x25c
[0.00]  ? 0xa960
[0.00]  ? setup_arch+0x440/0xbd6
[0.00]  ? start_kernel+0x6c/0x695
[0.00]  ? secondary_startup_64_no_verify+0xd5/0xdb
[0.00]  


Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
  - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
  - Add commit message that why PVM is not supported by this patch
  - Test if kexec/kdump works with mainline xen (HVM and PVM)
Changed since v3:
  - Re-use v2 but move the logic into xen_hvm_init_time_ops() (Suggested
by Boris Ostrovsky)


I have tested with HVM VM on both old xen and mainline xen.

About the mainline xen, the 'soft_reset' works after I reset d->creation_reset
as suggested by Jan Beulich.

https://lore.kernel.org/all/d3814109-f4ba-9edb-1575-ab94faaeb...@suse.com/


Thank you very much!

Dongli Zhang





Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

2022-02-28 Thread Dongli Zhang
Hi Boris,

On 2/28/22 5:18 PM, Dongli Zhang wrote:
> Hi Boris,
> 
> On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>>
>>
>> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>>> Hi Boris,
>>>
>>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>>
>>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>>> kexec is triggered on VCPU >= 32.
>>>>>
>>>>> PANIC: early exception 0x0e IP 10:a96679b6 error 0 cr2 0x20
>>>>> [    0.00] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>>> [    0.00] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>>> [    0.00] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>>> ... ...
>>>>> [    0.00] RSP: :aae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>>> 
>>>>> [    0.00] RAX:  RBX: 0001 RCX:
>>>>> 0002
>>>>> [    0.00] RDX: 0003 RSI: aac37515 RDI:
>>>>> 0020
>>>>> [    0.00] RBP: 00011000 R08:  R09:
>>>>> 0001
>>>>> [    0.00] R10: aae03df8 R11: aae03c68 R12:
>>>>> 4004
>>>>> [    0.00] R13: aae03e50 R14:  R15:
>>>>> 
>>>>> [    0.00] FS:  () GS:ab588000()
>>>>> knlGS:
>>>>> [    0.00] CS:  0010 DS:  ES:  CR0: 80050033
>>>>> [    0.00] CR2: 0020 CR3: ea41 CR4:
>>>>> 000406a0
>>>>> [    0.00] DR0:  DR1:  DR2:
>>>>> 
>>>>> [    0.00] DR3:  DR6: fffe0ff0 DR7:
>>>>> 0400
>>>>> [    0.00] Call Trace:
>>>>> [    0.00]  
>>>>> [    0.00]  ? xen_clocksource_read+0x24/0x40
>>>>
>>>>
>>>> This is done to set xen_sched_clock_offset which I think will not be used
>>>> for a while, until sched_clock is called (and the other two uses are for
>>>> suspend/resume)
>>>>
>>>>
>>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' 
>>>> until
>>>> after all vcpu areas are properly set? Or are there other uses of
>>>> xen_clocksource_read() before ?
>>>>
>>>
>>> I have tested that the below patch still panics the kdump kernel.
>>>
>>
>>
>>
>> Oh well, so much for that then. Yes, sched_clock() is at least called from
>> printk path.
>>
>>
>> I guess we will have to go with v2 then, we don't want to start seeing time
>> going back, even if only with older hypervisors. The only thing I might ask
>> is that you roll the logic inside xen_hvm_init_time_ops(). Something like
>>
>>
>> xen_hvm_init_time_ops()
>> {
>> /*
>>  * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>>  * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>>  */
>> if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>>     return;
>>
> 
> I think you meant __this_cpu_read(xen_vcpu).
> 
> I will call xen_hvm_init_time_ops() at both places, and move the logic into
> xen_hvm_init_time_ops().
> 
> Thank you very much!
> 
> Dongli Zhang
> 


How about we do not move the logic into xen_hvm_init_time_ops()?

Suppose we move the logic into xen_hvm_init_time_ops() at line 573; then
line 570 might be printed twice.


559 void __init xen_hvm_init_time_ops(void)
560 {
561 /*
562  * vector callback is needed otherwise we cannot receive interrupts
563  * on cpu > 0 and at this point we don't know how many cpus are
564  * available.
565  */
566 if (!xen_have_vector_callback)
567 return;
568
569 if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
570 pr_info("Xen doesn't support pvclock on HVM, disable pv 
timer");
571 return;
572 }
573
574 xen_init_time_common();
575
576 x86_init.timers.setup_percpu_clockev = xen_time_init;
577 x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
578
579 x86_platform.set_wallclock = xen_set_wallclock;
580 }

I feel the code looks better if we keep the logic at the caller side. Would
you mind letting me know your feedback?

Thank you very much!

Dongli Zhang



Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

2022-02-28 Thread Dongli Zhang
Hi Boris,

On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
> 
> 
> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>
>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>> kexec is triggered on VCPU >= 32.
>>>>
>>>> PANIC: early exception 0x0e IP 10:a96679b6 error 0 cr2 0x20
>>>> [    0.00] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>> [    0.00] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>> [    0.00] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>> ... ...
>>>> [    0.00] RSP: :aae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>> 
>>>> [    0.00] RAX:  RBX: 0001 RCX:
>>>> 0002
>>>> [    0.00] RDX: 0003 RSI: aac37515 RDI:
>>>> 0020
>>>> [    0.00] RBP: 00011000 R08:  R09:
>>>> 0001
>>>> [    0.00] R10: aae03df8 R11: aae03c68 R12:
>>>> 4004
>>>> [    0.00] R13: aae03e50 R14:  R15:
>>>> 
>>>> [    0.00] FS:  () GS:ab588000()
>>>> knlGS:
>>>> [    0.00] CS:  0010 DS:  ES:  CR0: 80050033
>>>> [    0.00] CR2: 0020 CR3: ea41 CR4:
>>>> 000406a0
>>>> [    0.00] DR0:  DR1:  DR2:
>>>> 
>>>> [    0.00] DR3:  DR6: fffe0ff0 DR7:
>>>> 0400
>>>> [    0.00] Call Trace:
>>>> [    0.00]  
>>>> [    0.00]  ? xen_clocksource_read+0x24/0x40
>>>
>>>
>>> This is done to set xen_sched_clock_offset which I think will not be used
>>> for a while, until sched_clock is called (and the other two uses are for
>>> suspend/resume)
>>>
>>>
>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>> after all vcpu areas are properly set? Or are there other uses of
>>> xen_clocksource_read() before ?
>>>
>>
>> I have tested that below patch will panic kdump kernel.
>>
> 
> 
> 
> Oh well, so much for that then. Yes, sched_clock() is at least called from
> printk path.
> 
> 
> I guess we will have to go with v2 then, we don't want to start seeing time
> going back, even if only with older hypervisors. The only thing I might ask is
> that you roll the logic inside xen_hvm_init_time_ops(). Something like
> 
> 
> xen_hvm_init_time_ops()
> {
> /*
>  * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>  * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>  */
> if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>     return;
> 

I think you meant __this_cpu_read(xen_vcpu).

I will call xen_hvm_init_time_ops() at both places, and move the logic into
xen_hvm_init_time_ops().

Thank you very much!

Dongli Zhang



Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

2022-02-25 Thread Dongli Zhang
Hi Boris,

On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
> 
> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>> kexec is triggered on VCPU >= 32.
>>
>> PANIC: early exception 0x0e IP 10:a96679b6 error 0 cr2 0x20
>> [    0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>> [    0.00] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>> [    0.00] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>> ... ...
>> [    0.00] RSP: :aae03e10 EFLAGS: 00010082 ORIG_RAX: 
>> [    0.00] RAX:  RBX: 0001 RCX: 0002
>> [    0.00] RDX: 0003 RSI: aac37515 RDI: 0020
>> [    0.00] RBP: 00011000 R08:  R09: 0001
>> [    0.00] R10: aae03df8 R11: aae03c68 R12: 4004
>> [    0.00] R13: aae03e50 R14:  R15: 
>> [    0.00] FS:  () GS:ab588000() knlGS:
>> [    0.00] CS:  0010 DS:  ES:  CR0: 80050033
>> [    0.00] CR2: 0020 CR3: ea41 CR4: 000406a0
>> [    0.00] DR0:  DR1:  DR2: 
>> [    0.00] DR3:  DR6: fffe0ff0 DR7: 0400
>> [    0.00] Call Trace:
>> [    0.00]  
>> [    0.00]  ? xen_clocksource_read+0x24/0x40
> 
> 
> This is done to set xen_sched_clock_offset which I think will not be used for
> a while, until sched_clock is called (and the other two uses are for
> suspend/resume)
> 
> 
> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
> after all vcpu areas are properly set? Or are there other uses of
> xen_clocksource_read() before ?
> 

I have tested that below patch will panic kdump kernel.

diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..6a0c99941ae1 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,8 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
 */
xen_vcpu_setup(0);

+   xen_init_sched_clock_offset();
+
/*
 * The alternative logic (which patches the unlock/lock) runs before
 * the smp bootup up code is activated. Hence we need to set this up
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d9c945ee1100..8a2eafa0c215 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -520,9 +520,14 @@ static void __init xen_time_init(void)
	pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }

-static void __init xen_init_time_common(void)
+void xen_init_sched_clock_offset(void)
 {
xen_sched_clock_offset = xen_clocksource_read();
+}
+
+static void __init xen_init_time_common(void)
+{
+   xen_sched_clock_offset = 0;
static_call_update(pv_steal_clock, xen_steal_clock);
paravirt_set_sched_clock(xen_sched_clock);

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index fd0fec6e92f4..9f7656214dfb 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -69,6 +69,7 @@ void xen_teardown_timer(int cpu);
 void xen_setup_cpu_clockevents(void);
 void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
+void xen_init_sched_clock_offset(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);



Unfortunately, I am not able to obtain the panic call stack from the kdump
kernel this time. I only have the output below.

PANIC: early exception 0x0e IP 10:a6c679b6 error 0 cr2 0x20
PANIC: early exception 0x0e IP 10:a6c679b6 error 0 cr2 0x20



The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). Any printk uses sched_clock()
to obtain its timestamp.

vprintk_store()
 -> local_clock()
    -> sched_clock()
       -> paravirt_sched_clock()
          -> xen_sched_clock()
             -> xen_clocksource_read()


AFAIR, we started to encounter the issue with commit 857baa87b642
("sched/clock: Enable sched clock early"); kdump used to work well before that
commit.

Thank you very much!

Dongli Zhang



Re: [BUG REPORT] soft_reset (kexec/kdump) does not work with mainline xen

2022-02-25 Thread Dongli Zhang
Hi Jan,

On 2/24/22 11:15 PM, Jan Beulich wrote:
> On 24.02.2022 23:27, Dongli Zhang wrote:
>> Hello,
>>
>> This is to report that the soft_reset (kexec/kdump) has not been working
>> for me for a long time.
>>
>> I have tested again with the most recent mainline xen and the most recent
>> mainline kernel.
>>
>> While it works with my old xen version, it does not work with mainline xen.
>>
>>
>> This is the log of my HVM guest.
>>
>> Waiting for domain test-vm (domid 1) to die [pid 1265]
>> Domain 1 has shut down, reason code 5 0x5
>> Action for shutdown reason code 5 is soft-reset
>> Done. Rebooting now
>> xc: error: Failed to set d1's policy (err leaf 0x, subleaf 0x, msr 0x) (17 = File exists): Internal error
> 
> I don't suppose you tried to track down the origin of this EEXIST? I think
> it's pretty obvious, as in the handling of XEN_DOMCTL_set_cpu_policy we have
> 
> if ( d->creation_finished )
> ret = -EEXIST; /* No changing once the domain is running. */
> 
> Question is how to address it: One approach could be to clear
> d->creation_finished in domain_soft_reset(). But I think it would be more
> clean if the tool stack avoided trying to set the CPUID policy (again) on
> the guest when it soft-resets, as it's still the same guest after all.
> Cc-ing Andrew and Anthony for possible thoughts.
> 

The soft_reset on HVM is successful after I reset d->creation_finished at the
beginning of domain_soft_reset(). So far I am able to use this as a workaround
to test kexec/kdump.
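
For reference, the workaround is roughly the following (illustrative only,
not proposed as a fix):

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ (at the beginning of domain_soft_reset())
+    /* Workaround only: let the toolstack re-apply the CPUID policy. */
+    d->creation_finished = false;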

However, while my image's console works well on old xen versions, the console
on the mainline xen version does not work well.

I connect to the console with "xl console " immediately after the domU
panics (and kdump is triggered). I am not able to see the syslogs of the kdump
kernel on mainline xen. The same image works on an old xen version.

Thank you very much!

Dongli Zhang



[BUG REPORT] soft_reset (kexec/kdump) does not work with mainline xen

2022-02-24 Thread Dongli Zhang
Hello,

This is to report that the soft_reset (kexec/kdump) has not been working for me
for a long time.

I have tested again with the most recent mainline xen and the most recent
mainline kernel.

While it works with my old xen version, it does not work with mainline xen.


This is the log of my HVM guest.

Waiting for domain test-vm (domid 1) to die [pid 1265]
Domain 1 has shut down, reason code 5 0x5
Action for shutdown reason code 5 is soft-reset
Done. Rebooting now
xc: error: Failed to set d1's policy (err leaf 0x, subleaf 0x, msr 0x) (17 = File exists): Internal error
libxl: error: libxl_cpuid.c:490:libxl__cpuid_legacy: Domain 1:Failed to apply CPUID policy: File exists
libxl: error: libxl_create.c:1613:domcreate_rebuild_done: Domain 1:cannot (re-)build domain: -3
libxl: error: libxl_xshelp.c:201:libxl__xs_read_mandatory: xenstore read failed: `/libxl/1/type': No such file or directory
libxl: warning: libxl_dom.c:51:libxl__domain_type: unable to get domain type for domid=1, assuming HVM


Neither of below works.

# kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
--initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
--reuse-cmdline
# kexec -e

or

# taskset -c 0 echo c > /proc/sysrq-trigger


Thank you very much!

Dongli Zhang


[PATCH v3 0/1] xen: fix HVM kexec kernel panic

2022-02-24 Thread Dongli Zhang
This is the v3 of the patch to fix xen kexec kernel panic issue when the
kexec is triggered on VCPU >= 32.

PANIC: early exception 0x0e IP 10:a96679b6 error 0 cr2 0x20
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
[0.00] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
[0.00] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
... ...
[0.00] RSP: :aae03e10 EFLAGS: 00010082 ORIG_RAX: 
[0.00] RAX:  RBX: 0001 RCX: 0002
[0.00] RDX: 0003 RSI: aac37515 RDI: 0020
[0.00] RBP: 00011000 R08:  R09: 0001
[0.00] R10: aae03df8 R11: aae03c68 R12: 4004
[0.00] R13: aae03e50 R14:  R15: 
[0.00] FS:  () GS:ab588000() knlGS:
[0.00] CS:  0010 DS:  ES:  CR0: 80050033
[0.00] CR2: 0020 CR3: ea41 CR4: 000406a0
[0.00] DR0:  DR1:  DR2: 
[0.00] DR3:  DR6: fffe0ff0 DR7: 0400
[0.00] Call Trace:
[0.00]  
[0.00]  ? xen_clocksource_read+0x24/0x40
[0.00]  ? xen_init_time_common+0x5/0x49
[0.00]  ? xen_hvm_init_time_ops+0x23/0x45
[0.00]  ? xen_hvm_guest_init+0x221/0x25c
[0.00]  ? 0xa960
[0.00]  ? setup_arch+0x440/0xbd6
[0.00]  ? start_kernel+0x6c/0x695
[0.00]  ? secondary_startup_64_no_verify+0xd5/0xdb
[0.00]  


Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
  - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
  - Add commit message explaining why PVM is not supported by this patch
  - Test if kexec/kdump works with mainline xen (HVM and PVM)


I have delayed the xen_hvm_init_time_ops() for all VCPUs. Unfortunately,
now I am able to reproduce the clock going backwards, as shown below, on some
old versions of xen. I am not able to reproduce it on the most recent mainline xen.

[0.359687] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 [0] 24 25 26 27 28 29 30 31
[0.359694] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 [0] 40 41 42 43 44 45 46 47
[0.359701] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 [0] 56 57 58 59 60 61 62 63

... clock backward after the clocksource is switched from native to xen...

[0.04] Fallback order for Node 0: 0
[0.002967] Built 1 zonelists, mobility grouping on.  Total pages: 3527744
[0.007129] Policy zone: Normal
[0.008937] Kernel command line: 
BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty 
root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 
console=tty1 earlyprintk=ttyS0,115200n8 loglevel=7 no_timer_check 
reboot=s32 splash crashkernel=512M-:192M vt.handoff=1
[0.023880] Unknown kernel command line parameters "text splash 
BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty", will be 
passed to user space.
[0.032647] printk: log_buf_len individual max cpu contribution: 4096 bytes
[0.036828] printk: log_buf_len total cpu_extra contributions: 258048 bytes
[0.041049] printk: log_buf_len min size: 262144 bytes
[0.044481] printk: log_buf_len: 524288 bytes


Since now I am able to reproduce the clock going backwards on the old xen
version, please let me know if I should re-use the v2 of this patch, as it
has been running well in our env for a very long time.

https://lore.kernel.org/all/20211028012543.8776-1-dongli.zh...@oracle.com/


BTW, I have tested that 'soft_reset' does not work with mainline xen, even
when I directly trigger kexec with below commands.

# kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
--initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
--reuse-cmdline
# kexec -e


Thank you very much!

Dongli Zhang





[PATCH v3 1/1] xen: delay xen_hvm_init_time_ops() to xen_hvm_smp_prepare_boot_cpu()

2022-02-24 Thread Dongli Zhang
The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). In addition, with commit
38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
kernel in Xen HVM guest may panic at very early stage when accessing
&__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
    -> x86_init.hyper.init_platform = xen_hvm_guest_init()
       -> xen_hvm_init_time_ops()
          -> xen_clocksource_read()
             -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when a Xen HVM guest panics on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
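
For reference, the xen_vcpu_info_reset() logic in question
(arch/x86/xen/enlighten.c, abridged from my reading of the source) is:

void xen_vcpu_info_reset(int cpu)
{
	if (xen_vcpu_nr(cpu) < MAX_VIRT_CPUS) {
		per_cpu(xen_vcpu, cpu) =
			&HYPERVISOR_shared_info->vcpu_info[xen_vcpu_nr(cpu)];
	} else {
		/* Set to NULL so that any premature access oopses loudly. */
		per_cpu(xen_vcpu, cpu) = NULL;
	}
}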

This patch always delays xen_hvm_init_time_ops() to later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

Unfortunately, the 'soft_reset' (kexec) does not work with the mainline xen
version, so I can test this patch only with an HVM guest on an old xen
hypervisor where 'soft_reset' is working. The bugfix for PVM is not
implemented due to the lack of a testing environment.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
  - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
  - Add commit message explaining why PVM is not supported by this patch
  - Test if kexec/kdump works with mainline xen (HVM and PVM)

 arch/x86/xen/enlighten_hvm.c |  1 -
 arch/x86/xen/smp_hvm.c   | 17 +
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 517a9d8d8f94..53f306ec1d3b 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -216,7 +216,6 @@ static void __init xen_hvm_guest_init(void)
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
x86_init.irqs.intr_init = xen_init_IRQ;
-   xen_hvm_init_time_ops();
xen_hvm_init_mmu_ops();
 
 #ifdef CONFIG_KEXEC_CORE
diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..9a5efc1a1633 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,23 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
 */
xen_vcpu_setup(0);
 
+   /*
+* xen_hvm_init_time_ops() used to be called at very early stage
+* by xen_hvm_guest_init(). While only MAX_VIRT_CPUS 'vcpu_info'
+* are embedded inside 'shared_info', the VM would use them until
+* xen_vcpu_setup() is used to allocate/relocate them at arbitrary
+* address.
+*
+* However, when Xen HVM guest boots on vcpu >= MAX_VIRT_CPUS
+* (e.g., kexec kernel), per_cpu(xen_vcpu, cpu) is NULL at early
+* stage. To access per_cpu(xen_vcpu, cpu) via
+* xen_clocksource_read() would panic the kernel.
+*
+* Therefore we always delay xen_hvm_init_time_ops() to
+* xen_hvm_smp_prepare_boot_cpu() to avoid the panic.
+*/
+   xen_hvm_init_time_ops();
+
/*
 * The alternative logic (which patches the unlock/lock) runs before
 * the smp bootup up code is activated. Hence we need to set this up
-- 
2.17.1




Re: [PATCH] cpu/hotplug: Allow the CPU in CPU_UP_PREPARE state to be brought up again.

2021-11-23 Thread Dongli Zhang



On 11/23/21 3:50 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
wrote:
> 
> 
>> -----Original Message-----
>> From: Dongli Zhang [mailto:dongli.zh...@oracle.com]
>> Sent: Wednesday, November 24, 2021 5:22 AM
>> To: Sebastian Andrzej Siewior ; Longpeng (Mike, Cloud
>> Infrastructure Service Product Dept.) 
>> Cc: linux-ker...@vger.kernel.org; Gonglei (Arei) ;
>> x...@kernel.org; xen-devel@lists.xenproject.org; Peter Zijlstra
>> ; Ingo Molnar ; Valentin Schneider
>> ; Boris Ostrovsky ;
>> Juergen Gross ; Stefano Stabellini ;
>> Thomas Gleixner ; Ingo Molnar ; 
>> Borislav
>> Petkov ; Dave Hansen ; H. Peter
>> Anvin 
>> Subject: Re: [PATCH] cpu/hotplug: Allow the CPU in CPU_UP_PREPARE state to be
>> brought up again.
>>
>> Tested-by: Dongli Zhang 
>>
>>
>> The bug fixed by commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
>> monotonic raw clock") may leave the cpu_hotplug_state at CPU_UP_PREPARE. As a
>> result, onlining this CPU again (even after removal) always fails.
>>
>> I have tested that this patch works well to work around the issue, by
>> introducing either a mdelay(11000) or while(1); to start_secondary(). That
>> is, onlining the same CPU again is successful even after the initial
>> do_boot_cpu() failure.
>>
>> 1. add mdelay(11000) or while(1); to the start_secondary().
>>
>> 2. onlining the CPU fails at do_boot_cpu().
>>
> 
> Thanks for your testing :)
> 
> Does the cpu4 spin in wait_for_master_cpu() in your case ?

I did two tests.

TEST 1.

I added "mdelay(11000);" as the first line in start_secondary(). Once the issue
was encountered, the RIP of CPU=4 was 8c242021 (from QEMU's "info
registers -a") which was in the range of wait_for_master_cpu().

# cat /proc/kallsyms | grep 8c2420
8c242010 t wait_for_master_cpu
8c242030 T load_fixmap_gdt
8c242060 T native_write_cr4
8c2420c0 T cr4_init


TEST 2.

I added "while(true);" as the first line in start_secondary(). Once the issue
was encountered, the RIP of CPU=4 was 91654c0a (from QEMU's "info
registers -a") which was in the range of start_secondary().

# cat /proc/kallsyms | grep 91654c0
91654c00 t start_secondary

Dongli Zhang


> 
>> 3. onlining the CPU again fails without this patch.
>>
>> # echo 1 > /sys/devices/system/cpu/cpu4/online
>> -su: echo: write error: Input/output error
>>
>> 4. onlining the CPU again succeeds with this patch.
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>> On 11/22/21 7:47 AM, Sebastian Andrzej Siewior wrote:
>>> From: "Longpeng(Mike)" 
>>>
>>> A CPU will not show up in a virtualized environment which includes an
>>> Enclave. The VM splits its resources into a primary VM and an Enclave
>>> VM. While the Enclave is active, the hypervisor will ignore all requests
>>> to bring up a CPU and this CPU will remain in CPU_UP_PREPARE state.
>>> The kernel will wait up to ten seconds for the CPU to show up
>>> (do_boot_cpu()) and then roll the hotplug state back to
>>> CPUHP_OFFLINE, leaving the CPU state in CPU_UP_PREPARE. The CPU state is
>>> set back to CPUHP_TEARDOWN_CPU during the CPU_POST_DEAD stage.
>>>
>>> After the Enclave VM terminates, the primary VM can bring up the CPU
>>> again.
>>>
>>> Allow to bring up the CPU if it is in the CPU_UP_PREPARE state.
>>>
>>> [bigeasy: Rewrite commit description.]
>>>
>>> Signed-off-by: Longpeng(Mike) 
>>> Signed-off-by: Sebastian Andrzej Siewior 
>>> Link: https://urldefense.com/v3/__https://lore.kernel.org/r/20210901051143.2752-1-longpeng2@huawei.com__;!!ACWV5N9M2RV99hQ!d4sCCXMQV7ekFwpd21vo1_9K-m5h4VZ-gE8Z62PLL58DT4VJ6StH57TR_KpBdbwhBE0$
>>> ---
>>>
>>> For XEN: this changes the behaviour as it allows to invoke
>>> cpu_initialize_context() again should it have failed earlier. I *think*
>>> this is okay and would allow to bring up the CPU again should the memory
>>> allocation in cpu_initialize_context() fail.
>>>
>>>  kernel/smpboot.c | 7 +++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
>>> index f6bc0bc8a2aab..34958d7fe2c1c 100644
>>> --- a/kernel/smpboot.c
>>> +++ b/kernel/smpboot.c
>>> @@ -392,6 +392,13 @@ int cpu_check_up_prepare(int cpu)
>>>  */
>>> return -EAGAIN;
>>>
>>> +   case CPU_UP_PREPARE:
>>> +   /*
>>> +* Timeout while waiting for the CPU to show up. Allow to try
>>> +* again later.
>>> +*/
>>> +   return 0;
>>> +
>>> default:
>>>
>>> /* Should not happen.  Famous last words. */
>>>



Re: [PATCH] cpu/hotplug: Allow the CPU in CPU_UP_PREPARE state to be brought up again.

2021-11-23 Thread Dongli Zhang
Tested-by: Dongli Zhang 


The bug fixed by commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock") may leave the cpu_hotplug_state at CPU_UP_PREPARE. As a
result, onlining this CPU again (even after removal) always fails.

I have tested that this patch works well to work around the issue, by introducing
either a mdelay(11000) or while(1); to start_secondary(). That is, onlining the
same CPU again is successful even after the initial do_boot_cpu() failure.

1. add mdelay(11000) or while(1); to the start_secondary().

2. onlining the CPU fails at do_boot_cpu().

3. onlining the CPU again fails without this patch.

# echo 1 > /sys/devices/system/cpu/cpu4/online
-su: echo: write error: Input/output error

4. onlining the CPU again succeeds with this patch.

Thank you very much!

Dongli Zhang

On 11/22/21 7:47 AM, Sebastian Andrzej Siewior wrote:
> From: "Longpeng(Mike)" 
> 
> A CPU will not show up in a virtualized environment which includes an
> Enclave. The VM splits its resources into a primary VM and an Enclave
> VM. While the Enclave is active, the hypervisor will ignore all requests
> to bring up a CPU and this CPU will remain in CPU_UP_PREPARE state.
> The kernel will wait up to ten seconds for the CPU to show up
> (do_boot_cpu()) and then roll the hotplug state back to
> CPUHP_OFFLINE, leaving the CPU state in CPU_UP_PREPARE. The CPU state is
> set back to CPUHP_TEARDOWN_CPU during the CPU_POST_DEAD stage.
> 
> After the Enclave VM terminates, the primary VM can bring up the CPU
> again.
> 
> Allow to bring up the CPU if it is in the CPU_UP_PREPARE state.
> 
> [bigeasy: Rewrite commit description.]
> 
> Signed-off-by: Longpeng(Mike) 
> Signed-off-by: Sebastian Andrzej Siewior 
> Link: 
> https://urldefense.com/v3/__https://lore.kernel.org/r/20210901051143.2752-1-longpeng2@huawei.com__;!!ACWV5N9M2RV99hQ!d4sCCXMQV7ekFwpd21vo1_9K-m5h4VZ-gE8Z62PLL58DT4VJ6StH57TR_KpBdbwhBE0$
>  
> ---
> 
> For XEN: this changes the behaviour as it allows to invoke
> cpu_initialize_context() again should it have failed earlier. I *think*
> this is okay and would allow to bring up the CPU again should the memory
> allocation in cpu_initialize_context() fail.
> 
>  kernel/smpboot.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index f6bc0bc8a2aab..34958d7fe2c1c 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -392,6 +392,13 @@ int cpu_check_up_prepare(int cpu)
>*/
>   return -EAGAIN;
>  
> + case CPU_UP_PREPARE:
> + /*
> +  * Timeout while waiting for the CPU to show up. Allow to try
> +  * again later.
> +  */
> + return 0;
> +
>   default:
>  
>   /* Should not happen.  Famous last words. */
> 



Re: [PATCH v2 1/1] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

2021-11-04 Thread Dongli Zhang
Hi Boris,

On 11/1/21 10:34 AM, Boris Ostrovsky wrote:
> 
> On 10/27/21 9:25 PM, Dongli Zhang wrote:
>> The sched_clock() can be used very early since
>> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
>> with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from
>> 0"), kdump kernel in Xen HVM guest may panic at very early stage when
>> accessing &__this_cpu_read(xen_vcpu)->time as in below:
>>
>> setup_arch()
>>  -> init_hypervisor_platform()
>>     -> x86_init.hyper.init_platform = xen_hvm_guest_init()
>>        -> xen_hvm_init_time_ops()
>>           -> xen_clocksource_read()
>>              -> src = &__this_cpu_read(xen_vcpu)->time;
>>
>> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
>> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
>> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>>
>> However, when a Xen HVM guest panics on vcpu >= 32, since
>> xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
>> vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
>>
>> This patch delays xen_hvm_init_time_ops() to later in
>> xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
>> registered when the boot vcpu is >= 32.
>>
>> Another option is to always delay xen_hvm_init_time_ops() for any vcpus
>> (including vcpu=0). Since delaying xen_hvm_init_time_ops() may lead to a
>> clock backward issue,
> 
> 
> This is referring to
> https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg01516.html I
> assume?

No.

So far there are a clock forward issue (e.g., from 0 to 65345) and a clock
backward issue (e.g., from 2.432 to 0).

The clock forward issue can be resolved by the fix in the above link, which
enforces a clock update during vcpu_info registration. However, so far I am only
able to reproduce the clock forward with "taskset -c 33 echo c > /proc/sysrq-trigger".

That is, I am not able to see any clock forward drift during regular boot (on
CPU=0), without the fix at hypervisor side.

The clock backward issue arises because the native clock source is used if we
delay the initialization of the xen clock source. We will see a backward jump
when the source is switched from native to xen.
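
To make the switch point concrete, the relevant code in arch/x86/xen/time.c
is roughly (abridged):

static u64 xen_sched_clock(void)
{
	/* Starts near 0 because the offset is sampled at registration. */
	return xen_clocksource_read() - xen_sched_clock_offset;
}

static void __init xen_init_time_common(void)
{
	xen_sched_clock_offset = xen_clocksource_read();
	static_call_update(pv_steal_clock, xen_steal_clock);
	/* From here on, timestamps jump from native TSC time to ~0. */
	paravirt_set_sched_clock(xen_sched_clock);
}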

> 
> 
>>   it is preferred to avoid that for regular boot (The
>> pv_sched_clock=native_sched_clock() is used at the very beginning until
>> xen_sched_clock() is registered). That requires adjusting
>> xen_sched_clock_offset. That's why we only delay xen_hvm_init_time_ops()
>> for vcpu>=32.
> 
> 
> We delay only on VCPU>=32 because we want to avoid the clock going backwards
> due to the hypervisor problem pointed to by the link above, not because we
> need to adjust xen_sched_clock_offset (which we could if we wanted).

I will add that.

Just to clarify: so far I am not able to reproduce the clock backward issue
during regular boot (on CPU=0) when the fix is not available at the hypervisor
side.

> 
> 
>>
>> This issue can be reproduced on purpose via below command at the guest
>> side when kdump/kexec is enabled:
>>
>> "taskset -c 33 echo c > /proc/sysrq-trigger"
>>
>> Reference:
>> https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
>> ---
>> Changed since v1:
>>    - Add commit message to explain why xen_hvm_init_time_ops() is delayed
>>  for any vcpus. (Suggested by Boris Ostrovsky)
>>    - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
>>  code in xen_hvm_guest_init(). (suggested by Juergen Gross)
>>
>>   arch/x86/xen/enlighten_hvm.c | 20 +++-
>>   arch/x86/xen/smp_hvm.c   |  8 
>>   2 files changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
>> index e68ea5f4ad1c..7734dec52794 100644
>> --- a/arch/x86/xen/enlighten_hvm.c
>> +++ b/arch/x86/xen/enlighten_hvm.c
>> @@ -216,7 +216,25 @@ static void __init xen_hvm_guest_init(void)
>>   WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
>>   xen_unplug_emulated_devices();
>>   x86_init.irqs.intr_init = xen_init_IRQ;
>> -    xen_hvm_init_time_ops();
>> +
>> +    /*
>> + * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
>> + * and the VM would use them until xen_vcpu_setup() is used to
>> + * allocate/relocate them at arbitrary address.
>> + 

[PATCH v2 1/1] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

2021-10-27 Thread Dongli Zhang
The sched_clock() can be used very early since
commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from
0"), kdump kernel in Xen HVM guest may panic at very early stage when
accessing &__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
    -> x86_init.hyper.init_platform = xen_hvm_guest_init()
       -> xen_hvm_init_time_ops()
          -> xen_clocksource_read()
             -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when a Xen HVM guest panics on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch delays xen_hvm_init_time_ops() to later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

Another option is to always delay xen_hvm_init_time_ops() for any vcpus
(including vcpu=0). Since delaying xen_hvm_init_time_ops() may lead to a
clock backward issue, it is preferred to avoid that for regular boot (the
pv_sched_clock=native_sched_clock() is used at the very beginning until
xen_sched_clock() is registered). That requires adjusting
xen_sched_clock_offset. That's why we only delay xen_hvm_init_time_ops()
for vcpu>=32.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

Reference:
https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)

 arch/x86/xen/enlighten_hvm.c | 20 +++-
 arch/x86/xen/smp_hvm.c   |  8 
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e68ea5f4ad1c..7734dec52794 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -216,7 +216,25 @@ static void __init xen_hvm_guest_init(void)
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
x86_init.irqs.intr_init = xen_init_IRQ;
-   xen_hvm_init_time_ops();
+
+   /*
+* Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
+* and the VM would use them until xen_vcpu_setup() is used to
+* allocate/relocate them at arbitrary address.
+*
+* However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
+* per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
+* per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
+*
+* Therefore we delay xen_hvm_init_time_ops() to
+* xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
+*/
+   if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+   pr_info("Delay xen_hvm_init_time_ops() as kernel is running on vcpu=%d\n",
+   xen_vcpu_nr(0));
+   else
+   xen_hvm_init_time_ops();
+
xen_hvm_init_mmu_ops();
 
 #ifdef CONFIG_KEXEC_CORE
diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..f99043df8bb5 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,14 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
 */
xen_vcpu_setup(0);
 
+   /*
+* The xen_hvm_init_time_ops() is delayed from
+* xen_hvm_guest_init() to here to avoid panic when the kernel
+* boots from vcpu>=MAX_VIRT_CPUS (32).
+*/
+   if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+   xen_hvm_init_time_ops();
+
/*
 * The alternative logic (which patches the unlock/lock) runs before
 * the smp bootup up code is activated. Hence we need to set this up
-- 
2.17.1




[PATCH v2 1/1] xen: update system time immediately when VCPUOP_register_vcpu_info

2021-10-25 Thread Dongli Zhang
The guest may access the pv vcpu_time_info immediately after
VCPUOP_register_vcpu_info. This is to borrow the idea of
VCPUOP_register_vcpu_time_memory_area, where the
force_update_vcpu_system_time() is called immediately when the new memory
area is registered.

Otherwise, we may observe clock drift at the VM side if the VM accesses
the clocksource immediately after VCPUOP_register_vcpu_info().

Reference:
https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - Implement force_update_vcpu_system_time() for ARM
(suggested by Jan Beulich)
While I have tested ARM compilation with aarch64-linux-gnu-gcc, I did
not test on ARM platform.

 xen/arch/arm/time.c| 5 +
 xen/common/domain.c| 2 ++
 xen/include/asm-arm/time.h | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/xen/arch/arm/time.c b/xen/arch/arm/time.c
index 7dbd363537..dec53b5f7d 100644
--- a/xen/arch/arm/time.c
+++ b/xen/arch/arm/time.c
@@ -351,6 +351,11 @@ void update_vcpu_system_time(struct vcpu *v)
 /* XXX update shared_info->wc_* */
 }
 
+void force_update_vcpu_system_time(struct vcpu *v)
+{
+update_vcpu_system_time(v);
+}
+
 void domain_set_time_offset(struct domain *d, int64_t time_offset_seconds)
 {
 d->time_offset.seconds = time_offset_seconds;
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 8b53c49d1e..d71fcab88c 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1704,6 +1704,8 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 rc = map_vcpu_info(v, info.mfn, info.offset);
 domain_unlock(d);
 
+force_update_vcpu_system_time(v);
+
 break;
 }
 
diff --git a/xen/include/asm-arm/time.h b/xen/include/asm-arm/time.h
index 6b8fd839dd..4b401c1110 100644
--- a/xen/include/asm-arm/time.h
+++ b/xen/include/asm-arm/time.h
@@ -105,6 +105,8 @@ extern uint64_t ns_to_ticks(s_time_t ns);
 
 void preinit_xen_time(void);
 
+void force_update_vcpu_system_time(struct vcpu *v);
+
 #endif /* __ARM_TIME_H__ */
 /*
  * Local variables:
-- 
2.17.1




Re: [PATCH linux 1/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

2021-10-24 Thread Dongli Zhang
Hi Boris,

On 10/12/21 10:17 AM, Boris Ostrovsky wrote:
> 
> On 10/12/21 3:24 AM, Dongli Zhang wrote:
>> The sched_clock() can be used very early since upstream
>> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
>> with upstream commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock
>> time from 0"), kdump kernel in Xen HVM guest may panic at very early stage
>> when accessing &__this_cpu_read(xen_vcpu)->time as in below:
> 
> 
> Please drop "upstream". It's always upstream here.
> 
> 
>> +
>> +    /*
>> + * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
>> + * and the VM would use them until xen_vcpu_setup() is used to
>> + * allocate/relocate them at arbitrary address.
>> + *
>> + * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
>> + * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
>> + * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
>> + *
>> + * Therefore we delay xen_hvm_init_time_ops() to
>> + * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
>> + */
>> +    if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
> 
> 
> What about always deferring this when panicking? Would that work?
> 
> 
> Deciding whether to defer based on cpu number feels a bit awkward.
> 
> 
> -boris
> 

I did some tests and I do not think this works well. I prefer to delay the
initialization only for VCPU >= 32.

This is the syslog if we always delay xen_hvm_init_time_ops(), regardless of
whether VCPU >= 32.

[0.032372] Booting paravirtualized kernel on Xen HVM
[0.032376] clocksource: refined-jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1910969940391419 ns
[0.037683] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:2
[0.041876] percpu: Embedded 49 pages/cpu s162968 r8192 d29544 u262144

--> There is a clock backwards from 0.041876 to 0.10.

[0.10] Built 2 zonelists, mobility grouping on.  Total pages: 2015744
[0.12] Policy zone: Normal
[0.14] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-rc6xen+
root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8
console=tty1 crashkernel=512M-:192M


This is because the initial pv_sched_clock is native_sched_clock(), and it
switches to xen_sched_clock() in xen_hvm_init_time_ops(). Is it fine to always
have a clock backward jump for the non-kdump kernel?

To avoid the clock going backwards, we may register a dummy clocksource which always
returns 0, before xen_hvm_init_time_ops(). I do not think this is reasonable.
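
Just to be explicit about what that rejected idea would mean, a hypothetical
sketch:

/*
 * Hypothetical and rejected: freeze sched_clock() at 0 until
 * xen_sched_clock() is registered, trading the backward jump for
 * frozen printk timestamps.
 */
static u64 dummy_sched_clock(void)
{
	return 0;
}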

Thank you very much!

Dongli Zhang



[PATCH 1/1] xen/netfront: stop tx queues during live migration

2021-10-22 Thread Dongli Zhang
The tx queues are not stopped during the live migration. As a result,
ndo_start_xmit() may access netfront_info->queues, which is freed by
talk_to_netback()->xennet_destroy_queues().

This patch calls netif_device_detach() at the beginning of xen-netfront
resume, and netif_device_attach() at the end of resume.

 CPU A                                 CPU B

 talk_to_netback()
  -> if (info->queues)
         xennet_destroy_queues(info);
     to free netfront_info->queues

                                       xennet_start_xmit()
                                       to access netfront_info->queues

  -> err = xennet_create_queues(info, &num_queues);

The idea is borrowed from virtio-net.
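
For comparison, the virtio-net suspend path does roughly the following
(abridged from drivers/net/virtio_net.c and quoted from memory, so treat it
as a sketch rather than the exact upstream code):

static void virtnet_freeze_down(struct virtio_device *vdev)
{
	struct virtnet_info *vi = vdev->priv;

	/* Make sure no work handler is accessing the device. */
	flush_work(&vi->config_work);

	netif_tx_lock_bh(vi->dev);
	netif_device_detach(vi->dev);
	netif_tx_unlock_bh(vi->dev);
}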

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Since I am not able to reproduce the corner case on purpose, I created a
patch to reproduce it:
https://raw.githubusercontent.com/finallyjustice/patchset/master/xen-netfront-send-GARP-during-live-migration.patch

 drivers/net/xen-netfront.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e31b98403f31..fc41ba95f81d 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1730,6 +1730,10 @@ static int netfront_resume(struct xenbus_device *dev)
 
	dev_dbg(&dev->dev, "%s\n", dev->nodename);
 
+   netif_tx_lock_bh(info->netdev);
+   netif_device_detach(info->netdev);
+   netif_tx_unlock_bh(info->netdev);
+
xennet_disconnect_backend(info);
return 0;
 }
@@ -2349,6 +2353,10 @@ static int xennet_connect(struct net_device *dev)
 * domain a kick because we've probably just requeued some
 * packets.
 */
+   netif_tx_lock_bh(np->netdev);
+   netif_device_attach(np->netdev);
+   netif_tx_unlock_bh(np->netdev);
+
netif_carrier_on(np->netdev);
for (j = 0; j < num_queues; ++j) {
	queue = &np->queues[j];
-- 
2.17.1




Re: [PATCH xen 2/2] xen: update system time immediately when VCPUOP_register_vcpu_info

2021-10-12 Thread Dongli Zhang
Hi Jan,

On 10/12/21 8:49 AM, Jan Beulich wrote:
> On 12.10.2021 17:43, Dongli Zhang wrote:
>> Hi Jan,
>>
>> On 10/12/21 1:40 AM, Jan Beulich wrote:
>>> On 12.10.2021 09:24, Dongli Zhang wrote:
>>>> The guest may access the pv vcpu_time_info immediately after
>>>> VCPUOP_register_vcpu_info. This is to borrow the idea of
>>>> VCPUOP_register_vcpu_time_memory_area, where the
>>>> force_update_vcpu_system_time() is called immediately when the new memory
>>>> area is registered.
>>>>
>>>> Otherwise, we may observe clock drift at the VM side if the VM accesses
>>>> the clocksource immediately after VCPUOP_register_vcpu_info().
>>>>
>>>> Cc: Joe Jin 
>>>> Signed-off-by: Dongli Zhang 
>>>
>>> While I agree with the change in principle, ...
>>>
>>>> --- a/xen/common/domain.c
>>>> +++ b/xen/common/domain.c
>>>> @@ -1695,6 +1695,8 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>  rc = map_vcpu_info(v, info.mfn, info.offset);
>>>>  domain_unlock(d);
>>>>  
>>>> +force_update_vcpu_system_time(v);
>>>
>>> ... I'm afraid you're breaking the Arm build here. Arm will first need
>>> to gain this function.
>>>
>>
>> Since I am not familiar with the Xen ARM, would you please let me know your
>> suggestion if I just leave ARM as a TODO for the ARM developers to verify
>> and implement force_update_vcpu_system_time()?
> 
> I'd much prefer to avoid this. I don't think the function can be that
> difficult to introduce. And I'm sure the Arm maintainers will apply
> extra care during review if you point out that you weren't able to
> actually test this.
> 

I do not see pvclock used by arm/arm64 in linux kernel for xen.

In addition, the implementation at xen hypervisor side is empty.

348 /* VCPU PV clock. */
349 void update_vcpu_system_time(struct vcpu *v)
350 {
351 /* XXX update shared_info->wc_* */
352 }

I will add a wrapper for it.

The bad thing is that riscv is also supported by xen, so we may need to add
the function for riscv as well.
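
If riscv ends up in the same situation as Arm, the wrapper would presumably
mirror the Arm one just mentioned; hypothetical and untested:

/* Hypothetical riscv wrapper, mirroring the Arm one; untested. */
void force_update_vcpu_system_time(struct vcpu *v)
{
    update_vcpu_system_time(v);
}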

Thank you very much!

Dongli Zhang



Re: [PATCH 0/2] Fix the Xen HVM kdump/kexec boot panic issue

2021-10-12 Thread Dongli Zhang
Hi Juergen,

On 10/12/21 1:47 AM, Juergen Gross wrote:
> On 12.10.21 09:24, Dongli Zhang wrote:
>> When kdump/kexec is enabled at the HVM VM side, a kernel panic will trap
>> to the xen side with reason=soft_reset. As a result, xen will reboot the VM
>> with the kdump kernel.
>>
>> Unfortunately, when the VM panics with the below command line ...
>>
>> "taskset -c 33 echo c > /proc/sysrq-trigger"
>>
>> ... the kdump kernel panics at an early stage ...
>>
>> PANIC: early exception 0x0e IP 10:a8c66876 error 0 cr2 0x20
>> [    0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc5xen #1
>> [    0.00] Hardware name: Xen HVM domU
>> [    0.00] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>> ... ...
>> [    0.00] RSP: :aa203e20 EFLAGS: 00010082 ORIG_RAX: 
>> [    0.00] RAX: 0003 RBX: 0001 RCX: dfff
>> [    0.00] RDX: 0003 RSI: dfff RDI: 0020
>> [    0.00] RBP: 00011000 R08:  R09: 0001
>> [    0.00] R10: aa203e00 R11: aa203c70 R12: 4004
>> [    0.00] R13: aa203e5c R14: aa203e58 R15: 
>> [    0.00] FS:  () GS:aa95e000() knlGS:
>> [    0.00] CS:  0010 DS:  ES:  CR0: 80050033
>> [    0.00] CR2: 0020 CR3: ec9e CR4: 000406a0
>> [    0.00] DR0:  DR1:  DR2: 
>> [    0.00] DR3:  DR6: fffe0ff0 DR7: 0400
>> [    0.00] Call Trace:
>> [    0.00]  ? xen_init_time_common+0x11/0x55
>> [    0.00]  ? xen_hvm_init_time_ops+0x23/0x45
>> [    0.00]  ? xen_hvm_guest_init+0x214/0x251
>> [    0.00]  ? 0xa8c0
>> [    0.00]  ? setup_arch+0x440/0xbd6
>> [    0.00]  ? start_kernel+0x6a/0x689
>> [    0.00]  ? secondary_startup_64_no_verify+0xc2/0xcb
>>
>> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
>> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
>> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>>
>>
>> The 1st patch is to fix the issue at VM kernel side. However, we may
>> observe clock drift at VM side due to the issue at xen hypervisor side.
>> This is because the pv vcpu_time_info is not updated when
>> VCPUOP_register_vcpu_info.
>>
>> The 2nd patch is to force_update_vcpu_system_time() at xen side when
>> VCPUOP_register_vcpu_info, to avoid the VM clock drift during kdump kernel
>> boot.
> 
> Please don't mix patches for multiple projects in one series.
> 
> In cases like this it is fine to mention the other project's patch
> verbally instead.
> 

I will split the patchset in v2 and email to different projects.

The core ideas of this combined patchset are:

1. Fix at HVM domU side (kdump kernel panic)

2. Fix at Xen hypervisor side (clock drift issue in kdump kernel)

3. To report (and seek help for) the fact that soft_reset does not work with
mainline xen, which means I am not able to test my patchset with the most
recent mainline xen.

Thank you very much!

Dongli Zhang



Re: [PATCH xen 2/2] xen: update system time immediately when VCPUOP_register_vcpu_info

2021-10-12 Thread Dongli Zhang
Hi Jan,

On 10/12/21 1:40 AM, Jan Beulich wrote:
> On 12.10.2021 09:24, Dongli Zhang wrote:
>> The guest may access the pv vcpu_time_info immediately after
>> VCPUOP_register_vcpu_info. This is to borrow the idea of
>> VCPUOP_register_vcpu_time_memory_area, where the
>> force_update_vcpu_system_time() is called immediately when the new memory
>> area is registered.
>>
>> Otherwise, we may observe clock drift at the VM side if the VM accesses
>> the clocksource immediately after VCPUOP_register_vcpu_info().
>>
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
> 
> While I agree with the change in principle, ...
> 
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1695,6 +1695,8 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
>>  rc = map_vcpu_info(v, info.mfn, info.offset);
>>  domain_unlock(d);
>>  
>> +force_update_vcpu_system_time(v);
> 
> ... I'm afraid you're breaking the Arm build here. Arm will first need
> to gain this function.
> 

Since I am not familiar with the Xen ARM, would you please let me know your
suggestion if I just leave ARM as a TODO for the ARM developers to verify
and implement force_update_vcpu_system_time()?

I have tested that the below can build with arm64/aarch64.

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 40d67ec342..644c65ecd3 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1695,6 +1695,13 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 rc = map_vcpu_info(v, info.mfn, info.offset);
 domain_unlock(d);

+#ifdef CONFIG_X86
+/*
+ * TODO: ARM does not have force_update_vcpu_system_time().
+ */
+force_update_vcpu_system_time(v);
+#endif
+
 break;
 }



Thank you very much!

Dongli Zhang



[PATCH linux 1/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

2021-10-12 Thread Dongli Zhang
The sched_clock() can be used very early since upstream
commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
with upstream commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock
time from 0"), kdump kernel in Xen HVM guest may panic at very early stage
when accessing &__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
    -> x86_init.hyper.init_platform = xen_hvm_guest_init()
       -> xen_hvm_init_time_ops()
          -> xen_clocksource_read()
             -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when a Xen HVM guest panics on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch delays xen_hvm_init_time_ops() to later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/x86/xen/enlighten_hvm.c | 20 +++-
 arch/x86/xen/smp_hvm.c   |  3 +++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e68ea5f4ad1c..152279416d9a 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -216,7 +216,25 @@ static void __init xen_hvm_guest_init(void)
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
x86_init.irqs.intr_init = xen_init_IRQ;
-   xen_hvm_init_time_ops();
+
+   /*
+* Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
+* and the VM would use them until xen_vcpu_setup() is used to
+* allocate/relocate them at arbitrary address.
+*
+* However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
+* per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
+* per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
+*
+* Therefore we delay xen_hvm_init_time_ops() to
+* xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
+*/
+   if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+   pr_info("Delay xen_hvm_init_time_ops() as kernel is running on vcpu=%d\n",
+   xen_vcpu_nr(0));
+   else
+   xen_hvm_init_time_ops();
+
xen_hvm_init_mmu_ops();
 
 #ifdef CONFIG_KEXEC_CORE
diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..60cd4fafd188 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,9 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
 */
xen_vcpu_setup(0);
 
+   if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+   xen_hvm_init_time_ops();
+
/*
 * The alternative logic (which patches the unlock/lock) runs before
 * the smp bootup up code is activated. Hence we need to set this up
-- 
2.17.1




[PATCH 0/2] Fix the Xen HVM kdump/kexec boot panic issue

2021-10-12 Thread Dongli Zhang
When kdump/kexec is enabled at the HVM VM side, a kernel panic will trap
to the xen side with reason=soft_reset. As a result, xen will reboot the VM
with the kdump kernel.

Unfortunately, when the VM panics with the below command line ...

"taskset -c 33 echo c > /proc/sysrq-trigger"

... the kdump kernel panics at an early stage ...

PANIC: early exception 0x0e IP 10:a8c66876 error 0 cr2 0x20
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc5xen #1
[0.00] Hardware name: Xen HVM domU
[0.00] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
... ...
[0.00] RSP: :aa203e20 EFLAGS: 00010082 ORIG_RAX: 
[0.00] RAX: 0003 RBX: 0001 RCX: dfff
[0.00] RDX: 0003 RSI: dfff RDI: 0020
[0.00] RBP: 00011000 R08:  R09: 0001
[0.00] R10: aa203e00 R11: aa203c70 R12: 4004
[0.00] R13: aa203e5c R14: aa203e58 R15: 
[0.00] FS:  () GS:aa95e000() knlGS:
[0.00] CS:  0010 DS:  ES:  CR0: 80050033
[0.00] CR2: 0020 CR3: ec9e CR4: 000406a0
[0.00] DR0:  DR1:  DR2: 
[0.00] DR3:  DR6: fffe0ff0 DR7: 0400
[0.00] Call Trace:
[0.00]  ? xen_init_time_common+0x11/0x55
[0.00]  ? xen_hvm_init_time_ops+0x23/0x45
[0.00]  ? xen_hvm_guest_init+0x214/0x251
[0.00]  ? 0xa8c0
[0.00]  ? setup_arch+0x440/0xbd6
[0.00]  ? start_kernel+0x6a/0x689
[0.00]  ? secondary_startup_64_no_verify+0xc2/0xcb

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.


The 1st patch is to fix the issue at VM kernel side. However, we may
observe clock drift at VM side due to the issue at xen hypervisor side.
This is because the pv vcpu_time_info is not updated when
VCPUOP_register_vcpu_info.

The 2nd patch is to force_update_vcpu_system_time() at xen side when
VCPUOP_register_vcpu_info, to avoid the VM clock drift during kdump kernel
boot.


I did test the fix by backporting the 2nd patch to an old xen version,
because I am not able to use soft_reset successfully with mainline xen. I
have encountered the below error when testing soft_reset with mainline
xen. Please let me know if there is any known issue/solution.

# xl -v create -F vm.cfg
... ...
... ...
Domain 1 has shut down, reason code 5 0x5
Action for shutdown reason code 5 is soft-reset
Done. Rebooting now
xc: error: Failed to set d1's policy (err leaf 0x, subleaf 0x, msr 0x) (17 = File exists): Internal error
libxl: error: libxl_cpuid.c:488:libxl__cpuid_legacy: Domain 1:Failed to apply CPUID policy: File exists
libxl: error: libxl_create.c:1573:domcreate_rebuild_done: Domain 1:cannot (re-)build domain: -3
libxl: error: libxl_xshelp.c:201:libxl__xs_read_mandatory: xenstore read failed: `/libxl/1/type': No such file or directory
libxl: warning: libxl_dom.c:53:libxl__domain_type: unable to get domain type for domid=1, assuming HVM


Thank you very much!

Dongli Zhang





[PATCH xen 2/2] xen: update system time immediately when VCPUOP_register_vcpu_info

2021-10-12 Thread Dongli Zhang
The guest may access the pv vcpu_time_info immediately after
VCPUOP_register_vcpu_info. This is to borrow the idea of
VCPUOP_register_vcpu_time_memory_area, where the
force_update_vcpu_system_time() is called immediately when the new memory
area is registered.

Otherwise, we may observe clock drift at the VM side if the VM accesses
the clocksource immediately after VCPUOP_register_vcpu_info().

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 xen/common/domain.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 40d67ec342..c879f6723b 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1695,6 +1695,8 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 rc = map_vcpu_info(v, info.mfn, info.offset);
 domain_unlock(d);
 
+force_update_vcpu_system_time(v);
+
 break;
 }
 
-- 
2.17.1




[PATCH RFC v1 6/6] xen-swiotlb: enable 64-bit xen-swiotlb

2021-02-03 Thread Dongli Zhang
This patch is to enable the 64-bit xen-swiotlb buffer.

For a Xen PVM DMA address, a 64-bit capable device will be able to allocate
from the 64-bit swiotlb buffer.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 drivers/xen/swiotlb-xen.c | 117 --
 1 file changed, 74 insertions(+), 43 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index e18cae693cdc..c9ab07809e32 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -108,27 +108,36 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr));
unsigned long xen_pfn = bfn_to_local_pfn(bfn);
phys_addr_t paddr = (phys_addr_t)xen_pfn << XEN_PAGE_SHIFT;
+   int i;
 
/* If the address is outside our domain, it CAN
 * have the same virtual address as another address
 * in our domain. Therefore _only_ check address within our domain.
 */
-   if (pfn_valid(PFN_DOWN(paddr))) {
-   return paddr >= virt_to_phys(xen_io_tlb_start[SWIOTLB_LO]) &&
-  paddr < virt_to_phys(xen_io_tlb_end[SWIOTLB_LO]);
-   }
+   if (!pfn_valid(PFN_DOWN(paddr)))
+   return 0;
+
+   for (i = 0; i < swiotlb_nr; i++)
+   if (paddr >= virt_to_phys(xen_io_tlb_start[i]) &&
+   paddr < virt_to_phys(xen_io_tlb_end[i]))
+   return 1;
+
return 0;
 }
 
 static int
-xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
+xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs,
+ enum swiotlb_t type)
 {
int i, rc;
int dma_bits;
dma_addr_t dma_handle;
phys_addr_t p = virt_to_phys(buf);
 
-   dma_bits = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT) + PAGE_SHIFT;
+   if (type == SWIOTLB_HI)
+   dma_bits = max_dma_bits[SWIOTLB_HI];
+   else
+   dma_bits = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT) + PAGE_SHIFT;
 
i = 0;
do {
@@ -139,7 +148,7 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
p + (i << IO_TLB_SHIFT),
get_order(slabs << IO_TLB_SHIFT),
dma_bits, &dma_handle);
-   } while (rc && dma_bits++ < max_dma_bits[SWIOTLB_LO]);
+   } while (rc && dma_bits++ < max_dma_bits[type]);
if (rc)
return rc;
 
@@ -147,16 +156,17 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
} while (i < nslabs);
return 0;
 }
-static unsigned long xen_set_nslabs(unsigned long nr_tbl)
+
+static unsigned long xen_set_nslabs(unsigned long nr_tbl, enum swiotlb_t type)
 {
if (!nr_tbl) {
-   xen_io_tlb_nslabs[SWIOTLB_LO] = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
-   xen_io_tlb_nslabs[SWIOTLB_LO] = ALIGN(xen_io_tlb_nslabs[SWIOTLB_LO],
+   xen_io_tlb_nslabs[type] = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
+   xen_io_tlb_nslabs[type] = ALIGN(xen_io_tlb_nslabs[type],
  IO_TLB_SEGSIZE);
} else
-   xen_io_tlb_nslabs[SWIOTLB_LO] = nr_tbl;
+   xen_io_tlb_nslabs[type] = nr_tbl;
 
-   return xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
+   return xen_io_tlb_nslabs[type] << IO_TLB_SHIFT;
 }
 
 enum xen_swiotlb_err {
@@ -180,23 +190,24 @@ static const char *xen_swiotlb_error(enum xen_swiotlb_err err)
}
return "";
 }
-int __ref xen_swiotlb_init(int verbose, bool early)
+
+static int xen_swiotlb_init_type(int verbose, bool early, enum swiotlb_t type)
 {
unsigned long bytes, order;
int rc = -ENOMEM;
enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN;
unsigned int repeat = 3;
 
-   xen_io_tlb_nslabs[SWIOTLB_LO] = swiotlb_nr_tbl(SWIOTLB_LO);
+   xen_io_tlb_nslabs[type] = swiotlb_nr_tbl(type);
 retry:
-   bytes = xen_set_nslabs(xen_io_tlb_nslabs[SWIOTLB_LO]);
-   order = get_order(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
+   bytes = xen_set_nslabs(xen_io_tlb_nslabs[type], type);
+   order = get_order(xen_io_tlb_nslabs[type] << IO_TLB_SHIFT);
 
/*
 * IO TLB memory already allocated. Just use it.
 */
-   if (io_tlb_start[SWIOTLB_LO] != 0) {
-   xen_io_tlb_start[SWIOTLB_LO] = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
+   if (io_tlb_start[type] != 0) {
+   xen_io_tlb_start[type] = phys_to_virt(io_tlb_start[type]);
goto end;
}
 
@@ -204,81 +215,95 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 * Get IO TLB memory from any location.
 */
if (early) {
-  

[PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays

2021-02-03 Thread Dongli Zhang
This patch converts several xen-swiotlb related variables to arrays, in
order to maintain per-buffer state for the different swiotlb buffers. Here
are the variables involved:

- xen_io_tlb_start and xen_io_tlb_end
- xen_io_tlb_nslabs
- MAX_DMA_BITS

There is no functional change and this is to prepare to enable 64-bit
xen-swiotlb.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 drivers/xen/swiotlb-xen.c | 75 +--
 1 file changed, 40 insertions(+), 35 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 662638093542..e18cae693cdc 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -39,15 +39,17 @@
 #include 
 
 #include 
-#define MAX_DMA_BITS 32
 /*
  * Used to do a quick range check in swiotlb_tbl_unmap_single and
 * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */
 
-static char *xen_io_tlb_start, *xen_io_tlb_end;
-static unsigned long xen_io_tlb_nslabs;
+static char *xen_io_tlb_start[SWIOTLB_MAX], *xen_io_tlb_end[SWIOTLB_MAX];
+static unsigned long xen_io_tlb_nslabs[SWIOTLB_MAX];
+
+static int max_dma_bits[] = {32, 64};
+
 /*
  * Quick lookup value of the bus address of the IOTLB.
  */
@@ -112,8 +114,8 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 * in our domain. Therefore _only_ check address within our domain.
 */
if (pfn_valid(PFN_DOWN(paddr))) {
-   return paddr >= virt_to_phys(xen_io_tlb_start) &&
-  paddr < virt_to_phys(xen_io_tlb_end);
+   return paddr >= virt_to_phys(xen_io_tlb_start[SWIOTLB_LO]) &&
+  paddr < virt_to_phys(xen_io_tlb_end[SWIOTLB_LO]);
}
return 0;
 }
@@ -137,7 +139,7 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
p + (i << IO_TLB_SHIFT),
get_order(slabs << IO_TLB_SHIFT),
dma_bits, &dma_handle);
-   } while (rc && dma_bits++ < MAX_DMA_BITS);
+   } while (rc && dma_bits++ < max_dma_bits[SWIOTLB_LO]);
if (rc)
return rc;
 
@@ -148,12 +150,13 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
 static unsigned long xen_set_nslabs(unsigned long nr_tbl)
 {
if (!nr_tbl) {
-   xen_io_tlb_nslabs = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
-   xen_io_tlb_nslabs = ALIGN(xen_io_tlb_nslabs, IO_TLB_SEGSIZE);
+   xen_io_tlb_nslabs[SWIOTLB_LO] = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
+   xen_io_tlb_nslabs[SWIOTLB_LO] = ALIGN(xen_io_tlb_nslabs[SWIOTLB_LO],
+ IO_TLB_SEGSIZE);
} else
-   xen_io_tlb_nslabs = nr_tbl;
+   xen_io_tlb_nslabs[SWIOTLB_LO] = nr_tbl;
 
-   return xen_io_tlb_nslabs << IO_TLB_SHIFT;
+   return xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 }
 
 enum xen_swiotlb_err {
@@ -184,16 +187,16 @@ int __ref xen_swiotlb_init(int verbose, bool early)
enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN;
unsigned int repeat = 3;
 
-   xen_io_tlb_nslabs = swiotlb_nr_tbl(SWIOTLB_LO);
+   xen_io_tlb_nslabs[SWIOTLB_LO] = swiotlb_nr_tbl(SWIOTLB_LO);
 retry:
-   bytes = xen_set_nslabs(xen_io_tlb_nslabs);
-   order = get_order(xen_io_tlb_nslabs << IO_TLB_SHIFT);
+   bytes = xen_set_nslabs(xen_io_tlb_nslabs[SWIOTLB_LO]);
+   order = get_order(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
 
/*
 * IO TLB memory already allocated. Just use it.
 */
if (io_tlb_start[SWIOTLB_LO] != 0) {
-   xen_io_tlb_start = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
+   xen_io_tlb_start[SWIOTLB_LO] = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
goto end;
}
 
@@ -201,76 +204,78 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 * Get IO TLB memory from any location.
 */
if (early) {
-   xen_io_tlb_start = memblock_alloc(PAGE_ALIGN(bytes),
+   xen_io_tlb_start[SWIOTLB_LO] = memblock_alloc(PAGE_ALIGN(bytes),
  PAGE_SIZE);
-   if (!xen_io_tlb_start)
+   if (!xen_io_tlb_start[SWIOTLB_LO])
panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
  __func__, PAGE_ALIGN(bytes), PAGE_SIZE);
} else {
 #define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))
 #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-   xen_io_tlb_start = (void *)xen_get_swiotlb_free_pages(order);
-   if (xen_io_tlb_start)
+   

[PATCH RFC v1 4/6] swiotlb: enable 64-bit swiotlb

2021-02-03 Thread Dongli Zhang
This patch is to enable the 64-bit swiotlb buffer.

The state-of-the-art swiotlb pre-allocates <=32-bit memory in order to meet
the DMA mask requirement of some 32-bit legacy devices. Considering that most
devices nowadays support 64-bit DMA and an IOMMU is available, the swiotlb is
not used most of the time, except:

1. The xen PVM domain requires the DMA addresses to be both (1) less than the
DMA mask, and (2) contiguous in machine address. Therefore, a 64-bit
device may still require swiotlb on a PVM domain.

2. From the source code, AMD SME/SEV enables SWIOTLB_FORCE. As a result it
is always required to allocate from the swiotlb buffer even when the device
DMA mask is 64-bit.

sme_early_init()
-> if (sev_active())
   swiotlb_force = SWIOTLB_FORCE;

Therefore, this patch introduces the 2nd swiotlb buffer for 64-bit DMA
access. For instance, swiotlb_tbl_map_single() allocates from the 2nd
64-bit buffer if the device DMA mask
min_not_zero(*hwdev->dma_mask, hwdev->bus_dma_limit) is 64-bit.

An example to configure the 64-bit swiotlb is "swiotlb=65536,524288,force"
or "swiotlb=,524288,force", where 524288 is the slab count (1024MB) of the
64-bit buffer.

With the patch, the kernel will be able to allocate >4GB swiotlb buffer.
This patch is only for swiotlb, not including xen-swiotlb.
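
To make the buffer selection above concrete, below is a minimal sketch
(not the actual patch; the helper name and the exact policy are
assumptions) of how a mapping path could pick the buffer from the device
DMA mask:

static enum swiotlb_t swiotlb_choose_type(struct device *hwdev)
{
        u64 mask = min_not_zero(*hwdev->dma_mask, hwdev->bus_dma_limit);

        /* Only a device with a full 64-bit mask may use the high buffer. */
        if (swiotlb_nr > 1 && mask == DMA_BIT_MASK(64))
                return SWIOTLB_HI;

        return SWIOTLB_LO;
}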

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/mips/cavium-octeon/dma-octeon.c |   3 +-
 arch/powerpc/kernel/dma-swiotlb.c|   2 +-
 arch/powerpc/platforms/pseries/svm.c |   2 +-
 arch/x86/kernel/pci-swiotlb.c|   5 +-
 arch/x86/pci/sta2x11-fixup.c |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_internal.c |   4 +-
 drivers/gpu/drm/i915/i915_scatterlist.h  |   2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c|   2 +-
 drivers/mmc/host/sdhci.c |   2 +-
 drivers/pci/xen-pcifront.c   |   2 +-
 drivers/xen/swiotlb-xen.c|   9 +-
 include/linux/swiotlb.h  |  28 +-
 kernel/dma/swiotlb.c | 339 +++
 13 files changed, 238 insertions(+), 164 deletions(-)

diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c
index df70308db0e6..3480555d908a 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -245,6 +245,7 @@ void __init plat_swiotlb_setup(void)
panic("%s: Failed to allocate %zu bytes align=%lx\n",
  __func__, swiotlbsize, PAGE_SIZE);
 
-   if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs, 1) == -ENOMEM)
+   if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs,
+ SWIOTLB_LO, 1) == -ENOMEM)
panic("Cannot allocate SWIOTLB buffer");
 }
diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
index fc7816126a40..88113318c53f 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -20,7 +20,7 @@ void __init swiotlb_detect_4g(void)
 static int __init check_swiotlb_enabled(void)
 {
if (ppc_swiotlb_enable)
-   swiotlb_print_info();
+   swiotlb_print_info(SWIOTLB_LO);
else
swiotlb_exit();
 
diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 9f8842d0da1f..77910e4ffad8 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -52,7 +52,7 @@ void __init svm_swiotlb_init(void)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
-   if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
+   if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, SWIOTLB_LO, 
false))
return;
 
if (io_tlb_start[SWIOTLB_LO])
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..950624fd95a4 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -67,12 +67,15 @@ void __init pci_swiotlb_init(void)
 
 void __init pci_swiotlb_late_init(void)
 {
+   int i;
+
/* An IOMMU turned us off. */
if (!swiotlb)
swiotlb_exit();
else {
printk(KERN_INFO "PCI-DMA: "
   "Using software bounce buffering for IO (SWIOTLB)\n");
-   swiotlb_print_info();
+   for (i = 0; i < swiotlb_nr; i++)
+   swiotlb_print_info(i);
}
 }
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 7d2525691854..c440520b2055 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -57,7 +57,7 @@ static void sta2x11_new_instance(struct pci_dev *pdev)
int size = STA2X11_SWIOTLB_SIZE;
/* First instance

[PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-03 Thread Dongli Zhang
This patch converts several swiotlb related variables to arrays, in
order to maintain stat/status for different swiotlb buffers. Here are the
variables involved:

- io_tlb_start and io_tlb_end
- io_tlb_nslabs and io_tlb_used
- io_tlb_list
- io_tlb_index
- max_segment
- io_tlb_orig_addr
- no_iotlb_memory

There is no functional change and this is to prepare to enable 64-bit
swiotlb.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 arch/powerpc/platforms/pseries/svm.c |   6 +-
 drivers/xen/swiotlb-xen.c|   4 +-
 include/linux/swiotlb.h  |   5 +-
 kernel/dma/swiotlb.c | 257 ++-
 4 files changed, 140 insertions(+), 132 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 7b739cc7a8a9..9f8842d0da1f 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -55,9 +55,9 @@ void __init svm_swiotlb_init(void)
if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
return;
 
-   if (io_tlb_start)
-   memblock_free_early(io_tlb_start,
-   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+   if (io_tlb_start[SWIOTLB_LO])
+   memblock_free_early(io_tlb_start[SWIOTLB_LO],
+   PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
panic("SVM: Cannot allocate SWIOTLB buffer");
 }
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 2b385c1b4a99..3261880ad859 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -192,8 +192,8 @@ int __ref xen_swiotlb_init(int verbose, bool early)
/*
 * IO TLB memory already allocated. Just use it.
 */
-   if (io_tlb_start != 0) {
-   xen_io_tlb_start = phys_to_virt(io_tlb_start);
+   if (io_tlb_start[SWIOTLB_LO] != 0) {
+   xen_io_tlb_start = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
goto end;
}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ca125c1b1281..777046cd4d1b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,11 +76,12 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
-extern phys_addr_t io_tlb_start, io_tlb_end;
+extern phys_addr_t io_tlb_start[], io_tlb_end[];
 
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
-   return paddr >= io_tlb_start && paddr < io_tlb_end;
+   return paddr >= io_tlb_start[SWIOTLB_LO] &&
+  paddr < io_tlb_end[SWIOTLB_LO];
 }
 
 void __init swiotlb_exit(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 7c42df6e6100..1fbb65daa2dd 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -69,38 +69,38 @@ enum swiotlb_force swiotlb_force;
 * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */
-phys_addr_t io_tlb_start, io_tlb_end;
+phys_addr_t io_tlb_start[SWIOTLB_MAX], io_tlb_end[SWIOTLB_MAX];
 
 /*
  * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
  * io_tlb_end.  This is command line adjustable via setup_io_tlb_npages.
  */
-static unsigned long io_tlb_nslabs;
+static unsigned long io_tlb_nslabs[SWIOTLB_MAX];
 
 /*
  * The number of used IO TLB block
  */
-static unsigned long io_tlb_used;
+static unsigned long io_tlb_used[SWIOTLB_MAX];
 
 /*
  * This is a free list describing the number of free entries available from
  * each index
  */
-static unsigned int *io_tlb_list;
-static unsigned int io_tlb_index;
+static unsigned int *io_tlb_list[SWIOTLB_MAX];
+static unsigned int io_tlb_index[SWIOTLB_MAX];
 
 /*
  * Max segment that we can provide which (if pages are contingous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
  */
-static unsigned int max_segment;
+static unsigned int max_segment[SWIOTLB_MAX];
 
 /*
  * We need to save away the original address corresponding to a mapped entry
  * for the sync operations.
  */
 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
-static phys_addr_t *io_tlb_orig_addr;
+static phys_addr_t *io_tlb_orig_addr[SWIOTLB_MAX];
 
 /*
  * Protect the above data structures in the map and unmap calls
@@ -113,9 +113,9 @@ static int __init
 setup_io_tlb_npages(char *str)
 {
if (isdigit(*str)) {
-   io_tlb_nslabs = simple_strtoul(str, &str, 0);
+   io_tlb_nslabs[SWIOTLB_LO] = simple_strtoul(str, &str, 0);
/* avoid tail segment of size < IO_TLB_SEGSIZE */
-   io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+   io_tlb_nslabs[SWIOTLB_LO] = ALIGN(io_tlb_nslabs[SWIOTLB_LO], IO_TLB_SEGSIZE);
}
if (*str == ',')
++str;
@@ -123,40 +123,40 @@ setup_io_tlb_npages(char *str)
swiotlb_fo

[PATCH RFC v1 3/6] swiotlb: introduce swiotlb_get_type() to calculate swiotlb buffer type

2021-02-03 Thread Dongli Zhang
This patch introduces swiotlb_get_type() in order to calculate which
swiotlb buffer the given DMA address belongs to.

This is to prepare to enable 64-bit swiotlb.
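
As a usage illustration (not part of this patch; the caller context is
assumed), a check can translate a physical address into the buffer index
before touching any per-buffer array:

static bool paddr_in_swiotlb(phys_addr_t paddr)
{
        int type = swiotlb_get_type(paddr);

        if (type < 0)
                return false;   /* not inside any swiotlb buffer */

        /* 'type' can now index per-buffer state, e.g. io_tlb_used[type]. */
        return true;
}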

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 include/linux/swiotlb.h | 14 ++
 kernel/dma/swiotlb.c|  2 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 777046cd4d1b..3d5980d77810 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -3,6 +3,7 @@
 #define __LINUX_SWIOTLB_H
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -23,6 +24,8 @@ enum swiotlb_t {
SWIOTLB_MAX,
 };
 
+extern int swiotlb_nr;
+
 /*
  * Maximum allowable number of contiguous slabs to map,
  * must be a power of 2.  What is the appropriate value ?
@@ -84,6 +87,17 @@ static inline bool is_swiotlb_buffer(phys_addr_t paddr)
   paddr < io_tlb_end[SWIOTLB_LO];
 }
 
+static inline int swiotlb_get_type(phys_addr_t paddr)
+{
+   int i;
+
+   for (i = 0; i < swiotlb_nr; i++)
+   if (paddr >= io_tlb_start[i] && paddr < io_tlb_end[i])
+   return i;
+
+   return -ENOENT;
+}
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1fbb65daa2dd..c91d3d2c3936 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -109,6 +109,8 @@ static DEFINE_SPINLOCK(io_tlb_lock);
 
 static int late_alloc;
 
+int swiotlb_nr = 1;
+
 static int __init
 setup_io_tlb_npages(char *str)
 {
-- 
2.17.1




[PATCH RFC v1 0/6] swiotlb: 64-bit DMA buffer

2021-02-03 Thread Dongli Zhang
This RFC is to introduce the 2nd swiotlb buffer for 64-bit DMA access.  The
prototype is based on v5.11-rc6.

The state-of-the-art swiotlb pre-allocates <=32-bit memory in order to meet
the DMA mask requirement of some 32-bit legacy devices. Considering that most
devices nowadays support 64-bit DMA and an IOMMU is available, the swiotlb is
not used most of the time, except:

1. The xen PVM domain requires the DMA addresses to be both (1) <= the device
DMA mask, and (2) contiguous in machine address. Therefore, a 64-bit
device may still require swiotlb on a PVM domain.

2. From the source code, AMD SME/SEV enables SWIOTLB_FORCE. As a result it
is always required to allocate from the swiotlb buffer even when the device
DMA mask is 64-bit.

sme_early_init()
-> if (sev_active())
   swiotlb_force = SWIOTLB_FORCE;


Therefore, this RFC introduces the 2nd swiotlb buffer for 64-bit DMA
access.  For instance, swiotlb_tbl_map_single() allocates from the 2nd
64-bit buffer if the device DMA mask min_not_zero(*hwdev->dma_mask,
hwdev->bus_dma_limit) is 64-bit.  With the RFC, Xen/AMD will be able to
allocate a >4GB swiotlb buffer.

Since the buffer is 64-bit, it could also (not in this patch set, but
certainly possible) be allocated at runtime, meaning the size could change
depending on the device MMIO buffers, etc.


I have tested the patch set on Xen PVM dom0 boot via QEMU. The dom0 is
booted via:

qemu-system-x86_64 -smp 8 -m 20G -enable-kvm -vnc :9 \
-net nic -net user,hostfwd=tcp::5029-:22 \
-hda disk.img \
-device nvme,drive=nvme0,serial=deudbeaf1,max_ioqpairs=16 \
-drive file=test.qcow2,if=none,id=nvme0 \
-serial stdio

The "swiotlb=65536,1048576,force" is to configure the 32-bit swiotlb as
128MB and the 64-bit swiotlb as 2048MB. The "force" keyword enforces bounce
buffering through the swiotlb.
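
For clarity, below is a minimal sketch (error handling omitted; the names
mirror this series, but the exact parsing code is an assumption) of how
setup_io_tlb_npages() can accept the two comma-separated slab counts:

static int __init setup_io_tlb_npages(char *str)
{
        int i;

        /* "swiotlb=<32-bit nslabs>,<64-bit nslabs>[,force]" */
        for (i = 0; i < SWIOTLB_MAX; i++) {
                if (isdigit(*str)) {
                        io_tlb_nslabs[i] = simple_strtoul(str, &str, 0);
                        /* avoid tail segment of size < IO_TLB_SEGSIZE */
                        io_tlb_nslabs[i] = ALIGN(io_tlb_nslabs[i], IO_TLB_SEGSIZE);
                }
                if (*str == ',')
                        ++str;
        }

        if (!strcmp(str, "force"))
                swiotlb_force = SWIOTLB_FORCE;

        return 0;
}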

vm# cat /proc/cmdline 
placeholder root=UUID=4e942d60-c228-4caf-b98e-f41c365d9703 ro text
swiotlb=65536,1048576,force quiet splash

[5.119877] Booting paravirtualized kernel on Xen
... ...
[5.190423] software IO TLB: Low Mem mapped [mem 
0x000234e0-0x00023ce0] (128MB)
[6.276161] software IO TLB: High Mem mapped [mem 
0x000166f33000-0x0001e6f33000] (2048MB)

0x000234e0 is mapped to 0x001c (32-bit machine address)
0x00023ce0-1 is mapped to 0x0ff3 (32-bit machine address)
0x000166f33000 is mapped to 0x0004b728 (64-bit machine address)
0x0001e6f33000-1 is mapped to 0x00033a07 (64-bit machine address)


While running fio for the emulated NVMe, the swiotlb allocates from the
64-bit buffer, as shown by io_tlb_used-highmem.

vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs
65536
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used
258
vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs-highmem
1048576
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used-highmem
58880


I also tested virtio-scsi (with "disable-legacy=on,iommu_platform=true") on a
VM with AMD SEV enabled.

qemu-system-x86_64 -enable-kvm -machine q35 -smp 36 -m 20G \
-drive if=pflash,format=raw,unit=0,file=OVMF_CODE.pure-efi.fd,readonly \
-drive if=pflash,format=raw,unit=1,file=OVMF_VARS.fd \
-hda ol7-uefi.qcow2 -serial stdio -vnc :9 \
-net nic -net user,hostfwd=tcp::5029-:22 \
-cpu EPYC -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1 \
-machine memory-encryption=sev0 \
-device virtio-scsi-pci,id=scsi,disable-legacy=on,iommu_platform=true \
-device scsi-hd,drive=disk0 \
-drive file=test.qcow2,if=none,id=disk0,format=qcow2

The "swiotlb=65536,1048576" is to configure the 32-bit swiotlb as 128MB and
the 64-bit swiotlb as 2048MB. We do not need to force swiotlb because AMD SEV
will set SWIOTLB_FORCE.

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.11.0-rc6swiotlb+ root=/dev/mapper/ol-root ro
crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet
LANG=en_US.UTF-8 swiotlb=65536,1048576

[0.729790] AMD Memory Encryption Features active: SEV
... ...
[2.113147] software IO TLB: Low Mem mapped [mem 
0x73e1e000-0x7be1e000] (128MB)
[2.113151] software IO TLB: High Mem mapped [mem 
0x0004e840-0x00056840] (2048MB)

While running fio for virtio-scsi, the swiotlb is allocating from 64-bit
io_tlb_used-highmem.

vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs
65536
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used
0
vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs-highmem
1048576
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used-highmem
64647


Please let me know if there is any feedback for this idea and RFC.


Dongli Zhang (6):
   swiotlb: define new enumerated type
   swiotlb: convert variables to arrays
   swiotlb: introduce swiotlb_get_type() to calculate swiotlb buffer type
   swiotlb: enable 64-bit swiotlb
   xen-swiotlb: convert variables to arrays
   xen-swiotlb: enable 64-bit xen-swiotlb

 arch/mips/cavium-octeon/dma-octeon.c |   3 +-
 arch/powerpc/kernel/dma-swiotlb.c|   2 +-
 arch/powerpc/platforms/pseries/svm.c |   8 +-
 arch/x86/kernel/pci-swiotlb.c  

[PATCH RFC v1 1/6] swiotlb: define new enumerated type

2021-02-03 Thread Dongli Zhang
This is just to define new enumerated type without functional change.

The 'SWIOTLB_LO' is to index legacy 32-bit swiotlb buffer, while the
'SWIOTLB_HI' is to index the 64-bit buffer.

This is to prepare to enable 64-bit swiotlb.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 include/linux/swiotlb.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index d9c9fc9ca5d2..ca125c1b1281 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -17,6 +17,12 @@ enum swiotlb_force {
SWIOTLB_NO_FORCE,   /* swiotlb=noforce */
 };
 
+enum swiotlb_t {
+   SWIOTLB_LO,
+   SWIOTLB_HI,
+   SWIOTLB_MAX,
+};
+
 /*
  * Maximum allowable number of contiguous slabs to map,
  * must be a power of 2.  What is the appropriate value ?
-- 
2.17.1




Re: [PATCH v2] xen-blkback: fix compatibility bug with single page rings

2021-01-29 Thread Dongli Zhang



On 1/29/21 12:13 AM, Paul Durrant wrote:
>> -Original Message-
>> From: Jürgen Groß 
>> Sent: 29 January 2021 07:35
>> To: Dongli Zhang ; Paul Durrant ; xen-
>> de...@lists.xenproject.org; linux-bl...@vger.kernel.org; 
>> linux-ker...@vger.kernel.org
>> Cc: Paul Durrant ; Konrad Rzeszutek Wilk 
>> ; Roger Pau
>> Monné ; Jens Axboe 
>> Subject: Re: [PATCH v2] xen-blkback: fix compatibility bug with single page 
>> rings
>>
>> On 29.01.21 07:20, Dongli Zhang wrote:
>>>
>>>
>>> On 1/28/21 5:04 AM, Paul Durrant wrote:
>>>> From: Paul Durrant 
>>>>
>>>> Prior to commit 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid
>>>> inconsistent xenstore 'ring-page-order' set by malicious blkfront"), the
>>>> behaviour of xen-blkback when connecting to a frontend was:
>>>>
>>>> - read 'ring-page-order'
>>>> - if not present then expect a single page ring specified by 'ring-ref'
>>>> - else expect a ring specified by 'ring-refX' where X is between 0 and
>>>>1 << ring-page-order
>>>>
>>>> This was correct behaviour, but was broken by the aforementioned commit to
>>>> become:
>>>>
>>>> - read 'ring-page-order'
>>>> - if not present then expect a single page ring (i.e. ring-page-order = 0)
>>>> - expect a ring specified by 'ring-refX' where X is between 0 and
>>>>1 << ring-page-order
>>>> - if that didn't work then see if there's a single page ring specified by
>>>>'ring-ref'
>>>>
>>>> This incorrect behaviour works most of the time but fails when a frontend
>>>> that sets 'ring-page-order' is unloaded and replaced by one that does not
>>>> because, instead of reading 'ring-ref', xen-blkback will read the stale
>>>> 'ring-ref0' left around by the previous frontend will try to map the wrong
>>>> grant reference.
>>>>
>>>> This patch restores the original behaviour.
>>>>
>>>> Fixes: 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid 
>>>> inconsistent xenstore 'ring-page-
>> order' set by malicious blkfront")
>>>> Signed-off-by: Paul Durrant 
>>>> ---
>>>> Cc: Konrad Rzeszutek Wilk 
>>>> Cc: "Roger Pau Monné" 
>>>> Cc: Jens Axboe 
>>>> Cc: Dongli Zhang 
>>>>
>>>> v2:
>>>>   - Remove now-spurious error path special-case when nr_grefs == 1
>>>> ---
>>>>   drivers/block/xen-blkback/common.h |  1 +
>>>>   drivers/block/xen-blkback/xenbus.c | 38 +-
>>>>   2 files changed, 17 insertions(+), 22 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkback/common.h 
>>>> b/drivers/block/xen-blkback/common.h
>>>> index b0c71d3a81a0..524a79f10de6 100644
>>>> --- a/drivers/block/xen-blkback/common.h
>>>> +++ b/drivers/block/xen-blkback/common.h
>>>> @@ -313,6 +313,7 @@ struct xen_blkif {
>>>>
>>>>struct work_struct  free_work;
>>>>unsigned intnr_ring_pages;
>>>> +  boolmulti_ref;
>>>
>>> Is it really necessary to introduce 'multi_ref' here or we may just re-use
>>> 'nr_ring_pages'?
>>>
>>> According to blkfront code, 'ring-page-order' is set only when it is not 
>>> zero,
>>> that is, only when (info->nr_ring_pages > 1).
>>
> 
> That's how it is *supposed* to be. Windows certainly behaves that way too.
> 
>> Did you look into all other OS's (Windows, OpenBSD, FreebSD, NetBSD,
>> Solaris, Netware, other proprietary systems) implementations to verify
>> that claim?
>>
>> I don't think so. So better safe than sorry.
>>
> 
> Indeed. It was unfortunate that the commit to blkif.h documenting multi-page 
> (829f2a9c6dfae) was not crystal clear and (possibly as a consequence) blkback 
> was implemented to read ring-ref0 rather than ring-ref if ring-page-order was 
> present and 0. Hence the only safe thing to do is to restore that behaviour.
> 

Thank you very much for the explanation!

Reviewed-by: Dongli Zhang 

Dongli Zhang



Re: [PATCH v2] xen-blkback: fix compatibility bug with single page rings

2021-01-28 Thread Dongli Zhang



On 1/28/21 5:04 AM, Paul Durrant wrote:
> From: Paul Durrant 
> 
> Prior to commit 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid
> inconsistent xenstore 'ring-page-order' set by malicious blkfront"), the
> behaviour of xen-blkback when connecting to a frontend was:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring specified by 'ring-ref'
> - else expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> 
> This was correct behaviour, but was broken by the aforementioned commit to
> become:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring (i.e. ring-page-order = 0)
> - expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> - if that didn't work then see if there's a single page ring specified by
>   'ring-ref'
> 
> This incorrect behaviour works most of the time but fails when a frontend
> that sets 'ring-page-order' is unloaded and replaced by one that does not
> because, instead of reading 'ring-ref', xen-blkback will read the stale
> 'ring-ref0' left around by the previous frontend will try to map the wrong
> grant reference.
> 
> This patch restores the original behaviour.
> 
> Fixes: 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid 
> inconsistent xenstore 'ring-page-order' set by malicious blkfront")
> Signed-off-by: Paul Durrant 
> ---
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Jens Axboe 
> Cc: Dongli Zhang 
> 
> v2:
>  - Remove now-spurious error path special-case when nr_grefs == 1
> ---
>  drivers/block/xen-blkback/common.h |  1 +
>  drivers/block/xen-blkback/xenbus.c | 38 +-
>  2 files changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/common.h 
> b/drivers/block/xen-blkback/common.h
> index b0c71d3a81a0..524a79f10de6 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -313,6 +313,7 @@ struct xen_blkif {
>  
>   struct work_struct  free_work;
>   unsigned intnr_ring_pages;
> + boolmulti_ref;

Is it really necessary to introduce 'multi_ref' here or we may just re-use
'nr_ring_pages'?

According to blkfront code, 'ring-page-order' is set only when it is not zero,
that is, only when (info->nr_ring_pages > 1).

1819 if (info->nr_ring_pages > 1) {
1820 err = xenbus_printf(xbt, dev->nodename, "ring-page-order",
"%u",
1821 ring_page_order);
1822 if (err) {
1823 message = "writing ring-page-order";
1824 goto abort_transaction;
1825 }
1826 }

Therefore, can we assume 'ring-page-order' can never be 0? Once we have
'ring-page-order' set, it should be >= 1 and we should read from "ring-ref%u"?

If the specification allows 'ring-page-order' to be zero with "ring-ref%u"
available, we should introduce 'multi_ref'.
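
For example, the restored behaviour could be derived as in the sketch
below (illustrative only, based on the description in this patch; the
declarations are assumed from the surrounding connect_ring() context):

        unsigned int ring_page_order;
        int err;

        err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order",
                           "%u", &ring_page_order);
        if (err != 1) {
                blkif->nr_ring_pages = 1;       /* legacy single page, 'ring-ref' */
                blkif->multi_ref = false;
        } else {
                blkif->nr_ring_pages = 1 << ring_page_order;
                blkif->multi_ref = true;        /* 'ring-ref%u', even if order is 0 */
        }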

Thank you very much!

Dongli Zhang


>   /* All rings for this device. */
>   struct xen_blkif_ring   *rings;
>   unsigned intnr_rings;
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index 9860d4842f36..6c5e9373e91c 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -998,14 +998,17 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   for (i = 0; i < nr_grefs; i++) {
>   char ring_ref_name[RINGREF_NAME_LEN];
>  
> - snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> + if (blkif->multi_ref)
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
> i);
> + else {
> + WARN_ON(i != 0);
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref");
> + }
> +
>   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
>  "%u", &ring_ref[i]);
>  
>   if (err != 1) {
> - if (nr_grefs == 1)
> - break;
> -
>   err = -EINVAL;
>   xenbus_dev_fatal(dev, err, "reading %s/%s",
>dir, ring_ref_name);
> @@ -1013,18 +1016,6 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   }
>   }
>  
> - i

Re: [PATCH] xen-blkback: fix compatibility bug with single page rings

2021-01-27 Thread Dongli Zhang



On 1/27/21 2:30 AM, Paul Durrant wrote:
> From: Paul Durrant 
> 
> Prior to commit 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid
> inconsistent xenstore 'ring-page-order' set by malicious blkfront"), the
> behaviour of xen-blkback when connecting to a frontend was:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring specified by 'ring-ref'
> - else expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> 
> This was correct behaviour, but was broken by the aforementioned commit to
> become:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring
> - expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> - if that didn't work then see if there's a single page ring specified by
>   'ring-ref'
> 
> This incorrect behaviour works most of the time but fails when a frontend
> that sets 'ring-page-order' is unloaded and replaced by one that does not
> because, instead of reading 'ring-ref', xen-blkback will read the stale
> 'ring-ref0' left around by the previous frontend will try to map the wrong
> grant reference.
> 
> This patch restores the original behaviour.
> 
> Fixes: 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid 
> inconsistent xenstore 'ring-page-order' set by malicious blkfront")
> Signed-off-by: Paul Durrant 
> ---
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Jens Axboe 
> Cc: Dongli Zhang 
> ---
>  drivers/block/xen-blkback/common.h |  1 +
>  drivers/block/xen-blkback/xenbus.c | 36 +-
>  2 files changed, 17 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/common.h 
> b/drivers/block/xen-blkback/common.h
> index b0c71d3a81a0..524a79f10de6 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -313,6 +313,7 @@ struct xen_blkif {
>  
>   struct work_struct  free_work;
>   unsigned intnr_ring_pages;
> + boolmulti_ref;
>   /* All rings for this device. */
>   struct xen_blkif_ring   *rings;
>   unsigned intnr_rings;
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index 9860d4842f36..4c1541cde68c 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -998,10 +998,15 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   for (i = 0; i < nr_grefs; i++) {
>   char ring_ref_name[RINGREF_NAME_LEN];
>  
> - snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> + if (blkif->multi_ref)
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
> i);
> + else {
> + WARN_ON(i != 0);
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref");
> + }
> +
>   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
>  "%u", &ring_ref[i]);
> -
>   if (err != 1) {
>       if (nr_grefs == 1)
>   break;

I think we should not simply break here, because the failure can also occur
when (nr_grefs == 1) and we are reading from the legacy "ring-ref".

Should we do something as below?

err = -EINVAL;
xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
return err;

Dongli Zhang


> @@ -1013,18 +1018,6 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   }
>   }
>  
> - if (err != 1) {
> - WARN_ON(nr_grefs != 1);
> -
> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
> -    &ring_ref[0]);
> - if (err != 1) {
> - err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
> - return err;
> - }
> - }
> -
>   err = -ENOMEM;
>   for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
>   req = kzalloc(sizeof(*req), GFP_KERNEL);
> @@ -1129,10 +1122,15 @@ static int connect_ring(struct backend_info *be)
>blkif->nr_rings, blkif->blk_protocol, protocol,
>blkif->vbd.feature_gnt_persistent ? "persistent grants" : "");
>  
> - ring_page_order = xenbus_read_unsigned(dev->otherend,
> -&

[PATCH 1/1] xen/manage: enable C_A_D to force reboot

2020-05-13 Thread Dongli Zhang
systemd may be configured to mask ctrl-alt-del via "systemctl mask
ctrl-alt-del.target". As a result, the PV reboot would not work, as the
signal is ignored.

This patch always enables C_A_D before calling ctrl_alt_del() in order
to force the reboot.
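
For context, ctrl_alt_del() in kernel/reboot.c behaves roughly as below
(paraphrased sketch, not part of this patch), which is why setting C_A_D
forces the reboot path instead of signalling the masked init/systemd:

void ctrl_alt_del(void)
{
        static DECLARE_WORK(cad_work, deferred_cad);

        if (C_A_D)
                schedule_work(&cad_work);       /* deferred kernel_restart() */
        else
                kill_cad_pid(SIGINT, 1);        /* ignored when the target is masked */
}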

Reported-by: Rose Wang 
Cc: Joe Jin 
Cc: Boris Ostrovsky 
Signed-off-by: Dongli Zhang 
---
 drivers/xen/manage.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..3190d0ecb52e 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -204,6 +204,13 @@ static void do_poweroff(void)
 static void do_reboot(void)
 {
shutting_down = SHUTDOWN_POWEROFF; /* ? */
+   /*
+* The systemd may be configured to mask ctrl-alt-del via
+* "systemctl mask ctrl-alt-del.target". As a result, the pv reboot
+* would not work. To enable C_A_D would force the reboot.
+*/
+   C_A_D = 1;
+
ctrl_alt_del();
 }
 
-- 
2.17.1




Re: Live migration and PV device handling

2020-04-03 Thread Dongli Zhang
Hi Andrew,

On 4/3/20 5:42 AM, Andrew Cooper wrote:
> On 03/04/2020 13:32, Anastassios Nanos wrote:
>> Hi all,
>>
>> I am trying to understand how live-migration happens in xen. I am
>> looking in the HVM guest case and I have dug into the relevant parts
>> of the toolstack and the hypervisor regarding memory, vCPU context
>> etc.
>>
>> In particular, I am interested in how PV device migration happens. I
>> assume that the guest is not aware of any suspend/resume operations
>> being done
> 
> Sadly, this assumption is not correct.  HVM guests with PV drivers
> currently have to be aware in exactly the same way as PV guests.
> 
> Work is in progress to try and address this.  See
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=775a02452ddf3a6889690de90b1a94eb29c3c732
> (sorry - for some reason that doc isn't being rendered properly in
> https://xenbits.xen.org/docs/ )
> 

I read below from the commit:

+* The toolstack choose a randomized domid for initial creation or default
+migration, but preserve the source domid non-cooperative migration.
+Non-Cooperative migration will have to be denied if the domid is
+unavailable on the target host, but randomization of domid on creation
+should hopefully minimize the likelihood of this. Non-Cooperative migration
+to localhost will clearly not be possible.

Does that indicate that, while the scope of domid_t is a single server in the
old design, the scope of domid_t is a cluster of servers in the new design?

That is, should the domid be unique across the cluster of all servers if we
expect non-cooperative migration to always succeed?

Thank you very much!

Dongli Zhang



[Xen-devel] [PATCH v3 2/2] xenbus: req->err should be updated before req->state

2020-03-03 Thread Dongli Zhang
This patch adds the barrier to guarantee that req->err is always updated
before req->state.

Otherwise, read_reply() would not return ERR_PTR(req->err) but
req->body, when process_writes()->xb_write() fails.

Signed-off-by: Dongli Zhang 
---
 drivers/xen/xenbus/xenbus_comms.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
index 852ed161fc2a..eb5151fc8efa 100644
--- a/drivers/xen/xenbus/xenbus_comms.c
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -397,6 +397,8 @@ static int process_writes(void)
if (state.req->state == xb_req_state_aborted)
kfree(state.req);
else {
+   /* write err, then update state */
+   virt_wmb();
state.req->state = xb_req_state_got_reply;
wake_up(&state.req->wq);
}
-- 
2.17.1



[Xen-devel] [PATCH v3 1/2] xenbus: req->body should be updated before req->state

2020-03-03 Thread Dongli Zhang
The req->body should be updated before req->state is updated and the
order should be guaranteed by a barrier.

Otherwise, read_reply() might return req->body = NULL.

Below is a sample call stack when the issue is reproduced on purpose by
reordering the updates of req->body and req->state and adding a delay in
the code between the updates of req->state and req->body.

[   22.356105] general protection fault:  [#1] SMP PTI
[   22.361185] CPU: 2 PID: 52 Comm: xenwatch Not tainted 5.5.0xen+ #6
[   22.366727] Hardware name: Xen HVM domU, BIOS ...
[   22.372245] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60
... ...
[   22.392163] RSP: 0018:b2d64023fdf0 EFLAGS: 00010246
[   22.395933] RAX:  RBX: 75746e7562755f6d RCX: 
[   22.400871] RDX:  RSI: b2d64023fdfc RDI: 75746e7562755f6d
[   22.405874] RBP:  R08: 01e8 R09: 00cdcdcd
[   22.410945] R10: b2d6402ffe00 R11: 9d95395eaeb0 R12: 9d9535935000
[   22.417613] R13: 9d9526d4a000 R14: 9d9526f4f340 R15: 9d9537654000
[   22.423726] FS:  () GS:9d953bc8() 
knlGS:
[   22.429898] CS:  0010 DS:  ES:  CR0: 80050033
[   22.434342] CR2: 00c4206a9000 CR3: 0001ea3fc002 CR4: 001606e0
[   22.439645] DR0:  DR1:  DR2: 
[   22.444941] DR3:  DR6: fffe0ff0 DR7: 0400
[   22.450342] Call Trace:
[   22.452509]  simple_strtoull+0x27/0x70
[   22.455572]  xenbus_transaction_start+0x31/0x50
[   22.459104]  netback_changed+0x76c/0xcc1 [xen_netfront]
[   22.463279]  ? find_watch+0x40/0x40
[   22.466156]  xenwatch_thread+0xb4/0x150
[   22.469309]  ? wait_woken+0x80/0x80
[   22.472198]  kthread+0x10e/0x130
[   22.474925]  ? kthread_park+0x80/0x80
[   22.477946]  ret_from_fork+0x35/0x40
[   22.480968] Modules linked in: xen_kbdfront xen_fbfront(+) xen_netfront 
xen_blkfront
[   22.486783] ---[ end trace a9222030a747c3f7 ]---
[   22.490424] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60

The virt_rmb() is added in the 'true' path of test_reply(). The "while"
is changed to "do while" so that test_reply() is used as a read memory
barrier.
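
The pairing can be summarized by the sketch below (simplified, for
illustration only; the real code lives in process_msg()/read_reply()):

        /* producer, e.g. process_msg() */
        req->body = state.body;
        virt_wmb();                     /* publish body before state */
        req->state = xb_req_state_got_reply;

        /* consumer, e.g. read_reply() via test_reply() */
        if (req->state == xb_req_state_got_reply) {
                virt_rmb();             /* read state before body */
                /* req->body is now safe to dereference */
        }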

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - change "barrier()" to "virt_rmb()" in test_reply()
Changed since v2:
  - Use "virt_rmb()" only in 'true' path

 drivers/xen/xenbus/xenbus_comms.c | 2 ++
 drivers/xen/xenbus/xenbus_xs.c| 9 ++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
index d239fc3c5e3d..852ed161fc2a 100644
--- a/drivers/xen/xenbus/xenbus_comms.c
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -313,6 +313,8 @@ static int process_msg(void)
req->msg.type = state.msg.type;
req->msg.len = state.msg.len;
req->body = state.body;
+   /* write body, then update state */
+   virt_wmb();
req->state = xb_req_state_got_reply;
req->cb(req);
} else
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index ddc18da61834..3a06eb699f33 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -191,8 +191,11 @@ static bool xenbus_ok(void)
 
 static bool test_reply(struct xb_req_data *req)
 {
-   if (req->state == xb_req_state_got_reply || !xenbus_ok())
+   if (req->state == xb_req_state_got_reply || !xenbus_ok()) {
+   /* read req->state before all other fields */
+   virt_rmb();
return true;
+   }
 
/* Make sure to reread req->state each time. */
barrier();
@@ -202,7 +205,7 @@ static bool test_reply(struct xb_req_data *req)
 
 static void *read_reply(struct xb_req_data *req)
 {
-   while (req->state != xb_req_state_got_reply) {
+   do {
wait_event(req->wq, test_reply(req));
 
if (!xenbus_ok())
@@ -216,7 +219,7 @@ static void *read_reply(struct xb_req_data *req)
if (req->err)
return ERR_PTR(req->err);
 
-   }
+   } while (req->state != xb_req_state_got_reply);
 
return req->body;
 }
-- 
2.17.1



Re: [Xen-devel] [PATCH v2 1/2] xenbus: req->body should be updated before req->state

2020-03-03 Thread dongli . zhang


On 3/3/20 11:37 AM, Julien Grall wrote:
> Hi,
> 
> On 03/03/2020 18:47, Dongli Zhang wrote:
>> The req->body should be updated before req->state is updated and the
>> order should be guaranteed by a barrier.
>>
>> Otherwise, read_reply() might return req->body = NULL.
>>
>> Below is sample callstack when the issue is reproduced on purpose by
>> reordering the updates of req->body and req->state and adding delay in
>> code between updates of req->state and req->body.
>>
>> [   22.356105] general protection fault:  [#1] SMP PTI
>> [   22.361185] CPU: 2 PID: 52 Comm: xenwatch Not tainted 5.5.0xen+ #6
>> [   22.366727] Hardware name: Xen HVM domU, BIOS ...
>> [   22.372245] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60
>> ... ...
>> [   22.392163] RSP: 0018:b2d64023fdf0 EFLAGS: 00010246
>> [   22.395933] RAX:  RBX: 75746e7562755f6d RCX: 
>> 
>> [   22.400871] RDX:  RSI: b2d64023fdfc RDI: 
>> 75746e7562755f6d
>> [   22.405874] RBP:  R08: 01e8 R09: 
>> 00cdcdcd
>> [   22.410945] R10: b2d6402ffe00 R11: 9d95395eaeb0 R12: 
>> 9d9535935000
>> [   22.417613] R13: 9d9526d4a000 R14: 9d9526f4f340 R15: 
>> 9d9537654000
>> [   22.423726] FS:  () GS:9d953bc8()
>> knlGS:
>> [   22.429898] CS:  0010 DS:  ES:  CR0: 80050033
>> [   22.434342] CR2: 00c4206a9000 CR3: 0001ea3fc002 CR4: 
>> 001606e0
>> [   22.439645] DR0:  DR1:  DR2: 
>> 
>> [   22.444941] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [   22.450342] Call Trace:
>> [   22.452509]  simple_strtoull+0x27/0x70
>> [   22.455572]  xenbus_transaction_start+0x31/0x50
>> [   22.459104]  netback_changed+0x76c/0xcc1 [xen_netfront]
>> [   22.463279]  ? find_watch+0x40/0x40
>> [   22.466156]  xenwatch_thread+0xb4/0x150
>> [   22.469309]  ? wait_woken+0x80/0x80
>> [   22.472198]  kthread+0x10e/0x130
>> [   22.474925]  ? kthread_park+0x80/0x80
>> [   22.477946]  ret_from_fork+0x35/0x40
>> [   22.480968] Modules linked in: xen_kbdfront xen_fbfront(+) xen_netfront
>> xen_blkfront
>> [   22.486783] ---[ end trace a9222030a747c3f7 ]---
>> [   22.490424] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60
>>
>> The barrier() in test_reply() is changed to virt_rmb(). The "while" is
>> changed to "do while" so that test_reply() is used as a read memory
>> barrier.
>>
>> Signed-off-by: Dongli Zhang 
>> ---
>> Changed since v1:
>>    - change "barrier()" to "virt_rmb()" in test_reply()
>>
>>   drivers/xen/xenbus/xenbus_comms.c |  2 ++
>>   drivers/xen/xenbus/xenbus_xs.c    | 11 +++
>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/xen/xenbus/xenbus_comms.c
>> b/drivers/xen/xenbus/xenbus_comms.c
>> index d239fc3c5e3d..852ed161fc2a 100644
>> --- a/drivers/xen/xenbus/xenbus_comms.c
>> +++ b/drivers/xen/xenbus/xenbus_comms.c
>> @@ -313,6 +313,8 @@ static int process_msg(void)
>>   req->msg.type = state.msg.type;
>>   req->msg.len = state.msg.len;
>>   req->body = state.body;
>> +    /* write body, then update state */
>> +    virt_wmb();
>>   req->state = xb_req_state_got_reply;
>>   req->cb(req);
>>   } else
>> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
>> index ddc18da61834..1e14c2118861 100644
>> --- a/drivers/xen/xenbus/xenbus_xs.c
>> +++ b/drivers/xen/xenbus/xenbus_xs.c
>> @@ -194,15 +194,18 @@ static bool test_reply(struct xb_req_data *req)
>>   if (req->state == xb_req_state_got_reply || !xenbus_ok())
>>   return true;
>>   -    /* Make sure to reread req->state each time. */
>> -    barrier();
>> +    /*
>> + * read req->state before other fields of struct xb_req_data
>> + * in the caller of test_reply(), e.g., read_reply()
>> + */
>> +    virt_rmb();
> 
> Looking at the code again, I am afraid the barrier only happens in the false
> case. Should not the new barrier be added in the 'true' case?

I would leave the original "barrier()" in the 'false' case and add the new
barrier only in the 'true' case.

Thank you very much!

Dongli Zhang


[Xen-devel] [PATCH v2 2/2] xenbus: req->err should be updated before req->state

2020-03-03 Thread Dongli Zhang
This patch adds the barrier to guarantee that req->err is always updated
before req->state.

Otherwise, read_reply() would not return ERR_PTR(req->err) but
req->body, when process_writes()->xb_write() fails.

Signed-off-by: Dongli Zhang 
---
 drivers/xen/xenbus/xenbus_comms.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
index 852ed161fc2a..eb5151fc8efa 100644
--- a/drivers/xen/xenbus/xenbus_comms.c
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -397,6 +397,8 @@ static int process_writes(void)
if (state.req->state == xb_req_state_aborted)
kfree(state.req);
else {
+   /* write err, then update state */
+   virt_wmb();
state.req->state = xb_req_state_got_reply;
wake_up(&state.req->wq);
}
-- 
2.17.1



[Xen-devel] [PATCH v2 1/2] xenbus: req->body should be updated before req->state

2020-03-03 Thread Dongli Zhang
The req->body should be updated before req->state is updated and the
order should be guaranteed by a barrier.

Otherwise, read_reply() might return req->body = NULL.

Below is a sample call stack when the issue is reproduced on purpose by
reordering the updates of req->body and req->state and adding a delay in
the code between the updates of req->state and req->body.

[   22.356105] general protection fault:  [#1] SMP PTI
[   22.361185] CPU: 2 PID: 52 Comm: xenwatch Not tainted 5.5.0xen+ #6
[   22.366727] Hardware name: Xen HVM domU, BIOS ...
[   22.372245] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60
... ...
[   22.392163] RSP: 0018:b2d64023fdf0 EFLAGS: 00010246
[   22.395933] RAX:  RBX: 75746e7562755f6d RCX: 
[   22.400871] RDX:  RSI: b2d64023fdfc RDI: 75746e7562755f6d
[   22.405874] RBP:  R08: 01e8 R09: 00cdcdcd
[   22.410945] R10: b2d6402ffe00 R11: 9d95395eaeb0 R12: 9d9535935000
[   22.417613] R13: 9d9526d4a000 R14: 9d9526f4f340 R15: 9d9537654000
[   22.423726] FS:  () GS:9d953bc8() 
knlGS:
[   22.429898] CS:  0010 DS:  ES:  CR0: 80050033
[   22.434342] CR2: 00c4206a9000 CR3: 0001ea3fc002 CR4: 001606e0
[   22.439645] DR0:  DR1:  DR2: 
[   22.444941] DR3:  DR6: fffe0ff0 DR7: 0400
[   22.450342] Call Trace:
[   22.452509]  simple_strtoull+0x27/0x70
[   22.455572]  xenbus_transaction_start+0x31/0x50
[   22.459104]  netback_changed+0x76c/0xcc1 [xen_netfront]
[   22.463279]  ? find_watch+0x40/0x40
[   22.466156]  xenwatch_thread+0xb4/0x150
[   22.469309]  ? wait_woken+0x80/0x80
[   22.472198]  kthread+0x10e/0x130
[   22.474925]  ? kthread_park+0x80/0x80
[   22.477946]  ret_from_fork+0x35/0x40
[   22.480968] Modules linked in: xen_kbdfront xen_fbfront(+) xen_netfront 
xen_blkfront
[   22.486783] ---[ end trace a9222030a747c3f7 ]---
[   22.490424] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60

The barrier() in test_reply() is changed to virt_rmb(). The "while" is
changed to "do while" so that test_reply() is used as a read memory
barrier.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - change "barrier()" to "virt_rmb()" in test_reply()

 drivers/xen/xenbus/xenbus_comms.c |  2 ++
 drivers/xen/xenbus/xenbus_xs.c| 11 +++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
index d239fc3c5e3d..852ed161fc2a 100644
--- a/drivers/xen/xenbus/xenbus_comms.c
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -313,6 +313,8 @@ static int process_msg(void)
req->msg.type = state.msg.type;
req->msg.len = state.msg.len;
req->body = state.body;
+   /* write body, then update state */
+   virt_wmb();
req->state = xb_req_state_got_reply;
req->cb(req);
} else
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index ddc18da61834..1e14c2118861 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -194,15 +194,18 @@ static bool test_reply(struct xb_req_data *req)
if (req->state == xb_req_state_got_reply || !xenbus_ok())
return true;
 
-   /* Make sure to reread req->state each time. */
-   barrier();
+   /*
+* read req->state before other fields of struct xb_req_data
+* in the caller of test_reply(), e.g., read_reply()
+*/
+   virt_rmb();
 
return false;
 }
 
 static void *read_reply(struct xb_req_data *req)
 {
-   while (req->state != xb_req_state_got_reply) {
+   do {
wait_event(req->wq, test_reply(req));
 
if (!xenbus_ok())
@@ -216,7 +219,7 @@ static void *read_reply(struct xb_req_data *req)
if (req->err)
return ERR_PTR(req->err);
 
-   }
+   } while (req->state != xb_req_state_got_reply);
 
return req->body;
 }
-- 
2.17.1



Re: [Xen-devel] [PATCH 1/2] xenbus: req->body should be updated before req->state

2020-03-03 Thread dongli . zhang


On 3/3/20 1:40 AM, Julien Grall wrote:
> Hi,
> 
> On 03/03/2020 01:58, Dongli Zhang wrote:
>> The req->body should be updated before req->state is updated and the
>> order should be guaranteed by a barrier.
>>
>> Otherwise, read_reply() might return req->body = NULL.
>>
>> Below is sample callstack when the issue is reproduced on purpose by
>> reordering the updates of req->body and req->state and adding delay in
>> code between updates of req->state and req->body.
>>
>> [   22.356105] general protection fault:  [#1] SMP PTI
>> [   22.361185] CPU: 2 PID: 52 Comm: xenwatch Not tainted 5.5.0xen+ #6
>> [   22.366727] Hardware name: Xen HVM domU, BIOS ...
>> [   22.372245] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60
>> ... ...
>> [   22.392163] RSP: 0018:b2d64023fdf0 EFLAGS: 00010246
>> [   22.395933] RAX:  RBX: 75746e7562755f6d RCX: 
>> 
>> [   22.400871] RDX:  RSI: b2d64023fdfc RDI: 
>> 75746e7562755f6d
>> [   22.405874] RBP:  R08: 01e8 R09: 
>> 00cdcdcd
>> [   22.410945] R10: b2d6402ffe00 R11: 9d95395eaeb0 R12: 
>> 9d9535935000
>> [   22.417613] R13: 9d9526d4a000 R14: 9d9526f4f340 R15: 
>> 9d9537654000
>> [   22.423726] FS:  () GS:9d953bc8()
>> knlGS:
>> [   22.429898] CS:  0010 DS:  ES:  CR0: 80050033
>> [   22.434342] CR2: 00c4206a9000 CR3: 0001ea3fc002 CR4: 
>> 001606e0
>> [   22.439645] DR0:  DR1:  DR2: 
>> 
>> [   22.444941] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [   22.450342] Call Trace:
>> [   22.452509]  simple_strtoull+0x27/0x70
>> [   22.455572]  xenbus_transaction_start+0x31/0x50
>> [   22.459104]  netback_changed+0x76c/0xcc1 [xen_netfront]
>> [   22.463279]  ? find_watch+0x40/0x40
>> [   22.466156]  xenwatch_thread+0xb4/0x150
>> [   22.469309]  ? wait_woken+0x80/0x80
>> [   22.472198]  kthread+0x10e/0x130
>> [   22.474925]  ? kthread_park+0x80/0x80
>> [   22.477946]  ret_from_fork+0x35/0x40
>> [   22.480968] Modules linked in: xen_kbdfront xen_fbfront(+) xen_netfront
>> xen_blkfront
>> [   22.486783] ---[ end trace a9222030a747c3f7 ]---
>> [   22.490424] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60
>>
>> The "while" is changed to "do while" so that wait_event() is used as a
>> barrier.
> 
> The correct barrier for read_reply() should be virt_rmb(). While on x86, this 
> is
> equivalent to barrier(), on Arm this will be a dmb(ish) to prevent the 
> processor
> re-ordering memory access.
> 
> Therefore the barrier in test_reply() (called by wait_event()) is not going to
> be sufficient for Arm.

Sorry that I just erroneously thought wait_event() would be used as a read
barrier.

I would change barrier() to virt_rmb() for read_reply() in v2.

Thank you very much!

Dongli Zhang


[Xen-devel] [PATCH 2/2] xenbus: req->err should be updated before req->state

2020-03-02 Thread Dongli Zhang
This patch adds the barrier to guarantee that req->err is always updated
before req->state.

Otherwise, read_reply() would not return ERR_PTR(req->err) but
req->body, when process_writes()->xb_write() fails.

Signed-off-by: Dongli Zhang 
---
 drivers/xen/xenbus/xenbus_comms.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
index 852ed161fc2a..eb5151fc8efa 100644
--- a/drivers/xen/xenbus/xenbus_comms.c
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -397,6 +397,8 @@ static int process_writes(void)
if (state.req->state == xb_req_state_aborted)
kfree(state.req);
else {
+   /* write err, then update state */
+   virt_wmb();
state.req->state = xb_req_state_got_reply;
wake_up(&state.req->wq);
}
-- 
2.17.1



[Xen-devel] [PATCH 1/2] xenbus: req->body should be updated before req->state

2020-03-02 Thread Dongli Zhang
The req->body should be updated before req->state is updated and the
order should be guaranteed by a barrier.

Otherwise, read_reply() might return req->body = NULL.

Below is a sample call stack when the issue is reproduced on purpose by
reordering the updates of req->body and req->state and adding a delay in
the code between the updates of req->state and req->body.

[   22.356105] general protection fault:  [#1] SMP PTI
[   22.361185] CPU: 2 PID: 52 Comm: xenwatch Not tainted 5.5.0xen+ #6
[   22.366727] Hardware name: Xen HVM domU, BIOS ...
[   22.372245] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60
... ...
[   22.392163] RSP: 0018:b2d64023fdf0 EFLAGS: 00010246
[   22.395933] RAX:  RBX: 75746e7562755f6d RCX: 
[   22.400871] RDX:  RSI: b2d64023fdfc RDI: 75746e7562755f6d
[   22.405874] RBP:  R08: 01e8 R09: 00cdcdcd
[   22.410945] R10: b2d6402ffe00 R11: 9d95395eaeb0 R12: 9d9535935000
[   22.417613] R13: 9d9526d4a000 R14: 9d9526f4f340 R15: 9d9537654000
[   22.423726] FS:  () GS:9d953bc8() 
knlGS:
[   22.429898] CS:  0010 DS:  ES:  CR0: 80050033
[   22.434342] CR2: 00c4206a9000 CR3: 0001ea3fc002 CR4: 001606e0
[   22.439645] DR0:  DR1:  DR2: 
[   22.444941] DR3:  DR6: fffe0ff0 DR7: 0400
[   22.450342] Call Trace:
[   22.452509]  simple_strtoull+0x27/0x70
[   22.455572]  xenbus_transaction_start+0x31/0x50
[   22.459104]  netback_changed+0x76c/0xcc1 [xen_netfront]
[   22.463279]  ? find_watch+0x40/0x40
[   22.466156]  xenwatch_thread+0xb4/0x150
[   22.469309]  ? wait_woken+0x80/0x80
[   22.472198]  kthread+0x10e/0x130
[   22.474925]  ? kthread_park+0x80/0x80
[   22.477946]  ret_from_fork+0x35/0x40
[   22.480968] Modules linked in: xen_kbdfront xen_fbfront(+) xen_netfront 
xen_blkfront
[   22.486783] ---[ end trace a9222030a747c3f7 ]---
[   22.490424] RIP: 0010:_parse_integer_fixup_radix+0x6/0x60

The "while" is changed to "do while" so that wait_event() is used as a
barrier.

Signed-off-by: Dongli Zhang 
---
 drivers/xen/xenbus/xenbus_comms.c | 2 ++
 drivers/xen/xenbus/xenbus_xs.c| 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
index d239fc3c5e3d..852ed161fc2a 100644
--- a/drivers/xen/xenbus/xenbus_comms.c
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -313,6 +313,8 @@ static int process_msg(void)
req->msg.type = state.msg.type;
req->msg.len = state.msg.len;
req->body = state.body;
+   /* write body, then update state */
+   virt_wmb();
req->state = xb_req_state_got_reply;
req->cb(req);
} else
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index ddc18da61834..f5b0a6a72ad3 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -202,7 +202,7 @@ static bool test_reply(struct xb_req_data *req)
 
 static void *read_reply(struct xb_req_data *req)
 {
-   while (req->state != xb_req_state_got_reply) {
+   do {
wait_event(req->wq, test_reply(req));
 
if (!xenbus_ok())
@@ -216,7 +216,7 @@ static void *read_reply(struct xb_req_data *req)
if (req->err)
return ERR_PTR(req->err);
 
-   }
+   } while (req->state != xb_req_state_got_reply);
 
return req->body;
 }
-- 
2.17.1



[Xen-devel] [PATCH v2 1/1] xen-netfront: do not use ~0U as error return value for xennet_fill_frags()

2019-10-01 Thread Dongli Zhang
xennet_fill_frags() uses ~0U as return value when the sk_buff is not able
to cache extra fragments. This is incorrect because the return type of
xennet_fill_frags() is RING_IDX and 0xffffffff is an expected value for
ring buffer index.

In the situation when the rsp_cons is approaching 0xffffffff, the return
value of xennet_fill_frags() may become 0xffffffff which xennet_poll() (the
caller) would regard as error. As a result, queue->rx.rsp_cons is set
incorrectly because it is updated only when there is error. If there is no
error, xennet_poll() would be responsible to update queue->rx.rsp_cons.
Finally, queue->rx.rsp_cons would point to the rx ring buffer entries whose
queue->rx_skbs[i] and queue->grant_rx_ref[i] are already cleared to NULL.
This leads to NULL pointer access in the next iteration to process rx ring
buffer entries.
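
To see why 0xffffffff cannot double as an error code, consider the sketch
below (the counter values are hypothetical; RING_IDX is a free-running
32-bit index in the Xen ring protocol):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t RING_IDX;

int main(void)
{
	RING_IDX rsp_cons = 0xfffffffeU;  /* consumer index near the wrap */

	rsp_cons++;  /* 0xffffffff: a legitimate ring position */
	printf("cons = 0x%x (cons == ~0U: %d)\n",
	       (unsigned)rsp_cons, rsp_cons == ~0U);

	rsp_cons++;  /* wraps cleanly to 0, by design */
	printf("cons = 0x%x\n", (unsigned)rsp_cons);

	return 0;
}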

The symptom is similar to the one fixed in
commit 00b368502d18 ("xen-netfront: do not assume sk_buff_head list is
empty in error handling").

This patch changes the return type of xennet_fill_frags() to indicate
whether it succeeded or failed. queue->rx.rsp_cons is now always updated
inside this function.

Fixes: ad4f15dc2c70 ("xen/netfront: don't bug in case of too many frags")
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - Always update queue->rx.rsp_cons inside xennet_fill_frags() so we do
    not need to add an extra argument to xennet_fill_frags().

 drivers/net/xen-netfront.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e14ec75..482c6c8 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -887,9 +887,9 @@ static int xennet_set_skb_gso(struct sk_buff *skb,
return 0;
 }
 
-static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
- struct sk_buff *skb,
- struct sk_buff_head *list)
+static int xennet_fill_frags(struct netfront_queue *queue,
+struct sk_buff *skb,
+struct sk_buff_head *list)
 {
RING_IDX cons = queue->rx.rsp_cons;
struct sk_buff *nskb;
@@ -908,7 +908,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue 
*queue,
if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
queue->rx.rsp_cons = ++cons + skb_queue_len(list);
kfree_skb(nskb);
-   return ~0U;
+   return -ENOENT;
}
 
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
@@ -919,7 +919,9 @@ static RING_IDX xennet_fill_frags(struct netfront_queue 
*queue,
kfree_skb(nskb);
}
 
-   return cons;
+   queue->rx.rsp_cons = cons;
+
+   return 0;
 }
 
 static int checksum_setup(struct net_device *dev, struct sk_buff *skb)
@@ -1045,8 +1047,7 @@ static int xennet_poll(struct napi_struct *napi, int 
budget)
skb->data_len = rx->status;
skb->len += rx->status;
 
-   i = xennet_fill_frags(queue, skb, &tmpq);
-   if (unlikely(i == ~0U))
+   if (unlikely(xennet_fill_frags(queue, skb, &tmpq)))
goto err;
 
if (rx->flags & XEN_NETRXF_csum_blank)
@@ -1056,7 +1057,7 @@ static int xennet_poll(struct napi_struct *napi, int 
budget)
 
__skb_queue_tail(&rxq, skb);
 
-   queue->rx.rsp_cons = ++i;
+   i = ++queue->rx.rsp_cons;
work_done++;
}
 
-- 
2.7.4



[Xen-devel] [PATCH 1/1] xen-netfront: do not use ~0U as error return value for xennet_fill_frags()

2019-09-30 Thread Dongli Zhang
xennet_fill_frags() uses ~0U as its return value when the sk_buff is not
able to cache extra fragments. This is incorrect because the return type of
xennet_fill_frags() is RING_IDX and 0xffffffff is an expected value for a
ring buffer index.

In the situation when rsp_cons is approaching 0xffffffff, the return value
of xennet_fill_frags() may become 0xffffffff, which xennet_poll() (the
caller) would regard as an error. As a result, queue->rx.rsp_cons is set
incorrectly, because it is updated only when there is an error. If there is
no error, xennet_poll() is responsible for updating queue->rx.rsp_cons.
Finally, queue->rx.rsp_cons would point to rx ring buffer entries whose
queue->rx_skbs[i] and queue->grant_rx_ref[i] are already cleared to NULL.
This leads to a NULL pointer access in the next iteration that processes rx
ring buffer entries.

The symptom is similar to the one fixed in
commit 00b368502d18 ("xen-netfront: do not assume sk_buff_head list is
empty in error handling").

This patch uses an extra argument to report whether there is an error in
xennet_fill_frags().

Fixes: ad4f15dc2c70 ("xen/netfront: don't bug in case of too many frags")
Signed-off-by: Dongli Zhang 
---
 drivers/net/xen-netfront.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e14ec75..c2a1e09 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -889,11 +889,14 @@ static int xennet_set_skb_gso(struct sk_buff *skb,
 
 static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
  struct sk_buff *skb,
- struct sk_buff_head *list)
+ struct sk_buff_head *list,
+ int *errno)
 {
RING_IDX cons = queue->rx.rsp_cons;
struct sk_buff *nskb;
 
+   *errno = 0;
+
while ((nskb = __skb_dequeue(list))) {
struct xen_netif_rx_response *rx =
RING_GET_RESPONSE(&queue->rx, ++cons);
@@ -908,6 +911,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue 
*queue,
if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
queue->rx.rsp_cons = ++cons + skb_queue_len(list);
kfree_skb(nskb);
+   *errno = -ENOENT;
return ~0U;
}
 
@@ -1009,6 +1013,8 @@ static int xennet_poll(struct napi_struct *napi, int 
budget)
i = queue->rx.rsp_cons;
work_done = 0;
while ((i != rp) && (work_done < budget)) {
+   int errno;
+
memcpy(rx, RING_GET_RESPONSE(>rx, i), sizeof(*rx));
memset(extras, 0, sizeof(rinfo.extras));
 
@@ -1045,8 +1051,8 @@ static int xennet_poll(struct napi_struct *napi, int 
budget)
skb->data_len = rx->status;
skb->len += rx->status;
 
-   i = xennet_fill_frags(queue, skb, &tmpq);
-   if (unlikely(i == ~0U))
+   i = xennet_fill_frags(queue, skb, &tmpq, &errno);
+   if (unlikely(errno == -ENOENT))
goto err;
 
if (rx->flags & XEN_NETRXF_csum_blank)
-- 
2.7.4



[Xen-devel] [PATCH 1/1] xen-netfront: do not assume sk_buff_head list is empty in error handling

2019-09-15 Thread Dongli Zhang
When skb_shinfo(skb) is not able to cache an extra fragment (that is,
skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS), xennet_fill_frags() assumes
the sk_buff_head list is already empty. As a result, cons is increased only
by 1 and control returns to the error handling path in xennet_poll().

However, if the sk_buff_head list is not empty, queue->rx.rsp_cons may be
set incorrectly. That is, queue->rx.rsp_cons would point to the rx ring
buffer entries whose queue->rx_skbs[i] and queue->grant_rx_ref[i] are
already cleared to NULL. This leads to NULL pointer access in the next
iteration to process rx ring buffer entries.

Below is how xennet_poll() does its error handling. All remaining entries
in tmpq are accounted to queue->rx.rsp_cons without assuming how many
outstanding skbs remain in the list.

 985 static int xennet_poll(struct napi_struct *napi, int budget)
... ...
1032   if (unlikely(xennet_set_skb_gso(skb, gso))) {
1033   __skb_queue_head(&tmpq, skb);
1034   queue->rx.rsp_cons += skb_queue_len(&tmpq);
1035   goto err;
1036   }

It is better to always handle this error in the same way.

Fixes: ad4f15dc2c70 ("xen/netfront: don't bug in case of too many frags")
Signed-off-by: Dongli Zhang 
---
 drivers/net/xen-netfront.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 8d33970..5f5722b 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -906,7 +906,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue 
*queue,
__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
}
if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
-   queue->rx.rsp_cons = ++cons;
+   queue->rx.rsp_cons = ++cons + skb_queue_len(list);
kfree_skb(nskb);
return ~0U;
}
-- 
2.7.4



Re: [Xen-devel] Question on xen-netfront code to fix a potential ring buffer corruption

2019-08-27 Thread Dongli Zhang
Hi Juergen,

On 8/27/19 2:13 PM, Juergen Gross wrote:
> On 18.08.19 10:31, Dongli Zhang wrote:
>> Hi,
>>
>> Would you please help confirm why the condition at line 908 is ">="?
>>
>> In my opinion, we would only hit "skb_shinfo(skb)->nr_frag == MAX_SKB_FRAGS" 
>> at
>> line 908.
>>
>> 890 static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
>> 891   struct sk_buff *skb,
>> 892   struct sk_buff_head *list)
>> 893 {
>> 894 RING_IDX cons = queue->rx.rsp_cons;
>> 895 struct sk_buff *nskb;
>> 896
>> 897 while ((nskb = __skb_dequeue(list))) {
>> 898 struct xen_netif_rx_response *rx =
>> 899 RING_GET_RESPONSE(&queue->rx, ++cons);
>> 900 skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>> 901
>> 902 if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS) {
>> 903 unsigned int pull_to = 
>> NETFRONT_SKB_CB(skb)->pull_to;
>> 904
>> 905 BUG_ON(pull_to < skb_headlen(skb));
>> 906 __pskb_pull_tail(skb, pull_to - 
>> skb_headlen(skb));
>> 907 }
>> 908 if (unlikely(skb_shinfo(skb)->nr_frags >= 
>> MAX_SKB_FRAGS)) {
>> 909 queue->rx.rsp_cons = ++cons;
>> 910 kfree_skb(nskb);
>> 911 return ~0U;
>> 912 }
>> 913
>> 914 skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
>> 915 skb_frag_page(nfrag),
>> 916 rx->offset, rx->status, PAGE_SIZE);
>> 917
>> 918 skb_shinfo(nskb)->nr_frags = 0;
>> 919 kfree_skb(nskb);
>> 920 }
>> 921
>> 922 return cons;
>> 923 }
>>
>>
>> The reason that I ask about this is because I am considering below patch to
>> avoid a potential xen-netfront ring buffer corruption.
>>
>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index 8d33970..48a2162 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -906,7 +906,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue
>> *queue,
>>  __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>>  }
>>  if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
>> -   queue->rx.rsp_cons = ++cons;
>> +   queue->rx.rsp_cons = cons + skb_queue_len(list) + 1;
>>  kfree_skb(nskb);
>>  return ~0U;
>>  }
>>
>>
>> If there is skb left in list when we return ~0U, queue->rx.rsp_cons may be 
>> set
>> incorrectly.
> 
> So basically you want to consume the responses for all outstanding skbs
> in the list?
> 

I think there would be a bug if there are skbs left in the list.

This is what is implemented in xennet_poll() when there is an error while
processing the extra info, as shown below. At line 1034, if there is an
error, all outstanding skbs are consumed.

 985 static int xennet_poll(struct napi_struct *napi, int budget)
... ...
1028 if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
1029 struct xen_netif_extra_info *gso;
1030 gso = &extras[XEN_NETIF_EXTRA_TYPE_GSO - 1];
1031
1032 if (unlikely(xennet_set_skb_gso(skb, gso))) {
1033 __skb_queue_head(&tmpq, skb);
1034 queue->rx.rsp_cons += skb_queue_len(&tmpq);
1035 goto err;
1036 }
1037 }

The reason we need to consume all outstanding skbs is that
xennet_get_responses() resets both queue->rx_skbs[i] and
queue->grant_rx_ref[i] to NULL before it enqueues all outstanding skbs to
the list (e.g., &tmpq), via xennet_get_rx_skb() and xennet_get_rx_ref().

If we do not consume all of them, we would hit a NULL queue->rx_skbs[i] in
the next iteration of the while loop in xennet_poll().

That is, if we do not consume all outstanding skbs, queue->rx.rsp_cons may
refer to a response whose queue->rx_skbs[i] and queue->grant_rx_ref[i] are
already reset to NULL.
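
As a toy illustration of the accounting (the index values are made up),
the error path must advance rsp_cons past every response whose skb was
already moved onto the list:

#include <stdio.h>

int main(void)
{
	unsigned int cons = 100;  /* response being processed at the error */
	unsigned int left = 3;    /* skbs still queued on the list; their
				   * rx_skbs[]/grant_rx_ref[] slots are NULL */

	/* buggy: only the current response is consumed */
	printf("buggy  : rsp_cons = %u\n", cons + 1);         /* 101 */

	/* fixed: also skip the responses already dequeued into the list */
	printf("correct: rsp_cons = %u\n", cons + 1 + left);  /* 104 */

	return 0;
}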

Dongli Zhang

>>
>> While I am trying to make up a case that would hit the corruption, I could 
>> not
>> explain why (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SK

[Xen-devel] Question on xen-netfront code to fix a potential ring buffer corruption

2019-08-18 Thread Dongli Zhang
Hi,

Would you please help confirm why the condition at line 908 is ">="?

In my opinion, we would only hit "skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS"
at line 908.

890 static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
891   struct sk_buff *skb,
892   struct sk_buff_head *list)
893 {
894 RING_IDX cons = queue->rx.rsp_cons;
895 struct sk_buff *nskb;
896 
897 while ((nskb = __skb_dequeue(list))) {
898 struct xen_netif_rx_response *rx =
899 RING_GET_RESPONSE(&queue->rx, ++cons);
900 skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
901 
902 if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS) {
903 unsigned int pull_to = 
NETFRONT_SKB_CB(skb)->pull_to;
904 
905 BUG_ON(pull_to < skb_headlen(skb));
906 __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
907 }
908 if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
909 queue->rx.rsp_cons = ++cons;
910 kfree_skb(nskb);
911 return ~0U;
912 }
913 
914 skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
915 skb_frag_page(nfrag),
916 rx->offset, rx->status, PAGE_SIZE);
917 
918 skb_shinfo(nskb)->nr_frags = 0;
919 kfree_skb(nskb);
920 }
921 
922 return cons;
923 }


The reason I ask about this is that I am considering the patch below to
avoid a potential xen-netfront ring buffer corruption.

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 8d33970..48a2162 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -906,7 +906,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue 
*queue,
__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
}
if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
-   queue->rx.rsp_cons = ++cons;
+   queue->rx.rsp_cons = cons + skb_queue_len(list) + 1;
kfree_skb(nskb);
return ~0U;
}


If there are skbs left in the list when we return ~0U, queue->rx.rsp_cons
may be set incorrectly.

While I am trying to make up a case that would hit the corruption, I could
not explain why the condition is "(unlikely(skb_shinfo(skb)->nr_frags >=
MAX_SKB_FRAGS))" rather than just "==". Perhaps __pskb_pull_tail() may fail
even though pull_to is less than skb_headlen(skb).

Thank you very much!

Dongli Zhang


Re: [Xen-devel] [PATCH 1/1] xen/gnttab: print log at level XENLOG_ERR before panic

2019-04-25 Thread Dongli Zhang
Hi Andrew,

On 04/23/2019 06:37 PM, Andrew Cooper wrote:
> On 22/04/2019 13:39, Dongli Zhang wrote:
>> Print log at level XENLOG_ERR (instead XENLOG_INFO) as domain_crash()
>> indicates there is fatal error.
>>
>> Signed-off-by: Dongli Zhang 
>> ---
>>  xen/common/grant_table.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
>> index 80728ea..725c618 100644
>> --- a/xen/common/grant_table.c
>> +++ b/xen/common/grant_table.c
>> @@ -1282,7 +1282,7 @@ unmap_common(
>>  if ( unlikely((rd = rcu_lock_domain_by_id(dom)) == NULL) )
>>  {
>>  /* This can happen when a grant is implicitly unmapped. */
>> -gdprintk(XENLOG_INFO, "Could not find domain %d\n", dom);
>> +gdprintk(XENLOG_ERR, "Could not find domain %d\n", dom);
> 
> While I agree with the change in principle, what does this actually do? 
> The entire printk() is compiled out in release builds, and debug builds
> log all messages.

The objective is the case below.

If the user sets "guest_loglvl=err", this line would not be printed. Once
Xen panics via domain_crash(), it takes effort to find the domid that led
to the panic.

To verify whether the vmcore analysis is correct, the user needs to set
"guest_loglvl=all" and wait for another panic.

I agree it would be better to turn it into something like gprintk(). If Xen
always panics after a log message, we should guarantee that the message is
always printed to the console, to help analyze the root cause.

Dongli Zhang

> 
> ~Andrew
> 
>>  domain_crash(ld); /* naughty... */
>>  return;
>>  }
> 
> 

[Xen-devel] [PATCH 1/1] xen/gnttab: print log at level XENLOG_ERR before panic

2019-04-22 Thread Dongli Zhang
Print the log at level XENLOG_ERR (instead of XENLOG_INFO), as
domain_crash() indicates there is a fatal error.

Signed-off-by: Dongli Zhang 
---
 xen/common/grant_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 80728ea..725c618 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1282,7 +1282,7 @@ unmap_common(
 if ( unlikely((rd = rcu_lock_domain_by_id(dom)) == NULL) )
 {
 /* This can happen when a grant is implicitly unmapped. */
-gdprintk(XENLOG_INFO, "Could not find domain %d\n", dom);
+gdprintk(XENLOG_ERR, "Could not find domain %d\n", dom);
 domain_crash(ld); /* naughty... */
 return;
 }
-- 
2.7.4



Re: [Xen-devel] Question regarding swiotlb-xen in Linux kernel

2019-04-19 Thread Dongli Zhang


On 04/19/2019 01:47 PM, Juergen Gross wrote:
> On 19/04/2019 00:31, Joe Jin wrote:
>> On 4/18/19 2:09 PM, Boris Ostrovsky wrote:
>>> On 4/18/19 3:36 AM, Juergen Gross wrote:
>>>> I'm currently investigating a problem related to swiotlb-xen. With a
>>>> specific driver a customer is capable to trigger a situation where a
>>>> MFN is mapped to multiple dom0 PFNs at the same time. There is no
>>>> guest involved, so this is not related to grants.
>>>>
>>>> With a debug kernel I have managed to track the inconsistency to a
>>>> call of xen_destroy_contiguous_region() from xen_swiotlb_free_coherent()
>>>> where the region was obviously not contiguous.
>>>>
>>>> xen_swiotlb_free_coherent() contains:
>>>>
>>>> if (((dev_addr + size - 1 <= dma_mask)) ||
>>>> range_straddles_page_boundary(phys, size))
>>>> xen_destroy_contiguous_region(phys, order);
>>>>
>>>> Shouldn't it be either:
>>>>
>>>> if (((dev_addr + size - 1 <= dma_mask)) &&
>>>> !range_straddles_page_boundary(phys, size))
>>>> xen_destroy_contiguous_region(phys, order);
>>>
>>> +Joe
>>>
>>> https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg01920.html
>>>
>>> (The discussion happened in
>>> https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg01814.html)
>>>
>>> And looks like we dropped it. Or was there a reason we haven't picked it up?
>>
>> I remember the concern was whether the memory was from Xen or not.
> 
> The current coding is wrong.
> 
> I believe we should have at least a WARN_ONCE() in case
> xen_swiotlb_free_coherent() is being called with non-contiguous memory.
> And it should not call xen_destroy_contiguous_region() in that case.
> 
> Joe, did you observe such cases? I'm asking because the backtraces I
> have give me no clue _why_ the memory was non-contiguous. The calling
> function allocated the memory via xen_swiotlb_alloc_coherent() and when
> freeing it again it was not contiguous.
> 
> Another topic is the question whether we should really call
> xen_destroy_contiguous_region() when freeing the memory if there was no
> need to use xen_create_contiguous_region() when allocating it.

What would happen if a guest domain is malicious?

That is, a guest (including dom0) may allocate unlimited MFN-contiguous
memory and never exchange it back to Xen.

Would MFN-contiguous memory be used up? As a result, Xen might not be able
to boot new VMs. Is this a sort of DoS attack?

Dongli Zhang



[Xen-devel] [PATCH 1/1] xen-netback: add reference from xenvif to backend_info to facilitate coredump analysis

2019-04-12 Thread Dongli Zhang
During coredump analysis, it is not easy to obtain the address of
backend_info in xen-netback.

So far there are two ways to obtain backend_info:

1. Do what xenbus_device_find() does for vmcore to find the xenbus_device
and then derive it from dev_get_drvdata().

2. Extract backend_info from callstack of xenwatch (e.g., netback_remove()
or frontend_changed()).

This patch adds a reference from xenvif to backend_info so that it is much
easier to obtain backend_info during coredump analysis.

Signed-off-by: Dongli Zhang 
---
 drivers/net/xen-netback/common.h | 18 ++
 drivers/net/xen-netback/xenbus.c | 17 +
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 936c0b3..05847eb 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -248,6 +248,22 @@ struct xenvif_hash {
struct xenvif_hash_cache cache;
 };
 
+struct backend_info {
+   struct xenbus_device *dev;
+   struct xenvif *vif;
+
+   /* This is the state that will be reflected in xenstore when any
+* active hotplug script completes.
+*/
+   enum xenbus_state state;
+
+   enum xenbus_state frontend_state;
+   struct xenbus_watch hotplug_status_watch;
+   u8 have_hotplug_status_watch:1;
+
+   const char *hotplug_script;
+};
+
 struct xenvif {
/* Unique identifier for this interface. */
domid_t  domid;
@@ -283,6 +299,8 @@ struct xenvif {
struct xenbus_watch credit_watch;
struct xenbus_watch mcast_ctrl_watch;
 
+   struct backend_info *be;
+
spinlock_t lock;
 
 #ifdef CONFIG_DEBUG_FS
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 330ddb6..41c9e8f 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -22,22 +22,6 @@
 #include 
 #include 
 
-struct backend_info {
-   struct xenbus_device *dev;
-   struct xenvif *vif;
-
-   /* This is the state that will be reflected in xenstore when any
-* active hotplug script completes.
-*/
-   enum xenbus_state state;
-
-   enum xenbus_state frontend_state;
-   struct xenbus_watch hotplug_status_watch;
-   u8 have_hotplug_status_watch:1;
-
-   const char *hotplug_script;
-};
-
 static int connect_data_rings(struct backend_info *be,
  struct xenvif_queue *queue);
 static void connect(struct backend_info *be);
@@ -472,6 +456,7 @@ static int backend_create_xenvif(struct backend_info *be)
return err;
}
be->vif = vif;
+   vif->be = be;
 
kobject_uevent(&vif->dev.kobj, KOBJ_ONLINE);
return 0;
-- 
2.7.4



Re: [Xen-devel] [PATCH v4.9 1/1] jiffies: use jiffies64_to_nsecs() to fix 100% steal usage for xen vcpu hotplug

2019-03-12 Thread Dongli Zhang


On 3/12/19 8:56 PM, Greg KH wrote:
> On Wed, Mar 06, 2019 at 03:35:40PM +0800, Dongli Zhang wrote:
>> Thanks to Joe Jin's reminding, this patch is applicable to mainline linux
>> kernel, although there is no issue due to this kind of bug in mainline 
>> kernel.
>>
>> Therefore, can I first submit this patch to mainline kernel and then 
>> backport it
>> to stable linux with more detailed explanation how the issue is reproduced 
>> on xen?
> 
> Yes, please do that.

I am working on a patch to rework the xen steal clock formula via
xen-specific interfaces in the linux kernel. That way, we will not need to
change the signature of existing jiffies-related functions, which expect
only delta values as input.

The new formula will not generate such a large input value.

Dongli Zhang

> 
> thanks,
> 
> greg k-h
> 

Re: [Xen-devel] [PATCH v4.9 1/1] jiffies: use jiffies64_to_nsecs() to fix 100% steal usage for xen vcpu hotplug

2019-03-05 Thread Dongli Zhang
Thanks to Joe Jin's reminder, this patch is applicable to the mainline
linux kernel, although there is no issue due to this kind of bug in the
mainline kernel.

Therefore, can I first submit this patch to the mainline kernel and then
backport it to stable linux with a more detailed explanation of how the
issue is reproduced on xen?

This would help synchronize stable with mainline better.

Thank you very much!

Dongli Zhang

On 3/5/19 3:59 PM, Dongli Zhang wrote:
> [ Not relevant upstream, therefore no upstream commit. ]
> 
> To fix, use jiffies64_to_nsecs() directly instead of deriving the result
> according to jiffies_to_usecs().
> 
> As the return type of jiffies_to_usecs() is 'unsigned int', when the return
> value is more than the size of 'unsigned int', the leading 32 bits would be
> discarded.
> 
> Suppose USEC_PER_SEC=1000000L and HZ=1000, below are the expected and
> actual incorrect result of jiffies_to_usecs(0x7770ef70):
> 
> - expected  : jiffies_to_usecs(0x7770ef70) = 0x01d291274d80
> - incorrect : jiffies_to_usecs(0x7770ef70) = 0x91274d80
> 
> The leading 0x01d2 is discarded.
> 
> After xen vcpu hotplug and when the new vcpu steal clock is calculated for
> the first time, the result of this_rq()->prev_steal_time in
> steal_account_process_tick() would be far smaller than the expected
> value, due to that jiffies_to_usecs() discards the leading 32 bits.
> 
> As a result, the diff between current steal and this_rq()->prev_steal_time
> is always very large. Steal usage would become 100% when the initial steal
> clock obtained from xen hypervisor is very large during xen vcpu hotplug,
> that is, when the guest is already up for a long time.
> 
> The bug can be detected by doing the following:
> 
> * Boot xen guest with vcpus=2 and maxvcpus=4
> * Leave the guest running for a month so that the initial steal clock for
>   the new vcpu would be very large
> * Hotplug 2 extra vcpus
> * The steal time of new vcpus in /proc/stat would increase abnormally and
>   sometimes steal usage in top can become 100%
> 
> This was incidentally fixed in the patch set starting by
> commit 93825f2ec736 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
> and ended with
> commit b672592f0221 ("sched/cputime: Remove generic asm headers").
> 
> This version applies to the v4.9 series.
> 
> Link: https://lkml.org/lkml/2019/2/28/1373
> Suggested-by: Juergen Gross 
> Signed-off-by: Dongli Zhang 
> ---
>  include/linux/jiffies.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
> index 734377a..94aff43 100644
> --- a/include/linux/jiffies.h
> +++ b/include/linux/jiffies.h
> @@ -287,13 +287,13 @@ extern unsigned long preset_lpj;
>  extern unsigned int jiffies_to_msecs(const unsigned long j);
>  extern unsigned int jiffies_to_usecs(const unsigned long j);
>  
> +extern u64 jiffies64_to_nsecs(u64 j);
> +
>  static inline u64 jiffies_to_nsecs(const unsigned long j)
>  {
> - return (u64)jiffies_to_usecs(j) * NSEC_PER_USEC;
> + return jiffies64_to_nsecs(j);
>  }
>  
> -extern u64 jiffies64_to_nsecs(u64 j);
> -
>  extern unsigned long __msecs_to_jiffies(const unsigned int m);
>  #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
>  /*
> 


[Xen-devel] [PATCH v4.9 1/1] jiffies: use jiffies64_to_nsecs() to fix 100% steal usage for xen vcpu hotplug

2019-03-04 Thread Dongli Zhang
[ Not relevant upstream, therefore no upstream commit. ]

To fix, use jiffies64_to_nsecs() directly instead of deriving the result
according to jiffies_to_usecs().

As the return type of jiffies_to_usecs() is 'unsigned int', when the return
value is more than the size of 'unsigned int', the leading 32 bits would be
discarded.

Suppose USEC_PER_SEC=1000000L and HZ=1000, below are the expected and
actual incorrect result of jiffies_to_usecs(0x7770ef70):

- expected  : jiffies_to_usecs(0x7770ef70) = 0x01d291274d80
- incorrect : jiffies_to_usecs(0x7770ef70) = 0x91274d80

The leading 0x01d2 is discarded.
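
The truncation can be reproduced with a few lines of userspace C (a sketch
of the arithmetic only, assuming HZ=1000 so that one jiffy is 1000 usecs):

#include <stdio.h>

int main(void)
{
	unsigned long j = 0x7770ef70UL;                  /* jiffies delta */
	unsigned long long usecs = (unsigned long long)j * 1000;

	/* jiffies_to_usecs() returns 'unsigned int', so the result is
	 * silently truncated to the low 32 bits */
	unsigned int truncated = (unsigned int)usecs;

	printf("expected : %#llx\n", usecs);      /* 0x1d291274d80 */
	printf("truncated: %#x\n", truncated);    /* 0x91274d80 */

	return 0;
}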

After xen vcpu hotplug and when the new vcpu steal clock is calculated for
the first time, the result of this_rq()->prev_steal_time in
steal_account_process_tick() would be far smaller than the expected
value, due to that jiffies_to_usecs() discards the leading 32 bits.

As a result, the diff between current steal and this_rq()->prev_steal_time
is always very large. Steal usage would become 100% when the initial steal
clock obtained from xen hypervisor is very large during xen vcpu hotplug,
that is, when the guest is already up for a long time.

The bug can be detected by doing the following:

* Boot xen guest with vcpus=2 and maxvcpus=4
* Leave the guest running for a month so that the initial steal clock for
  the new vcpu would be very large
* Hotplug 2 extra vcpus
* The steal time of new vcpus in /proc/stat would increase abnormally and
  sometimes steal usage in top can become 100%

This was incidentally fixed in the patch set starting by
commit 93825f2ec736 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
and ended with
commit b672592f0221 ("sched/cputime: Remove generic asm headers").

This version applies to the v4.9 series.

Link: https://lkml.org/lkml/2019/2/28/1373
Suggested-by: Juergen Gross 
Signed-off-by: Dongli Zhang 
---
 include/linux/jiffies.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 734377a..94aff43 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -287,13 +287,13 @@ extern unsigned long preset_lpj;
 extern unsigned int jiffies_to_msecs(const unsigned long j);
 extern unsigned int jiffies_to_usecs(const unsigned long j);
 
+extern u64 jiffies64_to_nsecs(u64 j);
+
 static inline u64 jiffies_to_nsecs(const unsigned long j)
 {
-   return (u64)jiffies_to_usecs(j) * NSEC_PER_USEC;
+   return jiffies64_to_nsecs(j);
 }
 
-extern u64 jiffies64_to_nsecs(u64 j);
-
 extern unsigned long __msecs_to_jiffies(const unsigned int m);
 #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
 /*
-- 
2.7.4



Re: [Xen-devel] [BUG linux-4.9.x] xen hotplug cpu leads to 100% steal usage

2019-03-04 Thread Dongli Zhang
Hi Thomas,

On 3/2/19 7:43 AM, Thomas Gleixner wrote:
> On Thu, 28 Feb 2019, Dongli Zhang wrote:
>>
>> The root cause is that the return type of jiffies_to_usecs() is 'unsigned 
>> int',
>> but not 'unsigned long'. As a result, the leading 32 bits are discarded.
> 
> Errm. No. The root cause is that jiffies_to_usecs() is used for that in the
> first place. The function has been that way forever and all usage sites
> (except a broken dev_debug print in infiniband) feed delta values. Yes, it
> could have documentation

Thank you very much for the explanation. It would help developers clarify
the usage of jiffies_to_usecs() (which we should always feed with a delta
value) if there were comments above it.

Indeed, the input value in this bug is also a delta value. Because of the
special mechanism used by xen to account the steal clock, the initial delta
value is very large, but only when a new cpu is added after the VM has
already been up for a very long time.

Dongli Zhang


> 
>> jiffies_to_usecs() is indirectly triggered by cputime_to_nsecs() at line 264.
>> If guest is already up for long time, the initial steal time for new vcpu 
>> might
>> be large and the leading 32 bits of jiffies_to_usecs() would be discarded.
> 
>> So far, I have two solutions:
>>
>> 1. Change the return type from 'unsigned int' to 'unsigned long' as in above
>> link and I am afraid it would bring side effect. The return type in latest
>> mainline kernel is still 'unsigned int'.
> 
> Changing it to unsigned long would just solve the issue for 64bit.
> 
> Thanks,
> 
>   tglx
> 


Re: [Xen-devel] [BUG linux-4.9.x] xen hotplug cpu leads to 100% steal usage

2019-03-04 Thread Dongli Zhang
Hi Juergen,

On 3/4/19 4:14 PM, Juergen Gross wrote:
> On 01/03/2019 03:35, Dongli Zhang wrote:
>> This issue is only for stable 4.9.x (e.g., 4.9.160), while the root cause is
>> still in the lasted mainline kernel.
>>
>> This is obviated by new feature patch set ended with b672592f0221
>> ("sched/cputime: Remove generic asm headers").
>>
>> After xen guest is up for long time, once we hotplug new vcpu, the 
>> corresponding
>> steal usage might become 100% and the steal time from /proc/stat would 
>> increase
>> abnormally.
>>
>> As we cannot wait for long time to reproduce the issue, here is how I 
>> reproduce
>> it on purpose by accounting a large initial steal clock for new vcpu 2 and 3.
>>
>> 1. Apply the below patch to guest 4.9.160 to account large initial steal 
>> clock
>> for new vcpu 2 and 3:
>>
>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>> index ac5f23f..3cf629e 100644
>> --- a/drivers/xen/time.c
>> +++ b/drivers/xen/time.c
>> @@ -85,7 +85,14 @@ u64 xen_steal_clock(int cpu)
>> struct vcpu_runstate_info state;
>>  
>> xen_get_runstate_snapshot_cpu(&state, cpu);
>> -   return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
>> +
>> +   if (cpu == 2 || cpu == 3)
>> +   return state.time[RUNSTATE_runnable]
>> +  + state.time[RUNSTATE_offline]
>> +  + 0x00071e87e677aa12;
>> +   else
>> +   return state.time[RUNSTATE_runnable]
>> +  + state.time[RUNSTATE_offline];
>>  }
>>  
>>  void xen_setup_runstate_info(int cpu)
>>
>>
>> 2. Boot hvm guest with "vcpus=2" and "maxvcpus=4". By default, VM boot with
>> vcpu 0 and 1.
>>
>> 3. Hotplug vcpu 2 and 3 via "xl vcpu-set <domid> 4" on dom0.
>>
>> In my env, the steal becomes 100% within 10s after the "xl vcpu-set" command 
>> on
>> dom0.
>>
>> I can reproduce on kvm with similar method. However, as the initial steal 
>> clock
>> on kvm guest is always 0, I do not think it is easy to hit this issue on kvm.
>>
>> 
>>
>> The root cause is that the return type of jiffies_to_usecs() is 'unsigned 
>> int',
>> but not 'unsigned long'. As a result, the leading 32 bits are discarded.
>>
>> jiffies_to_usecs() is indirectly triggered by cputime_to_nsecs() at line 264.
>> If guest is already up for long time, the initial steal time for new vcpu 
>> might
>> be large and the leading 32 bits of jiffies_to_usecs() would be discarded.
>>
>> As a result, the steal at line 259 is always large and the
>> this_rq()->prev_steal_time at line 264 is always small. The difference at 
>> line
>> 260 is always large during each time steal_account_process_time() is 
>> involved.
>> Finally, the steal time in /proc/stat would increase abnormally.
>>
>> 252 static __always_inline cputime_t steal_account_process_time(cputime_t 
>> maxtime)
>> 253 {
>> 254 #ifdef CONFIG_PARAVIRT
>> 255 if (static_key_false(&paravirt_steal_enabled)) {
>> 256 cputime_t steal_cputime;
>> 257 u64 steal;
>> 258 
>> 259 steal = paravirt_steal_clock(smp_processor_id());
>> 260 steal -= this_rq()->prev_steal_time;
>> 261 
>> 262 steal_cputime = min(nsecs_to_cputime(steal), maxtime);
>> 263 account_steal_time(steal_cputime);
>> 264 this_rq()->prev_steal_time += 
>> cputime_to_nsecs(steal_cputime);
>> 265 
>> 266 return steal_cputime;
>> 267 }
>> 268 #endif
>> 269 return 0;
>> 270 }
>>
>> 
>>
>> I have emailed the kernel mailing list about the return type of
>> jiffies_to_usecs() and jiffies_to_msecs():
>>
>> https://lkml.org/lkml/2019/2/26/899
>>
>>
>> So far, I have two solutions:
>>
>> 1. Change the return type from 'unsigned int' to 'unsigned long' as in above
>> link and I am afraid it would bring side effect. The return type in latest
>> mainline kernel is still 'unsigned int'.
>>
>> 2. Something like below based on stable 4.9.160:
> 
> 3. use jiffies64_to_nsecs() instead of trying to open code it.

Thank you very much for the suggestion!

I have tested that jiffies64_to_nsecs() works well by reproducing the issue
in a kvm guest.

[Xen-devel] [BUG linux-4.9.x] xen hotplug cpu leads to 100% steal usage

2019-02-28 Thread Dongli Zhang
+   return (HZ_TO_USEC_MUL32 * j) >> HZ_TO_USEC_SHR32;
+# else
+   return (j * HZ_TO_USEC_NUM) / HZ_TO_USEC_DEN;
+# endif
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_usecs64);
+
+
 /**
  * timespec_trunc - Truncate timespec to a granularity
  * @t: Timespec

People may dislike the 2nd solution.

3. Backport the patch set ending with b672592f0221 ("sched/cputime:
Remove generic asm headers").

This is not reasonable for a stable branch as the patch set involves
lots of changes.


Would you please let me know if you have any suggestions on this issue?

Thank you very much!

Dongli Zhang


Re: [Xen-devel] [PATCH v6 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-02-18 Thread Dongli Zhang
Hi Konrad,

On 1/17/19 11:29 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 15, 2019 at 09:20:36AM +0100, Roger Pau Monné wrote:
>> On Tue, Jan 15, 2019 at 12:41:44AM +0800, Dongli Zhang wrote:
>>> The xenstore 'ring-page-order' is used globally for each blkback queue and
>>> therefore should be read from xenstore only once. However, it is obtained
>>> in read_per_ring_refs() which might be called multiple times during the
>>> initialization of each blkback queue.
>>>
>>> If the blkfront is malicious and the 'ring-page-order' is set in different
>>> value by blkfront every time before blkback reads it, this may end up at
>>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
>>> xen_blkif_disconnect() when frontend is destroyed.
>>>
>>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
>>> once.
>>>
>>> Signed-off-by: Dongli Zhang 
>>
>> LGTM:
>>
>> Reviewed-by: Roger Pau Monné 
> 
> Applied.
> 
> Will push out to Jens in a couple of days. Thank you!
>>
>> Thanks!

I only see PATCH 1/2 (xen/blkback: add stack variable 'blkif' in
connect_ring()) at the top of the branch below for Jens; would you please
apply this one (PATCH 2/2) as well?

https://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git/log/?h=linux-next

1/2 is only a prerequisite for 2/2.

Thank you very much!

Dongli Zhang


Re: [Xen-devel] [admin] [Pkg-xen-devel] [BUG] task jbd2/xvda4-8:174 blocked for more than 120 seconds.

2019-02-17 Thread Dongli Zhang


On 2/18/19 5:29 AM, Samuel Thibault wrote:
> Hello,
> 
> Dongli Zhang, le mar. 12 févr. 2019 12:11:20 +0800, a ecrit:
>> On 02/12/2019 06:10 AM, Samuel Thibault wrote:
>>> Hans van Kranenburg, le lun. 11 févr. 2019 22:59:11 +0100, a ecrit:
>>>> On 2/11/19 2:37 AM, Dongli Zhang wrote:
>>>>>
>>>>> On 2/10/19 12:35 AM, Samuel Thibault wrote:
>>>>>>
>>>>>> Hans van Kranenburg, le sam. 09 févr. 2019 17:01:55 +0100, a ecrit:
>>>>>>>> I have forwarded the original mail: all VM I/O get stuck, and thus the
>>>>>>>> VM becomes unusable.
>>>>>>>
>>>>>>> These are in many cases the symptoms of running out of "grant frames".
>>>>>>
>>>>>> Oh!  That could be it indeed.  I'm wondering what could be monopolizing
>>>>>> them, though, and why +deb9u11 is affected while +deb9u10 is not.  I'm
>>>>>> afraid increasing the gnttab max size to 32 might just defer filling it
>>>>>> up.
>>>>>>
>>>>>>>   -# ./xen-diag  gnttab_query_size 5
>>>>>>>   domid=5: nr_frames=11, max_nr_frames=32
>>>>>>
>>>>>> The current value is 31 over max 32 indeed.
>>>>>
>>>>> Assuming this is grant v1, there are still 4096/8=512 grant references 
>>>>> available
>>>>> (32-31=1 frame available). I do not think the I/O hang can be affected by 
>>>>> the
>>>>> lack of grant entry.
>>>>
>>>> I suspect that 31 measurement was taken when the domU was not hanging yet.
>>>
>>> Indeed, I didn't have the hanging VM offhand.  I have looked again, it's
>>> now at 33. We'll have to monitor to check that it doesn't continue just
>>> increasing.
>>
>> If the max used to be 32 and the current is already 33, this indicates the 
>> grant
>> entries might be used up in the past before the max_nr_frames is tuned.
> 
> The number seems to be going up by about one every day. So probably a
> grant entry leak somewhere :/

This might not be a grant leak. The block pv driver can hold persistent
grants for a long time.

Juergen has introduced a feature to reclaim stale grants:


blkfront since a46b53672b2c2e3770b38a4abf90d16364d2584b

blkback since 973e5405f2f67ddbb2bf07b3ffc71908a37fea8e

Dongli Zhang

> 
> Samuel
> 


Re: [Xen-devel] [admin] [Pkg-xen-devel] [BUG] task jbd2/xvda4-8:174 blocked for more than 120 seconds.

2019-02-11 Thread Dongli Zhang


On 02/12/2019 06:10 AM, Samuel Thibault wrote:
> Hans van Kranenburg, le lun. 11 févr. 2019 22:59:11 +0100, a ecrit:
>> On 2/11/19 2:37 AM, Dongli Zhang wrote:
>>>
>>> On 2/10/19 12:35 AM, Samuel Thibault wrote:
>>>>
>>>> Hans van Kranenburg, le sam. 09 févr. 2019 17:01:55 +0100, a ecrit:
>>>>>> I have forwarded the original mail: all VM I/O get stuck, and thus the
>>>>>> VM becomes unusable.
>>>>>
>>>>> These are in many cases the symptoms of running out of "grant frames".
>>>>
>>>> Oh!  That could be it indeed.  I'm wondering what could be monopolizing
>>>> them, though, and why +deb9u11 is affected while +deb9u10 is not.  I'm
>>>> afraid increasing the gnttab max size to 32 might just defer filling it
>>>> up.
>>>>
>>>>>   -# ./xen-diag  gnttab_query_size 5
>>>>>   domid=5: nr_frames=11, max_nr_frames=32
>>>>
>>>> The current value is 31 over max 32 indeed.
>>>
>>> Assuming this is grant v1, there are still 4096/8=512 grant references 
>>> available
>>> (32-31=1 frame available). I do not think the I/O hang can be affected by 
>>> the
>>> lack of grant entry.
>>
>> I suspect that 31 measurement was taken when the domU was not hanging yet.
> 
> Indeed, I didn't have the hanging VM offhand.  I have looked again, it's
> now at 33. We'll have to monitor to check that it doesn't continue just
> increasing.

If the max used to be 32 and the current is already 33, this indicates that
grant entries might have been used up in the past, before max_nr_frames was
tuned.

Dongli Zhang

> 
> Samuel
> 


Re: [Xen-devel] [admin] [Pkg-xen-devel] [BUG] task jbd2/xvda4-8:174 blocked for more than 120 seconds.

2019-02-10 Thread Dongli Zhang


On 2/10/19 12:35 AM, Samuel Thibault wrote:
> Hello,
> 
> Hans van Kranenburg, le sam. 09 févr. 2019 17:01:55 +0100, a ecrit:
>>> I have forwarded the original mail: all VM I/O get stuck, and thus the
>>> VM becomes unusable.
>>
>> These are in many cases the symptoms of running out of "grant frames".
> 
> Oh!  That could be it indeed.  I'm wondering what could be monopolizing
> them, though, and why +deb9u11 is affected while +deb9u10 is not.  I'm
> afraid increasing the gnttab max size to 32 might just defer filling it
> up.
> 
>>   -# ./xen-diag  gnttab_query_size 5
>>   domid=5: nr_frames=11, max_nr_frames=32
> 
> The current value is 31 over max 32 indeed.

Assuming this is grant v1, there are still 4096/8=512 grant references
available (32-31=1 frame available). I do not think the I/O hang can be
caused by a lack of grant entries.

If increasing the max frames to 64 takes effect, it is strange that the I/O
would hang while 512 entries were still available.
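
For reference, a back-of-the-envelope sketch of that arithmetic (assuming
grant table v1, where each entry is 8 bytes):

#include <stdio.h>

int main(void)
{
	unsigned int frame_size = 4096;  /* one grant table frame */
	unsigned int entry_size = 8;     /* sizeof(grant_entry_v1_t) */
	unsigned int nr_frames = 31;     /* from "xen-diag gnttab_query_size" */
	unsigned int max_frames = 32;

	unsigned int per_frame = frame_size / entry_size;             /* 512 */
	unsigned int headroom = (max_frames - nr_frames) * per_frame;

	printf("%u grant references still available\n", headroom);    /* 512 */

	return 0;
}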

Dongli Zhang

> 
>> With Xen 4.8, you can add gnttab_max_frames=64 (or another number, but
>> higher than the default 32) to the xen hypervisor command line and reboot.
> 
> admin@: I made the modification in the grub config. We can probably try
> to reboot with the newer hypervisor, and monitor that value.
> 
> Samuel
> 

Re: [Xen-devel] [PATCH 1/1] xen-blkback: do not wake up shutdown_wq after xen_blkif_schedule() is stopped

2019-01-17 Thread Dongli Zhang
Hi Roger,

On 01/17/2019 04:20 PM, Roger Pau Monné wrote:
> On Thu, Jan 17, 2019 at 10:10:00AM +0800, Dongli Zhang wrote:
>> Hi Roger,
>>
>> On 2019/1/16 下午10:52, Roger Pau Monné wrote:
>>> On Wed, Jan 16, 2019 at 09:47:41PM +0800, Dongli Zhang wrote:
>>>> There is no need to wake up xen_blkif_schedule() as kthread_stop() is able
>>>> to already wake up the kernel thread.
>>>>
>>>> Signed-off-by: Dongli Zhang 
>>>> ---
>>>>  drivers/block/xen-blkback/xenbus.c | 4 +---
>>>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>>>> b/drivers/block/xen-blkback/xenbus.c
>>>> index a4bc74e..37ac59e 100644
>>>> --- a/drivers/block/xen-blkback/xenbus.c
>>>> +++ b/drivers/block/xen-blkback/xenbus.c
>>>> @@ -254,10 +254,8 @@ static int xen_blkif_disconnect(struct xen_blkif 
>>>> *blkif)
>>>>if (!ring->active)
>>>>continue;
>>>>  
>>>> -  if (ring->xenblkd) {
>>>> +  if (ring->xenblkd)
>>>>kthread_stop(ring->xenblkd);
>>>> -  wake_up(&ring->shutdown_wq);
>>>
>>> I've now realized that shutdown_wq is basically useless, and should be
>>> removed, could you please do it in this patch?
>>
>> I do not think shutdown_wq is useless.
>>
>> It is used to halt the xen_blkif_schedule() kthread when
>> RING_REQUEST_PROD_OVERFLOW() returns true in __do_block_io_op():
>>
>> 1121 static int
>> 1122 __do_block_io_op(struct xen_blkif_ring *ring)
>> ... ...
>> 1134 if (RING_REQUEST_PROD_OVERFLOW(&blk_rings->common, rp)) {
>> 1135 rc = blk_rings->common.rsp_prod_pvt;
>> 1136 pr_warn("Frontend provided bogus ring requests (%d - %d 
>> =
>> %d). Halting ring processing on dev=%04x\n",
>> 1137 rp, rc, rp - rc, ring->blkif->vbd.pdevice);
>> 1138 return -EACCES;
>> 1139 }
>>
>>
>> If there is bogus/invalid ring requests, __do_block_io_op() would return 
>> -EACCES
>> without modifying prod/cons index.
>>
>> Without shutdown_wq (just simply assuming we remove the below code without
>> handling -EACCES in xen_blkif_schedule()), the kernel thread would continue 
>> the
>> while loop.
>>
>> 648 if (ret == -EACCES)
>> 649 wait_event_interruptible(ring->shutdown_wq,
>> 650  kthread_should_stop());
>>
>>
>> If xen_blkif_be_int() is triggered again (let's assume there is no 
>> optimization
>> on guest part and guest would send event for every request it puts on ring
>> buffer), we may come to do_block_io_op() again.
>>
>>
>> As the prod/cons index are not modified last time the code runs into
>> do_block_io_op() to process bogus request, we would hit the bogus request 
>> issue
>> again.
>>
>>
>> With shutdown_wq, the kernel kthread is blocked forever until such 
>> queue/ring is
>> destroyed. If we remove shutdown_wq, we are changing the policy to handle 
>> bogus
>> requests on ring buffer?
> 
> AFAICT the only wakeup call to shutdown_wq is removed in this patch,
> hence waiting on it seems useless. I would replace the
> wait_event_interruptible call in xen_blkif_schedule with a break, so
> that the kthread ends as soon as a bogus request is found. I think
> there's no point in waiting for xen_blkif_disconnect to stop the
> kthread.
> 
> Thanks, Roger.
> 

My fault, the shutdown_wq is useless. I think I was going to say that
wait_event_interruptible() is useful, as it is used to halt the kthread
when there is a bogus request.

It is fine to replace wait_event_interruptible() with a break to exit
immediately when there is a bogus request.

Thank you very much!

Dongli Zhang


Re: [Xen-devel] [PATCH 1/1] xen-blkback: do not wake up shutdown_wq after xen_blkif_schedule() is stopped

2019-01-16 Thread Dongli Zhang
Hi Roger,

On 2019/1/16 下午10:52, Roger Pau Monné wrote:
> On Wed, Jan 16, 2019 at 09:47:41PM +0800, Dongli Zhang wrote:
>> There is no need to wake up xen_blkif_schedule() as kthread_stop() is able
>> to already wake up the kernel thread.
>>
>> Signed-off-by: Dongli Zhang 
>> ---
>>  drivers/block/xen-blkback/xenbus.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>> b/drivers/block/xen-blkback/xenbus.c
>> index a4bc74e..37ac59e 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -254,10 +254,8 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
>>  if (!ring->active)
>>  continue;
>>  
>> -if (ring->xenblkd) {
>> +if (ring->xenblkd)
>>  kthread_stop(ring->xenblkd);
>> -wake_up(&ring->shutdown_wq);
> 
> I've now realized that shutdown_wq is basically useless, and should be
> removed, could you please do it in this patch?

I do not think shutdown_wq is useless.

It is used to halt the xen_blkif_schedule() kthread when
RING_REQUEST_PROD_OVERFLOW() returns true in __do_block_io_op():

1121 static int
1122 __do_block_io_op(struct xen_blkif_ring *ring)
... ...
1134 if (RING_REQUEST_PROD_OVERFLOW(&blk_rings->common, rp)) {
1135 rc = blk_rings->common.rsp_prod_pvt;
1136 pr_warn("Frontend provided bogus ring requests (%d - %d =
%d). Halting ring processing on dev=%04x\n",
1137 rp, rc, rp - rc, ring->blkif->vbd.pdevice);
1138 return -EACCES;
1139 }


If there are bogus/invalid ring requests, __do_block_io_op() would return
-EACCES without modifying the prod/cons indexes.

Without shutdown_wq (simply assuming we remove the code below without
handling -EACCES in xen_blkif_schedule()), the kernel thread would continue
the while loop.

648 if (ret == -EACCES)
649 wait_event_interruptible(ring->shutdown_wq,
650  kthread_should_stop());


If xen_blkif_be_int() is triggered again (let's assume there is no
optimization on the guest side and the guest sends an event for every
request it puts on the ring buffer), we may come to do_block_io_op() again.


As the prod/cons indexes were not modified the last time the code ran into
do_block_io_op() to process the bogus request, we would hit the bogus
request issue again.
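
A toy model of that livelock (the index values are invented; the point is
that nothing advances between event kicks):

#include <stdio.h>

int main(void)
{
	unsigned int rsp_prod_pvt = 10;  /* responses produced so far */
	unsigned int req_prod = 200;     /* bogus: far beyond ring capacity */
	unsigned int ring_size = 32;

	for (int kick = 0; kick < 3; kick++) {  /* repeated guest events */
		if (req_prod - rsp_prod_pvt > ring_size)
			/* the -EACCES path leaves the indexes untouched, so
			 * the very same bogus state is seen on each kick */
			printf("kick %d: bogus ring, halt needed\n", kick);
	}

	return 0;
}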


With shutdown_wq, the kernel kthread is blocked forever until the
queue/ring is destroyed. If we remove shutdown_wq, are we changing the
policy for handling bogus requests on the ring buffer?

Please correct me if my understanding is wrong.

Thank you very much!

Dongli Zhang


Re: [Xen-devel] [PATCH v5 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-16 Thread Dongli Zhang


On 2019/1/17 上午12:32, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 08, 2019 at 04:24:32PM +0800, Dongli Zhang wrote:
>> oops. Please ignore this v5 patch.
>>
>> I just realized Linus suggested in an old email not use BUG()/BUG_ON() in 
>> the code.
>>
>> I will switch to the WARN() solution and resend again.
> 
> OK. Did I miss it?

Hi Konrad,

If you are talking about the new patch set:

https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg00977.html
https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg00978.html

About Linus' suggestion:

http://lkml.iu.edu/hypermail/linux/kernel/1610.0/00878.html

Dongli Zhang

> 

[Xen-devel] [PATCH 1/1] xen-blkback: do not wake up shutdown_wq after xen_blkif_schedule() is stopped

2019-01-16 Thread Dongli Zhang
There is no need to wake up xen_blkif_schedule(), as kthread_stop() is
already able to wake up the kernel thread.

Signed-off-by: Dongli Zhang 
---
 drivers/block/xen-blkback/xenbus.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4bc74e..37ac59e 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -254,10 +254,8 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
if (!ring->active)
continue;
 
-   if (ring->xenblkd) {
+   if (ring->xenblkd)
kthread_stop(ring->xenblkd);
-   wake_up(&ring->shutdown_wq);
-   }
 
/* The above kthread_stop() guarantees that at this point we
 * don't have any discard_io or other_io requests. So, checking
-- 
2.7.4



[Xen-devel] [PATCH v6 1/2] xen/blkback: add stack variable 'blkif' in connect_ring()

2019-01-14 Thread Dongli Zhang
As 'be->blkif' is used many times in connect_ring(), the stack variable
'blkif' is added to substitute for 'be->blkif'.

Suggested-by: Paul Durrant 
Signed-off-by: Dongli Zhang 
Reviewed-by: Paul Durrant 
Reviewed-by: Roger Pau Monné 
---
 drivers/block/xen-blkback/xenbus.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4bc74e..a4aadac 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -1023,6 +1023,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
 static int connect_ring(struct backend_info *be)
 {
struct xenbus_device *dev = be->dev;
+   struct xen_blkif *blkif = be->blkif;
unsigned int pers_grants;
char protocol[64] = "";
int err, i;
@@ -1033,25 +1034,25 @@ static int connect_ring(struct backend_info *be)
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
+   blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
err = xenbus_scanf(XBT_NIL, dev->otherend, "protocol",
   "%63s", protocol);
if (err <= 0)
strcpy(protocol, "unspecified, assuming default");
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_NATIVE))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
+   blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_X86_32))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_32;
+   blkif->blk_protocol = BLKIF_PROTOCOL_X86_32;
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_X86_64))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
+   blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
else {
xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
return -ENOSYS;
}
pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
   0);
-   be->blkif->vbd.feature_gnt_persistent = pers_grants;
-   be->blkif->vbd.overflow_max_grants = 0;
+   blkif->vbd.feature_gnt_persistent = pers_grants;
+   blkif->vbd.overflow_max_grants = 0;
 
/*
 * Read the number of hardware queues from frontend.
@@ -1067,16 +1068,16 @@ static int connect_ring(struct backend_info *be)
requested_num_queues, xenblk_max_queues);
return -ENOSYS;
}
-   be->blkif->nr_rings = requested_num_queues;
-   if (xen_blkif_alloc_rings(be->blkif))
+   blkif->nr_rings = requested_num_queues;
+   if (xen_blkif_alloc_rings(blkif))
return -ENOMEM;
 
pr_info("%s: using %d queues, protocol %d (%s) %s\n", dev->nodename,
-be->blkif->nr_rings, be->blkif->blk_protocol, protocol,
+blkif->nr_rings, blkif->blk_protocol, protocol,
 pers_grants ? "persistent grants" : "");
 
-   if (be->blkif->nr_rings == 1)
-   return read_per_ring_refs(&be->blkif->rings[0], dev->otherend);
+   if (blkif->nr_rings == 1)
+   return read_per_ring_refs(&blkif->rings[0], dev->otherend);
else {
xspathsize = strlen(dev->otherend) + xenstore_path_ext_size;
xspath = kmalloc(xspathsize, GFP_KERNEL);
@@ -1085,10 +1086,10 @@ static int connect_ring(struct backend_info *be)
return -ENOMEM;
}
 
-   for (i = 0; i < be->blkif->nr_rings; i++) {
+   for (i = 0; i < blkif->nr_rings; i++) {
memset(xspath, 0, xspathsize);
snprintf(xspath, xspathsize, "%s/queue-%u", 
dev->otherend, i);
-   err = read_per_ring_refs(&be->blkif->rings[i], xspath);
+   err = read_per_ring_refs(&blkif->rings[i], xspath);
if (err) {
kfree(xspath);
return err;
-- 
2.7.4



[Xen-devel] [PATCH v6 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-14 Thread Dongli Zhang
The xenstore 'ring-page-order' is used globally for each blkback queue and
therefore should be read from xenstore only once. However, it is obtained
in read_per_ring_refs() which might be called multiple times during the
initialization of each blkback queue.

If the blkfront is malicious and the 'ring-page-order' is set in different
value by blkfront every time before blkback reads it, this may end up at
the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
xen_blkif_disconnect() when frontend is destroyed.

This patch reworks connect_ring() to read xenstore 'ring-page-order' only
once.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  * change the order of xenstore read in read_per_ring_refs
  * use xenbus_read_unsigned() in connect_ring()

Changed since v2:
  * simplify the condition check as "(err != 1 && nr_grefs > 1)"
  * avoid setting err as -EINVAL to remove extra one line of code

Changed since v3:
  * exit at the beginning if !nr_grefs
  * change the if statements to avoid test (err != 1) twice
  * initialize a 'blkif' stack variable (refer to PATCH 1/2)

Changed since v4:
  * use BUG_ON() when (nr_grefs == 0) to remind the developer
  * set err = -EINVAL before xenbus_dev_fatal()

Changed since v5:
  * use WARN_ON() instead of BUG_ON()

 drivers/block/xen-blkback/xenbus.c | 72 +++---
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4aadac..0878e55 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring *ring, 
const char *dir)
int err, i, j;
struct xen_blkif *blkif = ring->blkif;
struct xenbus_device *dev = blkif->be->dev;
-   unsigned int ring_page_order, nr_grefs, evtchn;
+   unsigned int nr_grefs, evtchn;
 
err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
  &evtchn);
@@ -936,43 +936,42 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
return err;
}
 
-   err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
- &ring_page_order);
-   if (err != 1) {
-   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
&ring_ref[0]);
+   nr_grefs = blkif->nr_ring_pages;
+
+   if (unlikely(!nr_grefs)) {
+   WARN_ON(true);
+   return -EINVAL;
+   }
+
+   for (i = 0; i < nr_grefs; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
+   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
+  "%u", _ref[i]);
+
if (err != 1) {
+   if (nr_grefs == 1)
+   break;
+
err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
+   xenbus_dev_fatal(dev, err, "reading %s/%s",
+dir, ring_ref_name);
return err;
}
-   nr_grefs = 1;
-   } else {
-   unsigned int i;
+   }
+
+   if (err != 1) {
+   WARN_ON(nr_grefs != 1);
 
-   if (ring_page_order > xen_blkif_max_ring_order) {
+   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
+  &ring_ref[0]);
+   if (err != 1) {
err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "%s/request %d ring page 
order exceed max:%d",
-dir, ring_page_order,
-xen_blkif_max_ring_order);
+   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
return err;
}
-
-   nr_grefs = 1 << ring_page_order;
-   for (i = 0; i < nr_grefs; i++) {
-   char ring_ref_name[RINGREF_NAME_LEN];
-
-   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
-   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
-  "%u", _ref[i]);
-   if (err != 1) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/%s",
-dir, ring_ref_name);
-   return err;
-   }
-   }
}
-   blkif->nr_ring_pages = nr_grefs;

Re: [Xen-devel] [PATCH v4 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-08 Thread Dongli Zhang
Hi Roger,

On 01/07/2019 11:27 PM, Roger Pau Monné wrote:
> On Mon, Jan 07, 2019 at 10:07:34PM +0800, Dongli Zhang wrote:
>>
>>
>> On 01/07/2019 10:05 PM, Dongli Zhang wrote:
>>>
>>>
>>> On 01/07/2019 08:01 PM, Roger Pau Monné wrote:
>>>> On Mon, Jan 07, 2019 at 01:35:59PM +0800, Dongli Zhang wrote:
>>>>> The xenstore 'ring-page-order' is used globally for each blkback queue and
>>>>> therefore should be read from xenstore only once. However, it is obtained
>>>>> in read_per_ring_refs() which might be called multiple times during the
>>>>> initialization of each blkback queue.
>>>>>
>>>>> If the blkfront is malicious and the 'ring-page-order' is set in different
>>>>> value by blkfront every time before blkback reads it, this may end up at
>>>>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
>>>>> xen_blkif_disconnect() when frontend is destroyed.
>>>>>
>>>>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
>>>>> once.
>>>>>
>>>>> Signed-off-by: Dongli Zhang 
>>>>> ---
>>>>> Changed since v1:
>>>>>   * change the order of xenstore read in read_per_ring_refs
>>>>>   * use xenbus_read_unsigned() in connect_ring()
>>>>>
>>>>> Changed since v2:
>>>>>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
>>>>>   * avoid setting err as -EINVAL to remove extra one line of code
>>>>>
>>>>> Changed since v3:
>>>>>   * exit at the beginning if !nr_grefs
>>>>>   * change the if statements to avoid test (err != 1) twice
>>>>>   * initialize a 'blkif' stack variable (refer to PATCH 1/2)
>>>>>
>>>>>  drivers/block/xen-blkback/xenbus.c | 76 
>>>>> +-
>>>>>  1 file changed, 43 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>>>>> b/drivers/block/xen-blkback/xenbus.c
>>>>> index a4aadac..a2acbc9 100644
>>>>> --- a/drivers/block/xen-blkback/xenbus.c
>>>>> +++ b/drivers/block/xen-blkback/xenbus.c
>>>>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>>>>> *ring, const char *dir)
>>>>>   int err, i, j;
>>>>>   struct xen_blkif *blkif = ring->blkif;
>>>>>   struct xenbus_device *dev = blkif->be->dev;
>>>>> - unsigned int ring_page_order, nr_grefs, evtchn;
>>>>> + unsigned int nr_grefs, evtchn;
>>>>>  
>>>>>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
>>>>> &evtchn);
>>>>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>>>>> *ring, const char *dir)
>>>>>   return err;
>>>>>   }
>>>>>  
>>>>> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
>>>>> -   &ring_page_order);
>>>>> - if (err != 1) {
>>>>> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
>>>>> &ring_ref[0]);
>>>>> + nr_grefs = blkif->nr_ring_pages;
>>>>> +
>>>>> + if (unlikely(!nr_grefs))
>>>>> + return -EINVAL;
>>>>
>>>> Is this even possible? AFAICT read_per_ring_refs will always be called
>>>> with blkif->nr_ring_pages != 0?
>>>>
>>>> If so, I would consider turning this into a BUG_ON/WARN_ON.
>>>
>>> It used to be "WARN_ON(!nr_grefs);" in the v3 of the patch.
>>>
>>> I would turn it into WARN_ON if it is fine with both Paul and you.
>>
>> To clarify, I would use WARN_ON() before exit with -EINVAL (when
>> blkif->nr_ring_pages is 0).
> 
> Given that this function will never be called with nr_ring_pages == 0
> I would be fine with just using a BUG_ON, getting here with
> nr_ring_pages == 0 would imply memory corruption or some other severe
> issue has happened, and there's no possible recovery.
> 
> If you want to instead keep the return, please use plain WARN instead
> of WARN_ON.
> 
> Thanks, Roger.
> 

Is there any reason to use WARN rather than WARN_ON? Because of the message
printed by WARN? Something like below?

if (unlikely(!nr_grefs)) {
        WARN(!nr_grefs,
             "read_per_ring_refs() should be called with blkif->nr_ring_pages != 0");
        return -EINVAL;
}
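
For context, both helpers dump a stack trace when the condition is true and
both return the truth value of the condition; the only difference is the
extra formatted message. A minimal illustrative comparison (not the patch
itself):

	if (WARN_ON(!nr_grefs))		/* logs only file:line plus a backtrace */
		return -EINVAL;

	if (WARN(!nr_grefs, "%s: called with nr_ring_pages == 0\n", __func__))
		return -EINVAL;		/* same backtrace plus the custom message */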

Thank you very much.

Dongli Zhang


Re: [Xen-devel] [PATCH v5 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-08 Thread Dongli Zhang
oops. Please ignore this v5 patch.

I just realized Linus suggested in an old email not to use BUG()/BUG_ON() in the
code.

I will switch to the WARN() solution and resend again.

Sorry for the junk email.

Dongli Zhang

On 2019/1/8 下午4:15, Dongli Zhang wrote:
> The xenstore 'ring-page-order' is used globally for each blkback queue and
> therefore should be read from xenstore only once. However, it is obtained
> in read_per_ring_refs() which might be called multiple times during the
> initialization of each blkback queue.
> 
> If the blkfront is malicious and the 'ring-page-order' is set in different
> value by blkfront every time before blkback reads it, this may end up at
> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> xen_blkif_disconnect() when frontend is destroyed.
> 
> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> once.
> 
> Signed-off-by: Dongli Zhang 
> ---
> Changed since v1:
>   * change the order of xenstore read in read_per_ring_refs
>   * use xenbus_read_unsigned() in connect_ring()
> 
> Changed since v2:
>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
>   * avoid setting err as -EINVAL to remove extra one line of code
> 
> Changed since v3:
>   * exit at the beginning if !nr_grefs
>   * change the if statements to avoid test (err != 1) twice
>   * initialize a 'blkif' stack variable (refer to PATCH 1/2)
> 
> Changed since v4:
>   * use BUG_ON() when (nr_grefs == 0) to reminder the developer
>   * set err = -EINVAL before xenbus_dev_fatal()
> 
>  drivers/block/xen-blkback/xenbus.c | 69 
> ++
>  1 file changed, 40 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index a4aadac..f6146cd 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   int err, i, j;
>   struct xen_blkif *blkif = ring->blkif;
>   struct xenbus_device *dev = blkif->be->dev;
> - unsigned int ring_page_order, nr_grefs, evtchn;
> + unsigned int nr_grefs, evtchn;
>  
>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> &evtchn);
> @@ -936,43 +936,39 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   return err;
>   }
>  
> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> -   &ring_page_order);
> - if (err != 1) {
> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> &ring_ref[0]);
> + nr_grefs = blkif->nr_ring_pages;
> +
> + BUG_ON(!nr_grefs);
> +
> + for (i = 0; i < nr_grefs; i++) {
> + char ring_ref_name[RINGREF_NAME_LEN];
> +
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> + err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> +"%u", _ref[i]);
> +
>   if (err != 1) {
> + if (nr_grefs == 1)
> + break;
> +
>   err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
> + xenbus_dev_fatal(dev, err, "reading %s/%s",
> +  dir, ring_ref_name);
>   return err;
>   }
> - nr_grefs = 1;
> - } else {
> - unsigned int i;
> + }
>  
> - if (ring_page_order > xen_blkif_max_ring_order) {
> + if (err != 1) {
> + WARN_ON(nr_grefs != 1);
> +
> + err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
> +&ring_ref[0]);
> + if (err != 1) {
>   err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "%s/request %d ring page 
> order exceed max:%d",
> -  dir, ring_page_order,
> -  xen_blkif_max_ring_order);
> + xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
>   return err;
>   }
> -
> - nr_grefs = 1 << ring_page_order;
> - for (i = 0; i < nr_grefs; i++) {
> - char ring_ref_name[RINGREF_NAME_LEN];
> -
> - snprintf(r

[Xen-devel] [PATCH v5 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-08 Thread Dongli Zhang
The xenstore 'ring-page-order' is used globally for each blkback queue and
therefore should be read from xenstore only once. However, it is obtained
in read_per_ring_refs() which might be called multiple times during the
initialization of each blkback queue.

If the blkfront is malicious and sets 'ring-page-order' to a different
value every time before blkback reads it, this may end up hitting the
"WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
xen_blkif_disconnect() when the frontend is destroyed.

This patch reworks connect_ring() to read xenstore 'ring-page-order' only
once.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  * change the order of xenstore read in read_per_ring_refs
  * use xenbus_read_unsigned() in connect_ring()

Changed since v2:
  * simplify the condition check as "(err != 1 && nr_grefs > 1)"
  * avoid setting err as -EINVAL to remove an extra line of code

Changed since v3:
  * exit at the beginning if !nr_grefs
  * change the if statements to avoid test (err != 1) twice
  * initialize a 'blkif' stack variable (refer to PATCH 1/2)

Changed since v4:
  * use BUG_ON() when (nr_grefs == 0) to remind the developer
  * set err = -EINVAL before xenbus_dev_fatal()

 drivers/block/xen-blkback/xenbus.c | 69 ++
 1 file changed, 40 insertions(+), 29 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4aadac..f6146cd 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring *ring, 
const char *dir)
int err, i, j;
struct xen_blkif *blkif = ring->blkif;
struct xenbus_device *dev = blkif->be->dev;
-   unsigned int ring_page_order, nr_grefs, evtchn;
+   unsigned int nr_grefs, evtchn;
 
err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
  &evtchn);
@@ -936,43 +936,39 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
return err;
}
 
-   err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
- &ring_page_order);
-   if (err != 1) {
-   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
&ring_ref[0]);
+   nr_grefs = blkif->nr_ring_pages;
+
+   BUG_ON(!nr_grefs);
+
+   for (i = 0; i < nr_grefs; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
+   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
+  "%u", _ref[i]);
+
if (err != 1) {
+   if (nr_grefs == 1)
+   break;
+
err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
+   xenbus_dev_fatal(dev, err, "reading %s/%s",
+dir, ring_ref_name);
return err;
}
-   nr_grefs = 1;
-   } else {
-   unsigned int i;
+   }
 
-   if (ring_page_order > xen_blkif_max_ring_order) {
+   if (err != 1) {
+   WARN_ON(nr_grefs != 1);
+
+   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
+  &ring_ref[0]);
+   if (err != 1) {
err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "%s/request %d ring page 
order exceed max:%d",
-dir, ring_page_order,
-xen_blkif_max_ring_order);
+   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
return err;
}
-
-   nr_grefs = 1 << ring_page_order;
-   for (i = 0; i < nr_grefs; i++) {
-   char ring_ref_name[RINGREF_NAME_LEN];
-
-   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
-   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
-  "%u", _ref[i]);
-   if (err != 1) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/%s",
-dir, ring_ref_name);
-   return err;
-   }
-   }
}
-   blkif->nr_ring_pages = nr_grefs;
 
for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);

[Xen-devel] [PATCH v5 1/2] xen/blkback: add stack variable 'blkif' in connect_ring()

2019-01-08 Thread Dongli Zhang
As 'be->blkif' is used many times in connect_ring(), a stack variable
'blkif' is added to substitute for 'be->blkif'.

Suggested-by: Paul Durrant 
Signed-off-by: Dongli Zhang 
Reviewed-by: Paul Durrant 
Reviewed-by: Roger Pau Monné 
---
 drivers/block/xen-blkback/xenbus.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4bc74e..a4aadac 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -1023,6 +1023,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
 static int connect_ring(struct backend_info *be)
 {
struct xenbus_device *dev = be->dev;
+   struct xen_blkif *blkif = be->blkif;
unsigned int pers_grants;
char protocol[64] = "";
int err, i;
@@ -1033,25 +1034,25 @@ static int connect_ring(struct backend_info *be)
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
+   blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
err = xenbus_scanf(XBT_NIL, dev->otherend, "protocol",
   "%63s", protocol);
if (err <= 0)
strcpy(protocol, "unspecified, assuming default");
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_NATIVE))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
+   blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_X86_32))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_32;
+   blkif->blk_protocol = BLKIF_PROTOCOL_X86_32;
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_X86_64))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
+   blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
else {
xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
return -ENOSYS;
}
pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
   0);
-   be->blkif->vbd.feature_gnt_persistent = pers_grants;
-   be->blkif->vbd.overflow_max_grants = 0;
+   blkif->vbd.feature_gnt_persistent = pers_grants;
+   blkif->vbd.overflow_max_grants = 0;
 
/*
 * Read the number of hardware queues from frontend.
@@ -1067,16 +1068,16 @@ static int connect_ring(struct backend_info *be)
requested_num_queues, xenblk_max_queues);
return -ENOSYS;
}
-   be->blkif->nr_rings = requested_num_queues;
-   if (xen_blkif_alloc_rings(be->blkif))
+   blkif->nr_rings = requested_num_queues;
+   if (xen_blkif_alloc_rings(blkif))
return -ENOMEM;
 
pr_info("%s: using %d queues, protocol %d (%s) %s\n", dev->nodename,
-be->blkif->nr_rings, be->blkif->blk_protocol, protocol,
+blkif->nr_rings, blkif->blk_protocol, protocol,
 pers_grants ? "persistent grants" : "");
 
-   if (be->blkif->nr_rings == 1)
-   return read_per_ring_refs(&be->blkif->rings[0], dev->otherend);
+   if (blkif->nr_rings == 1)
+   return read_per_ring_refs(&blkif->rings[0], dev->otherend);
else {
xspathsize = strlen(dev->otherend) + xenstore_path_ext_size;
xspath = kmalloc(xspathsize, GFP_KERNEL);
@@ -1085,10 +1086,10 @@ static int connect_ring(struct backend_info *be)
return -ENOMEM;
}
 
-   for (i = 0; i < be->blkif->nr_rings; i++) {
+   for (i = 0; i < blkif->nr_rings; i++) {
memset(xspath, 0, xspathsize);
snprintf(xspath, xspathsize, "%s/queue-%u", 
dev->otherend, i);
-   err = read_per_ring_refs(&be->blkif->rings[i], xspath);
+   err = read_per_ring_refs(&blkif->rings[i], xspath);
if (err) {
kfree(xspath);
return err;
-- 
2.7.4



Re: [Xen-devel] [PATCH v4 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-07 Thread Dongli Zhang


On 01/07/2019 10:05 PM, Dongli Zhang wrote:
> 
> 
> On 01/07/2019 08:01 PM, Roger Pau Monné wrote:
>> On Mon, Jan 07, 2019 at 01:35:59PM +0800, Dongli Zhang wrote:
>>> The xenstore 'ring-page-order' is used globally for each blkback queue and
>>> therefore should be read from xenstore only once. However, it is obtained
>>> in read_per_ring_refs() which might be called multiple times during the
>>> initialization of each blkback queue.
>>>
>>> If the blkfront is malicious and the 'ring-page-order' is set in different
>>> value by blkfront every time before blkback reads it, this may end up at
>>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
>>> xen_blkif_disconnect() when frontend is destroyed.
>>>
>>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
>>> once.
>>>
>>> Signed-off-by: Dongli Zhang 
>>> ---
>>> Changed since v1:
>>>   * change the order of xenstore read in read_per_ring_refs
>>>   * use xenbus_read_unsigned() in connect_ring()
>>>
>>> Changed since v2:
>>>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
>>>   * avoid setting err as -EINVAL to remove extra one line of code
>>>
>>> Changed since v3:
>>>   * exit at the beginning if !nr_grefs
>>>   * change the if statements to avoid test (err != 1) twice
>>>   * initialize a 'blkif' stack variable (refer to PATCH 1/2)
>>>
>>>  drivers/block/xen-blkback/xenbus.c | 76 
>>> +-
>>>  1 file changed, 43 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>>> b/drivers/block/xen-blkback/xenbus.c
>>> index a4aadac..a2acbc9 100644
>>> --- a/drivers/block/xen-blkback/xenbus.c
>>> +++ b/drivers/block/xen-blkback/xenbus.c
>>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>>> *ring, const char *dir)
>>> int err, i, j;
>>> struct xen_blkif *blkif = ring->blkif;
>>> struct xenbus_device *dev = blkif->be->dev;
>>> -   unsigned int ring_page_order, nr_grefs, evtchn;
>>> +   unsigned int nr_grefs, evtchn;
>>>  
>>> err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
>>>   &evtchn);
>>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>>> *ring, const char *dir)
>>> return err;
>>> }
>>>  
>>> -   err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
>>> - &ring_page_order);
>>> -   if (err != 1) {
>>> -   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
>>> &ring_ref[0]);
>>> +   nr_grefs = blkif->nr_ring_pages;
>>> +
>>> +   if (unlikely(!nr_grefs))
>>> +   return -EINVAL;
>>
>> Is this even possible? AFAICT read_per_ring_refs will always be called
>> with blkif->nr_ring_pages != 0?
>>
>> If so, I would consider turning this into a BUG_ON/WARN_ON.
> 
> It used to be "WARN_ON(!nr_grefs);" in the v3 of the patch.
> 
> I would turn it into WARN_ON if it is fine with both Paul and you.

To clarify, I would use WARN_ON() before exit with -EINVAL (when
blkif->nr_ring_pages is 0).

Dongli Zhang

> 
> I prefer WARN_ON because it would remind the developers in the future that
> read_per_ring_refs() should be used only when blkif->nr_ring_pages != 0.
> 
>>
>>> +
>>> +   for (i = 0; i < nr_grefs; i++) {
>>> +   char ring_ref_name[RINGREF_NAME_LEN];
>>> +
>>> +   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
>>> +   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
>>> +  "%u", _ref[i]);
>>> +
>>> if (err != 1) {
>>> -   err = -EINVAL;
>>> -   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
>>> -   return err;
>>> -   }
>>> -   nr_grefs = 1;
>>> -   } else {
>>> -   unsigned int i;
>>> -
>>> -   if (ring_page_order > xen_blkif_max_ring_order) {
>>> -   err = -EINVAL;
>>> -   xenbus_dev_fatal(dev, err, "%s/req

Re: [Xen-devel] [PATCH v4 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-07 Thread Dongli Zhang


On 01/07/2019 08:01 PM, Roger Pau Monné wrote:
> On Mon, Jan 07, 2019 at 01:35:59PM +0800, Dongli Zhang wrote:
>> The xenstore 'ring-page-order' is used globally for each blkback queue and
>> therefore should be read from xenstore only once. However, it is obtained
>> in read_per_ring_refs() which might be called multiple times during the
>> initialization of each blkback queue.
>>
>> If the blkfront is malicious and the 'ring-page-order' is set in different
>> value by blkfront every time before blkback reads it, this may end up at
>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
>> xen_blkif_disconnect() when frontend is destroyed.
>>
>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
>> once.
>>
>> Signed-off-by: Dongli Zhang 
>> ---
>> Changed since v1:
>>   * change the order of xenstore read in read_per_ring_refs
>>   * use xenbus_read_unsigned() in connect_ring()
>>
>> Changed since v2:
>>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
>>   * avoid setting err as -EINVAL to remove extra one line of code
>>
>> Changed since v3:
>>   * exit at the beginning if !nr_grefs
>>   * change the if statements to avoid test (err != 1) twice
>>   * initialize a 'blkif' stack variable (refer to PATCH 1/2)
>>
>>  drivers/block/xen-blkback/xenbus.c | 76 
>> +-
>>  1 file changed, 43 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>> b/drivers/block/xen-blkback/xenbus.c
>> index a4aadac..a2acbc9 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>> *ring, const char *dir)
>>  int err, i, j;
>>  struct xen_blkif *blkif = ring->blkif;
>>  struct xenbus_device *dev = blkif->be->dev;
>> -unsigned int ring_page_order, nr_grefs, evtchn;
>> +unsigned int nr_grefs, evtchn;
>>  
>>  err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
>>&evtchn);
>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>> *ring, const char *dir)
>>  return err;
>>  }
>>  
>> -err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
>> -  &ring_page_order);
>> -if (err != 1) {
>> -err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
>> &ring_ref[0]);
>> +nr_grefs = blkif->nr_ring_pages;
>> +
>> +if (unlikely(!nr_grefs))
>> +return -EINVAL;
> 
> Is this even possible? AFAICT read_per_ring_refs will always be called
> with blkif->nr_ring_pages != 0?
> 
> If so, I would consider turning this into a BUG_ON/WARN_ON.

It used to be "WARN_ON(!nr_grefs);" in the v3 of the patch.

I would turn it into WARN_ON if it is fine with both Paul and you.

I prefer WARN_ON because it would remind the developers in the future that
read_per_ring_refs() should be used only when blkif->nr_ring_pages != 0.

> 
>> +
>> +for (i = 0; i < nr_grefs; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> +snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
>> +err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
>> +   "%u", _ref[i]);
>> +
>>  if (err != 1) {
>> -err = -EINVAL;
>> -xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
>> -return err;
>> -}
>> -nr_grefs = 1;
>> -} else {
>> -unsigned int i;
>> -
>> -if (ring_page_order > xen_blkif_max_ring_order) {
>> -err = -EINVAL;
>> -xenbus_dev_fatal(dev, err, "%s/request %d ring page 
>> order exceed max:%d",
>> - dir, ring_page_order,
>> - xen_blkif_max_ring_order);
>> -return err;
>> +if (nr_grefs == 1)
>> +break;
>> +
> 
> You need to either set err to EINVAL before calling xenbus_dev_fatal,
> or call xenbus_dev_fatal with EINVAL as the second parameter.
> 
>> + 

[Xen-devel] [PATCH v4 1/2] xen/blkback: add stack variable 'blkif' in connect_ring()

2019-01-06 Thread Dongli Zhang
As 'be->blkif' is used many times in connect_ring(), a stack variable
'blkif' is added to substitute for 'be->blkif'.

Suggested-by: Paul Durrant 
Signed-off-by: Dongli Zhang 
---
 drivers/block/xen-blkback/xenbus.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4bc74e..a4aadac 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -1023,6 +1023,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
 static int connect_ring(struct backend_info *be)
 {
struct xenbus_device *dev = be->dev;
+   struct xen_blkif *blkif = be->blkif;
unsigned int pers_grants;
char protocol[64] = "";
int err, i;
@@ -1033,25 +1034,25 @@ static int connect_ring(struct backend_info *be)
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
+   blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT;
err = xenbus_scanf(XBT_NIL, dev->otherend, "protocol",
   "%63s", protocol);
if (err <= 0)
strcpy(protocol, "unspecified, assuming default");
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_NATIVE))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
+   blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_X86_32))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_32;
+   blkif->blk_protocol = BLKIF_PROTOCOL_X86_32;
else if (0 == strcmp(protocol, XEN_IO_PROTO_ABI_X86_64))
-   be->blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
+   blkif->blk_protocol = BLKIF_PROTOCOL_X86_64;
else {
xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
return -ENOSYS;
}
pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
   0);
-   be->blkif->vbd.feature_gnt_persistent = pers_grants;
-   be->blkif->vbd.overflow_max_grants = 0;
+   blkif->vbd.feature_gnt_persistent = pers_grants;
+   blkif->vbd.overflow_max_grants = 0;
 
/*
 * Read the number of hardware queues from frontend.
@@ -1067,16 +1068,16 @@ static int connect_ring(struct backend_info *be)
requested_num_queues, xenblk_max_queues);
return -ENOSYS;
}
-   be->blkif->nr_rings = requested_num_queues;
-   if (xen_blkif_alloc_rings(be->blkif))
+   blkif->nr_rings = requested_num_queues;
+   if (xen_blkif_alloc_rings(blkif))
return -ENOMEM;
 
pr_info("%s: using %d queues, protocol %d (%s) %s\n", dev->nodename,
-be->blkif->nr_rings, be->blkif->blk_protocol, protocol,
+blkif->nr_rings, blkif->blk_protocol, protocol,
 pers_grants ? "persistent grants" : "");
 
-   if (be->blkif->nr_rings == 1)
-   return read_per_ring_refs(&be->blkif->rings[0], dev->otherend);
+   if (blkif->nr_rings == 1)
+   return read_per_ring_refs(&blkif->rings[0], dev->otherend);
else {
xspathsize = strlen(dev->otherend) + xenstore_path_ext_size;
xspath = kmalloc(xspathsize, GFP_KERNEL);
@@ -1085,10 +1086,10 @@ static int connect_ring(struct backend_info *be)
return -ENOMEM;
}
 
-   for (i = 0; i < be->blkif->nr_rings; i++) {
+   for (i = 0; i < blkif->nr_rings; i++) {
memset(xspath, 0, xspathsize);
snprintf(xspath, xspathsize, "%s/queue-%u", 
dev->otherend, i);
-   err = read_per_ring_refs(&be->blkif->rings[i], xspath);
+   err = read_per_ring_refs(&blkif->rings[i], xspath);
if (err) {
kfree(xspath);
return err;
-- 
2.7.4



[Xen-devel] [PATCH v4 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-06 Thread Dongli Zhang
The xenstore 'ring-page-order' is used globally for each blkback queue and
therefore should be read from xenstore only once. However, it is obtained
in read_per_ring_refs() which might be called multiple times during the
initialization of each blkback queue.

If the blkfront is malicious and sets 'ring-page-order' to a different
value every time before blkback reads it, this may end up hitting the
"WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
xen_blkif_disconnect() when the frontend is destroyed.

This patch reworks connect_ring() to read xenstore 'ring-page-order' only
once.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  * change the order of xenstore read in read_per_ring_refs
  * use xenbus_read_unsigned() in connect_ring()

Changed since v2:
  * simplify the condition check as "(err != 1 && nr_grefs > 1)"
  * avoid setting err as -EINVAL to remove an extra line of code

Changed since v3:
  * exit at the beginning if !nr_grefs
  * change the if statements to avoid test (err != 1) twice
  * initialize a 'blkif' stack variable (refer to PATCH 1/2)

 drivers/block/xen-blkback/xenbus.c | 76 +-
 1 file changed, 43 insertions(+), 33 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4aadac..a2acbc9 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring *ring, 
const char *dir)
int err, i, j;
struct xen_blkif *blkif = ring->blkif;
struct xenbus_device *dev = blkif->be->dev;
-   unsigned int ring_page_order, nr_grefs, evtchn;
+   unsigned int nr_grefs, evtchn;
 
err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
  &evtchn);
@@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
return err;
}
 
-   err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
- &ring_page_order);
-   if (err != 1) {
-   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
&ring_ref[0]);
+   nr_grefs = blkif->nr_ring_pages;
+
+   if (unlikely(!nr_grefs))
+   return -EINVAL;
+
+   for (i = 0; i < nr_grefs; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
+   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
+  "%u", _ref[i]);
+
if (err != 1) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
-   return err;
-   }
-   nr_grefs = 1;
-   } else {
-   unsigned int i;
-
-   if (ring_page_order > xen_blkif_max_ring_order) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "%s/request %d ring page 
order exceed max:%d",
-dir, ring_page_order,
-xen_blkif_max_ring_order);
-   return err;
+   if (nr_grefs == 1)
+   break;
+
+   xenbus_dev_fatal(dev, err, "reading %s/%s",
+dir, ring_ref_name);
+   return -EINVAL;
}
+   }
 
-   nr_grefs = 1 << ring_page_order;
-   for (i = 0; i < nr_grefs; i++) {
-   char ring_ref_name[RINGREF_NAME_LEN];
-
-   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
-   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
-  "%u", _ref[i]);
-   if (err != 1) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/%s",
-dir, ring_ref_name);
-   return err;
-   }
+   if (err != 1) {
+   WARN_ON(nr_grefs != 1);
+
+   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
+  &ring_ref[0]);
+   if (err != 1) {
+   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
+   return -EINVAL;
}
}
-   blkif->nr_ring_pages = nr_grefs;
 
for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
@@ -1031,6

Re: [Xen-devel] [PATCH v3 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-03 Thread Dongli Zhang
Ping?

Dongli Zhang

On 12/19/2018 09:23 PM, Dongli Zhang wrote:
> The xenstore 'ring-page-order' is used globally for each blkback queue and
> therefore should be read from xenstore only once. However, it is obtained
> in read_per_ring_refs() which might be called multiple times during the
> initialization of each blkback queue.
> 
> If the blkfront is malicious and the 'ring-page-order' is set in different
> value by blkfront every time before blkback reads it, this may end up at
> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> xen_blkif_disconnect() when frontend is destroyed.
> 
> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> once.
> 
> Signed-off-by: Dongli Zhang 
> ---
> Changed since v1:
>   * change the order of xenstore read in read_per_ring_refs
>   * use xenbus_read_unsigned() in connect_ring()
> 
> Changed since v2:
>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
>   * avoid setting err as -EINVAL to remove extra one line of code
> 
>  drivers/block/xen-blkback/xenbus.c | 74 
> +-
>  1 file changed, 41 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index a4bc74e..dfea3a4 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   int err, i, j;
>   struct xen_blkif *blkif = ring->blkif;
>   struct xenbus_device *dev = blkif->be->dev;
> - unsigned int ring_page_order, nr_grefs, evtchn;
> + unsigned int nr_grefs, evtchn;
>  
>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> &evtchn);
> @@ -936,43 +936,36 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   return err;
>   }
>  
> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> -   &ring_page_order);
> + nr_grefs = blkif->nr_ring_pages;
> + WARN_ON(!nr_grefs);
> +
> + for (i = 0; i < nr_grefs; i++) {
> + char ring_ref_name[RINGREF_NAME_LEN];
> +
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> + err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> +"%u", _ref[i]);
> +
> + if (err != 1 && nr_grefs > 1) {
> + xenbus_dev_fatal(dev, err, "reading %s/%s",
> +  dir, ring_ref_name);
> + return -EINVAL;
> + }
> +
> + if (err != 1)
> + break;
> + }
> +
>   if (err != 1) {
> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> &ring_ref[0]);
> + WARN_ON(nr_grefs != 1);
> +
> + err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
> +&ring_ref[0]);
>   if (err != 1) {
> - err = -EINVAL;
>   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
> - return err;
> - }
> - nr_grefs = 1;
> - } else {
> - unsigned int i;
> -
> - if (ring_page_order > xen_blkif_max_ring_order) {
> - err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "%s/request %d ring page 
> order exceed max:%d",
> -  dir, ring_page_order,
> -  xen_blkif_max_ring_order);
> - return err;
> - }
> -
> - nr_grefs = 1 << ring_page_order;
> - for (i = 0; i < nr_grefs; i++) {
> - char ring_ref_name[RINGREF_NAME_LEN];
> -
> - snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
> i);
> - err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> -"%u", _ref[i]);
> - if (err != 1) {
> - err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "reading %s/%s",
> -  dir, ring_ref_name);
> - return err;
> - }
> + return -EINVAL;
>   }
>

[Xen-devel] [PATCH v3 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-19 Thread Dongli Zhang
The xenstore 'ring-page-order' is used globally for each blkback queue and
therefore should be read from xenstore only once. However, it is obtained
in read_per_ring_refs() which might be called multiple times during the
initialization of each blkback queue.

If the blkfront is malicious and sets 'ring-page-order' to a different
value every time before blkback reads it, this may end up hitting the
"WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
xen_blkif_disconnect() when the frontend is destroyed.

This patch reworks connect_ring() to read xenstore 'ring-page-order' only
once.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  * change the order of xenstore read in read_per_ring_refs
  * use xenbus_read_unsigned() in connect_ring()

Changed since v2:
  * simplify the condition check as "(err != 1 && nr_grefs > 1)"
  * avoid setting err as -EINVAL to remove an extra line of code

 drivers/block/xen-blkback/xenbus.c | 74 +-
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4bc74e..dfea3a4 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring *ring, 
const char *dir)
int err, i, j;
struct xen_blkif *blkif = ring->blkif;
struct xenbus_device *dev = blkif->be->dev;
-   unsigned int ring_page_order, nr_grefs, evtchn;
+   unsigned int nr_grefs, evtchn;
 
err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
  &evtchn);
@@ -936,43 +936,36 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
return err;
}
 
-   err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
- &ring_page_order);
+   nr_grefs = blkif->nr_ring_pages;
+   WARN_ON(!nr_grefs);
+
+   for (i = 0; i < nr_grefs; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
+   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
+  "%u", _ref[i]);
+
+   if (err != 1 && nr_grefs > 1) {
+   xenbus_dev_fatal(dev, err, "reading %s/%s",
+dir, ring_ref_name);
+   return -EINVAL;
+   }
+
+   if (err != 1)
+   break;
+   }
+
if (err != 1) {
-   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
&ring_ref[0]);
+   WARN_ON(nr_grefs != 1);
+
+   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
+  &ring_ref[0]);
if (err != 1) {
-   err = -EINVAL;
xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
-   return err;
-   }
-   nr_grefs = 1;
-   } else {
-   unsigned int i;
-
-   if (ring_page_order > xen_blkif_max_ring_order) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "%s/request %d ring page 
order exceed max:%d",
-dir, ring_page_order,
-xen_blkif_max_ring_order);
-   return err;
-   }
-
-   nr_grefs = 1 << ring_page_order;
-   for (i = 0; i < nr_grefs; i++) {
-   char ring_ref_name[RINGREF_NAME_LEN];
-
-   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
-   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
-  "%u", _ref[i]);
-   if (err != 1) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/%s",
-dir, ring_ref_name);
-   return err;
-   }
+   return -EINVAL;
}
}
-   blkif->nr_ring_pages = nr_grefs;
 
for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
@@ -1030,6 +1023,7 @@ static int connect_ring(struct backend_info *be)
size_t xspathsize;
const size_t xenstore_path_ext_size = 11; /* sufficient for 
"/queue-NNN" */
unsigned int requested_num_queues = 0;
+   unsigned int ring_page_order;
 
pr_debug("%s %s\n&qu

Re: [Xen-devel] [PATCH v2 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-18 Thread Dongli Zhang


On 12/18/2018 11:13 PM, Roger Pau Monné wrote:
> On Tue, Dec 18, 2018 at 07:31:59PM +0800, Dongli Zhang wrote:
>> Hi Roger,
>>
>> On 12/18/2018 05:33 PM, Roger Pau Monné wrote:
>>> On Tue, Dec 18, 2018 at 08:55:38AM +0800, Dongli Zhang wrote:
>>>> The xenstore 'ring-page-order' is used globally for each blkback queue and
>>>> therefore should be read from xenstore only once. However, it is obtained
>>>> in read_per_ring_refs() which might be called multiple times during the
>>>> initialization of each blkback queue.
>>>>
>>>> If the blkfront is malicious and the 'ring-page-order' is set in different
>>>> value by blkfront every time before blkback reads it, this may end up at
>>>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
>>>> xen_blkif_disconnect() when frontend is destroyed.
>>>>
>>>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
>>>> once.
>>>>
>>>> Signed-off-by: Dongli Zhang 
>>>> ---
>>>> Changed since v1:
>>>>   * change the order of xenstore read in read_per_ring_refs(suggested by 
>>>> Roger Pau Monne)
>>>>   * use xenbus_read_unsigned() in connect_ring() (suggested by Roger Pau 
>>>> Monne)
>>>>
>>>>  drivers/block/xen-blkback/xenbus.c | 70 
>>>> ++
>>>>  1 file changed, 40 insertions(+), 30 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>>>> b/drivers/block/xen-blkback/xenbus.c
>>>> index a4bc74e..7178f0f 100644
>>>> --- a/drivers/block/xen-blkback/xenbus.c
>>>> +++ b/drivers/block/xen-blkback/xenbus.c
>>>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>>>> *ring, const char *dir)
>>>>int err, i, j;
>>>>struct xen_blkif *blkif = ring->blkif;
>>>>struct xenbus_device *dev = blkif->be->dev;
>>>> -  unsigned int ring_page_order, nr_grefs, evtchn;
>>>> +  unsigned int nr_grefs, evtchn;
>>>>  
>>>>err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
>>>>  &evtchn);
>>>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>>>> *ring, const char *dir)
>>>>return err;
>>>>}
>>>>  
>>>> -  err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
>>>> -&ring_page_order);
>>>> -  if (err != 1) {
>>>> -  err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
>>>> &ring_ref[0]);
>>>> -  if (err != 1) {
>>>> +  nr_grefs = blkif->nr_ring_pages;
>>>> +  WARN_ON(!nr_grefs);
>>>> +
>>>> +  for (i = 0; i < nr_grefs; i++) {
>>>> +  char ring_ref_name[RINGREF_NAME_LEN];
>>>> +
>>>> +  snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
>>>> +  err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
>>>> + "%u", _ref[i]);
>>>> +
>>>> +  if (err != 1 && (i || (!i && nr_grefs > 1))) {
>>>
>>> AFAICT the above condition can be simplified as "err != 1 &&
>>> nr_grefs".
>>>
>>>>err = -EINVAL;
>>>
>>> There's no point in setting err here...
>>>
>>>> -  xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
>>>> +  xenbus_dev_fatal(dev, err, "reading %s/%s",
>>>> +   dir, ring_ref_name);
>>>>return err;
>>>
>>> ...since you can just return -EINVAL (same applies to the other
>>> instance below).
>>
>> I would like to confirm if I would keep the err = -EINVAL in below because 
>> most
>> of the below code is copied from original implementation without 
>> modification.
>>
>> There is no err set by xenbus_read_unsigned().
> 
> Right, but instead of doing:
> 
> err = -EINVAL;
> return err;
> 
> You can just do:
> 
> return -EINVAL;
> 
> Which is one line shorter :).

However, for the "ring-page-order" case, the err used in xen

Re: [Xen-devel] [PATCH v2 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-18 Thread Dongli Zhang
Hi Roger,

On 12/18/2018 05:33 PM, Roger Pau Monné wrote:
> On Tue, Dec 18, 2018 at 08:55:38AM +0800, Dongli Zhang wrote:
>> The xenstore 'ring-page-order' is used globally for each blkback queue and
>> therefore should be read from xenstore only once. However, it is obtained
>> in read_per_ring_refs() which might be called multiple times during the
>> initialization of each blkback queue.
>>
>> If the blkfront is malicious and the 'ring-page-order' is set in different
>> value by blkfront every time before blkback reads it, this may end up at
>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
>> xen_blkif_disconnect() when frontend is destroyed.
>>
>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
>> once.
>>
>> Signed-off-by: Dongli Zhang 
>> ---
>> Changed since v1:
>>   * change the order of xenstore read in read_per_ring_refs(suggested by 
>> Roger Pau Monne)
>>   * use xenbus_read_unsigned() in connect_ring() (suggested by Roger Pau 
>> Monne)
>>
>>  drivers/block/xen-blkback/xenbus.c | 70 
>> ++
>>  1 file changed, 40 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/xenbus.c 
>> b/drivers/block/xen-blkback/xenbus.c
>> index a4bc74e..7178f0f 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>> *ring, const char *dir)
>>  int err, i, j;
>>  struct xen_blkif *blkif = ring->blkif;
>>  struct xenbus_device *dev = blkif->be->dev;
>> -unsigned int ring_page_order, nr_grefs, evtchn;
>> +unsigned int nr_grefs, evtchn;
>>  
>>  err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
>>&evtchn);
>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
>> *ring, const char *dir)
>>  return err;
>>  }
>>  
>> -err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
>> -  &ring_page_order);
>> -if (err != 1) {
>> -err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
>> &ring_ref[0]);
>> -if (err != 1) {
>> +nr_grefs = blkif->nr_ring_pages;
>> +WARN_ON(!nr_grefs);
>> +
>> +for (i = 0; i < nr_grefs; i++) {
>> +char ring_ref_name[RINGREF_NAME_LEN];
>> +
>> +snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
>> +err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
>> +   "%u", _ref[i]);
>> +
>> +if (err != 1 && (i || (!i && nr_grefs > 1))) {
> 
> AFAICT the above condition can be simplified as "err != 1 &&
> nr_grefs".
> 
>>  err = -EINVAL;
> 
> There's no point in setting err here...
> 
>> -xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
>> +xenbus_dev_fatal(dev, err, "reading %s/%s",
>> + dir, ring_ref_name);
>>  return err;
> 
> ...since you can just return -EINVAL (same applies to the other
> instance below).

I would like to confirm if I would keep the err = -EINVAL in below because most
of the below code is copied from original implementation without modification.

There is no err set by xenbus_read_unsigned().

+   ring_page_order = xenbus_read_unsigned(dev->otherend,
+  "ring-page-order", 0);
+
+   if (ring_page_order > xen_blkif_max_ring_order) {
+   err = -EINVAL;
+   xenbus_dev_fatal(dev, err,
+"requested ring page order %d exceed max:%d",
+ring_page_order,
+xen_blkif_max_ring_order);
+   return err;
+   }
+
+   be->blkif->nr_ring_pages = 1 << ring_page_order;
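
The asymmetry between the two xenbus helpers is the crux here; an
illustrative contrast (not driver code):

	/* xenbus_scanf() behaves like sscanf(): it returns the number of
	 * values parsed (1 expected here) or a negative errno, so 'err'
	 * can already hold a meaningful error code. */
	int err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u", &evtchn);
	if (err != 1)
		return err < 0 ? err : -EINVAL;

	/* xenbus_read_unsigned() never reports failure: an absent or
	 * unparsable node simply yields the supplied default (0 here),
	 * so any errno passed onward must be set by hand. */
	unsigned int order = xenbus_read_unsigned(dev->otherend,
						  "ring-page-order", 0);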


For the rest, I would do something like:

+   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
+  "%u", _ref[i]);
+
+   if (err != 1 && nr_grefs > 1) {
+   xenbus_dev_fatal(dev, err, "reading %s/%s",
+dir, ring_ref_name);
+   return -EINVAL;
+   }


Thank you very much!

Dongli Zhang


[Xen-devel] [PATCH v2 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-17 Thread Dongli Zhang
The xenstore 'ring-page-order' is used globally for each blkback queue and
therefore should be read from xenstore only once. However, it is obtained
in read_per_ring_refs() which might be called multiple times during the
initialization of each blkback queue.

If the blkfront is malicious and sets 'ring-page-order' to a different
value every time before blkback reads it, this may end up hitting the
"WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
xen_blkif_disconnect() when the frontend is destroyed.

This patch reworks connect_ring() to read xenstore 'ring-page-order' only
once.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  * change the order of xenstore read in read_per_ring_refs (suggested by Roger 
Pau Monne)
  * use xenbus_read_unsigned() in connect_ring() (suggested by Roger Pau Monne)

 drivers/block/xen-blkback/xenbus.c | 70 ++
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c 
b/drivers/block/xen-blkback/xenbus.c
index a4bc74e..7178f0f 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring *ring, 
const char *dir)
int err, i, j;
struct xen_blkif *blkif = ring->blkif;
struct xenbus_device *dev = blkif->be->dev;
-   unsigned int ring_page_order, nr_grefs, evtchn;
+   unsigned int nr_grefs, evtchn;
 
err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
  &evtchn);
@@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
*ring, const char *dir)
return err;
}
 
-   err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
- &ring_page_order);
-   if (err != 1) {
-   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
&ring_ref[0]);
-   if (err != 1) {
+   nr_grefs = blkif->nr_ring_pages;
+   WARN_ON(!nr_grefs);
+
+   for (i = 0; i < nr_grefs; i++) {
+   char ring_ref_name[RINGREF_NAME_LEN];
+
+   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
+   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
+  "%u", _ref[i]);
+
+   if (err != 1 && (i || (!i && nr_grefs > 1))) {
err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
+   xenbus_dev_fatal(dev, err, "reading %s/%s",
+dir, ring_ref_name);
return err;
}
-   nr_grefs = 1;
-   } else {
-   unsigned int i;
 
-   if (ring_page_order > xen_blkif_max_ring_order) {
+   if (err != 1)
+   break;
+   }
+
+   if (err != 1) {
+   WARN_ON(nr_grefs != 1);
+
+   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
+  &ring_ref[0]);
+   if (err != 1) {
err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "%s/request %d ring page 
order exceed max:%d",
-dir, ring_page_order,
-xen_blkif_max_ring_order);
+   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
return err;
}
-
-   nr_grefs = 1 << ring_page_order;
-   for (i = 0; i < nr_grefs; i++) {
-   char ring_ref_name[RINGREF_NAME_LEN];
-
-   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
i);
-   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
-  "%u", _ref[i]);
-   if (err != 1) {
-   err = -EINVAL;
-   xenbus_dev_fatal(dev, err, "reading %s/%s",
-dir, ring_ref_name);
-   return err;
-   }
-   }
}
-   blkif->nr_ring_pages = nr_grefs;
 
for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
@@ -1030,6 +1025,7 @@ static int connect_ring(struct backend_info *be)
size_t xspathsize;
const size_t xenstore_path_ext_size = 11; /* sufficient for 
"/queue-NNN" */
unsigned int requested_num_queues = 0;
+   unsigned int ring_page_order;
 
pr_debug("%s %s\n", __func__, dev->otherend);
 
@@ -

Re: [Xen-devel] [PATCH 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-07 Thread Dongli Zhang


On 12/07/2018 11:15 PM, Paul Durrant wrote:
>> -Original Message-
>> From: Dongli Zhang [mailto:dongli.zh...@oracle.com]
>> Sent: 07 December 2018 15:10
>> To: Paul Durrant ; linux-ker...@vger.kernel.org;
>> xen-devel@lists.xenproject.org; linux-bl...@vger.kernel.org
>> Cc: ax...@kernel.dk; Roger Pau Monne ;
>> konrad.w...@oracle.com
>> Subject: Re: [Xen-devel] [PATCH 1/1] xen/blkback: rework connect_ring() to
>> avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront
>>
>> Hi Paul,
>>
>> On 12/07/2018 05:39 PM, Paul Durrant wrote:
>>>> -Original Message-
>>>> From: Xen-devel [mailto:xen-devel-boun...@lists.xenproject.org] On
>> Behalf
>>>> Of Dongli Zhang
>>>> Sent: 07 December 2018 04:18
>>>> To: linux-ker...@vger.kernel.org; xen-devel@lists.xenproject.org;
>> linux-
>>>> bl...@vger.kernel.org
>>>> Cc: ax...@kernel.dk; Roger Pau Monne ;
>>>> konrad.w...@oracle.com
>>>> Subject: [Xen-devel] [PATCH 1/1] xen/blkback: rework connect_ring() to
>>>> avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront
>>>>
>>>> The xenstore 'ring-page-order' is used globally for each blkback queue
>> and
>>>> therefore should be read from xenstore only once. However, it is
>> obtained
>>>> in read_per_ring_refs() which might be called multiple times during the
>>>> initialization of each blkback queue.
>>>
>>> That is certainly sub-optimal.
>>>
>>>>
>>>> If the blkfront is malicious and the 'ring-page-order' is set in
>> different
>>>> value by blkfront every time before blkback reads it, this may end up
>> at
>>>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));"
>> in
>>>> xen_blkif_disconnect() when frontend is destroyed.
>>>
>>> I can't actually see what useful function blkif->nr_ring_pages actually
>> performs any more. Perhaps you could actually get rid of it?
>>
>> How about we keep it? Other than reading from xenstore, it is the only
>> place for
>> us to know the value of 'ring-page-order'.
>>
>> This helps calculate the initialized number of elements on all
>> xen_blkif_ring->pending_free lists. That's how "WARN_ON(i !=
>> (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" is used to double
>> check if
>> there is no leak of elements reclaimed from all xen_blkif_ring-
>>> pending_free.
>>
>> It helps vmcore analysis as well. Given blkif->nr_ring_pages, we would be
>> able
>> to double check if the number of ring buffer slots is correct.
>>
>> I could not see any drawback in leaving blkif->nr_ring_pages in the code.
> 
> No, there's no drawback apart from space, but apart from that cross-check 
> and, as you say, core analysis it seems to have little value.
> 
>   Paul

I will not remove blkif->nr_ring_pages and leave the current patch waiting for
review.

Dongli Zhang



> 
>>
>> Dongli Zhang
>>
>>>
>>>>
>>>> This patch reworks connect_ring() to read xenstore 'ring-page-order'
>> only
>>>> once.
>>>
>>> That is certainly a good thing :-)
>>>
>>>   Paul
>>>
>>>>
>>>> Signed-off-by: Dongli Zhang 
>>>> ---
>>>>  drivers/block/xen-blkback/xenbus.c | 49 --
>> ---
>>>> -
>>>>  1 file changed, 31 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-
>>>> blkback/xenbus.c
>>>> index a4bc74e..4a8ce20 100644
>>>> --- a/drivers/block/xen-blkback/xenbus.c
>>>> +++ b/drivers/block/xen-blkback/xenbus.c
>>>> @@ -919,14 +919,15 @@ static void connect(struct backend_info *be)
>>>>  /*
>>>>   * Each ring may have multi pages, depends on "ring-page-order".
>>>>   */
>>>> -static int read_per_ring_refs(struct xen_blkif_ring *ring, const char
>>>> *dir)
>>>> +static int read_per_ring_refs(struct xen_blkif_ring *ring, const char
>>>> *dir,
>>>> +bool use_ring_page_order)
>>>>  {
>>>>unsigned int ring_ref[XENBUS_MAX_RING_GRANTS];
>>>>struct pending_req *req, *n;
>>>>int err, i, j;
>>>>struct xen_blkif *blkif = ring->blkif;
>>&

Re: [Xen-devel] [PATCH 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-07 Thread Dongli Zhang
Hi Paul,

On 12/07/2018 05:39 PM, Paul Durrant wrote:
>> -Original Message-
>> From: Xen-devel [mailto:xen-devel-boun...@lists.xenproject.org] On Behalf
>> Of Dongli Zhang
>> Sent: 07 December 2018 04:18
>> To: linux-ker...@vger.kernel.org; xen-devel@lists.xenproject.org; linux-
>> bl...@vger.kernel.org
>> Cc: ax...@kernel.dk; Roger Pau Monne ;
>> konrad.w...@oracle.com
>> Subject: [Xen-devel] [PATCH 1/1] xen/blkback: rework connect_ring() to
>> avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront
>>
>> The xenstore 'ring-page-order' is used globally for each blkback queue and
>> therefore should be read from xenstore only once. However, it is obtained
>> in read_per_ring_refs() which might be called multiple times during the
>> initialization of each blkback queue.
> 
> That is certainly sub-optimal.
> 
>>
>> If the blkfront is malicious and the 'ring-page-order' is set in different
>> value by blkfront every time before blkback reads it, this may end up at
>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
>> xen_blkif_disconnect() when frontend is destroyed.
> 
> I can't actually see what useful function blkif->nr_ring_pages actually 
> performs any more. Perhaps you could actually get rid of it?

How about we keep it? Other than reading from xenstore, it is the only place for
us to know the value of 'ring-page-order'.

This helps calculate the initialized number of elements on all
xen_blkif_ring->pending_free lists. That's how "WARN_ON(i !=
(XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" is used to double check if
there is no leak of elements reclaimed from all xen_blkif_ring->pending_free.
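
A rough sketch of the accounting that check performs at disconnect time
(simplified; the real code also frees each request's indirect segments and
pages):

	unsigned int i = 0;
	struct pending_req *req, *n;

	/* Every pending_req handed back to pending_free is counted... */
	list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
		list_del(&req->free_list);
		kfree(req);
		i++;
	}
	/* ...and the total must match what connect time allocated. */
	WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));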

It helps vmcore analysis as well. Given blkif->nr_ring_pages, we would be able
to double check if the number of ring buffer slots is correct.

I could not see any drawback in leaving blkif->nr_ring_pages in the code.

Dongli Zhang

> 
>>
>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
>> once.
> 
> That is certainly a good thing :-)
> 
>   Paul
> 
>>
>> Signed-off-by: Dongli Zhang 
>> ---
>>  drivers/block/xen-blkback/xenbus.c | 49 -
>> -
>>  1 file changed, 31 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-
>> blkback/xenbus.c
>> index a4bc74e..4a8ce20 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -919,14 +919,15 @@ static void connect(struct backend_info *be)
>>  /*
>>   * Each ring may have multi pages, depends on "ring-page-order".
>>   */
>> -static int read_per_ring_refs(struct xen_blkif_ring *ring, const char
>> *dir)
>> +static int read_per_ring_refs(struct xen_blkif_ring *ring, const char
>> *dir,
>> +  bool use_ring_page_order)
>>  {
>>  unsigned int ring_ref[XENBUS_MAX_RING_GRANTS];
>>  struct pending_req *req, *n;
>>  int err, i, j;
>>  struct xen_blkif *blkif = ring->blkif;
>>  struct xenbus_device *dev = blkif->be->dev;
>> -unsigned int ring_page_order, nr_grefs, evtchn;
>> +unsigned int nr_grefs, evtchn;
>>
>>  err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
>>&evtchn);
>> @@ -936,28 +937,18 @@ static int read_per_ring_refs(struct xen_blkif_ring
>> *ring, const char *dir)
>>  return err;
>>  }
>>
>> -err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
>> -  &ring_page_order);
>> -if (err != 1) {
>> +nr_grefs = blkif->nr_ring_pages;
>> +
>> +if (!use_ring_page_order) {
>>  err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
>> &ring_ref[0]);
>>  if (err != 1) {
>>  err = -EINVAL;
>>  xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
>>  return err;
>>  }
>> -nr_grefs = 1;
>>  } else {
>>  unsigned int i;
>>
>> -if (ring_page_order > xen_blkif_max_ring_order) {
>> -err = -EINVAL;
>> -xenbus_dev_fatal(dev, err, "%s/request %d ring page
>> order exceed max:%d",
>> - dir, ring_page_order,
>> - xen_blkif_m
