Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

2023-10-19 Thread Vitaly Kuznetsov
Dongli Zhang  writes:

> As mentioned in the linux kernel development document, "sched_clock() is
> used for scheduling and timestamping". While there is a default native
> implementation, many paravirtualizations have their own implementations.
>
> For KVM, it uses kvm_sched_clock_read() and there is no way to disable only
> KVM's sched_clock. The "no-kvmclock" parameter disables all
> paravirtualized kvmclock features.
>
>  94 static inline void kvm_sched_clock_init(bool stable)
>  95 {
>  96 if (!stable)
>  97 clear_sched_clock_stable();
>  98 kvm_sched_clock_offset = kvm_clock_read();
>  99 paravirt_set_sched_clock(kvm_sched_clock_read);
> 100
> 101 pr_info("kvm-clock: using sched offset of %llu cycles",
> 102 kvm_sched_clock_offset);
> 103
> 104 BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> 105 sizeof(((struct pvclock_vcpu_time_info 
> *)NULL)->system_time));
> 106 }
>
> There is a known issue that kvmclock may drift during vCPU hotplug [1].
> Although a temporary fix is available [2], we may need a way to disable pv
> sched_clock. Nowadays, the TSC is more stable and has less performance
> overhead than kvmclock.
>
> This is to propose to introduce a global param to disable pv sched_clock
> for all paravirtualizations.
>
> Please suggest and comment if other options are better:
>
> 1. Global param (this RFC patch).
>
> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>
> Indeed I like the 2nd method.
>
> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>
> 4. Remove and clean up pv sched_clock, and always use pv_sched_clock() for
> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> want to keep the pv sched_clock.

Normally, it should be up to the hypervisor to tell the guest which
clock to use, i.e. whether the TSC is reliable or not. Let me put my question
this way: if the TSC on the particular host is good for everything, why
does the hypervisor advertise 'kvmclock' to its guests? If for some
'historical reasons' we can't revoke features, we can always introduce a
new PV feature bit saying that TSC is preferred.

1) A global param doesn't sound like a good idea to me: chances are that
people will be setting it in their guest images to work around problems
on one hypervisor (or, rather, on one public cloud which is too lazy to
fix their hypervisor) while simultaneously creating problems on another.

2) A KVM-specific parameter can work, but as KVM's sched_clock is the same
as kvmclock, I'm not convinced it actually makes sense to separate the
two. Like, if sched_clock is known to be bad but the TSC is good, why do we
need to use the PV clock at all? Having a parameter for debugging purposes
may be OK though...

3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
(HV_ACCESS_TSC_INVARIANT) and not the architectural
CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the latter on
all hypervisors.

4) Personally, I'm not sure that relying on 'TSC is crap' detection is
100% reliable. I can imagine cases where we can't detect the fact that,
while synchronized across CPUs and not going backwards, the TSC is, for
example, ticking with an unstable frequency, and the PV sched clock is
supposed to give the right correction (all of them are rdtsc()-based
anyway, aren't they?).
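
(Just to illustrate how small option 2 could be: a hypothetical KVM-specific
opt-out modelled on VMware's "no-vmw-sched-clock". The parameter name and its
placement below are invented for the sketch; kvmclock would still be registered
as a clocksource.)

/* Sketch only, not a real patch: a "no-kvm-sched-clock" early param. */
static bool kvm_sched_clock_disabled __initdata;

static int __init parse_no_kvm_sched_clock(char *arg)
{
        kvm_sched_clock_disabled = true;
        return 0;
}
early_param("no-kvm-sched-clock", parse_no_kvm_sched_clock);

static inline void kvm_sched_clock_init(bool stable)
{
        /* Keep the native (TSC-based) sched_clock if the user asked for it. */
        if (kvm_sched_clock_disabled)
                return;

        if (!stable)
                clear_sched_clock_stable();
        kvm_sched_clock_offset = kvm_clock_read();
        paravirt_set_sched_clock(kvm_sched_clock_read);
}

Whether such a knob should exist at all is exactly the question above; this is
only meant to show how little code option 2 needs.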

>
> Introducing a param may be easier to backport to old kernel versions.
>
> References:
> [1] 
> https://lore.kernel.org/all/20230926230649.67852-1-dongli.zh...@oracle.com/
> [2] https://lore.kernel.org/all/20231018195638.1898375-1-sea...@google.com/
> [3] 
> https://lore.kernel.org/all/20231002211651.ga3...@noisy.programming.kicks-ass.net/
>
> Thank you very much for the suggestion!
>
> Signed-off-by: Dongli Zhang 
> ---
>  arch/x86/include/asm/paravirt.h |  2 +-
>  arch/x86/kernel/kvmclock.c  | 12 +++-
>  arch/x86/kernel/paravirt.c  | 18 +-
>  3 files changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 6c8ff12140ae..f36edf608b6b 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
>  DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
>  DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
>  
> -void paravirt_set_sched_clock(u64 (*func)(void));
> +int paravirt_set_sched_clock(u64 (*func)(void));
>  
>  static __always_inline u64 paravirt_sched_clock(void)
>  {
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..0b8bf5677d44 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
>  
>  static inline void kvm_sched_clock_init(bool stable)
>  {
> - if (!stable)
> - clear_sched_clock_stable();
>   kvm_sched_clock_offset = kvm_clock_read();
> - 

Re: Making KMSAN compatible with paravirtualization

2022-11-11 Thread Vitaly Kuznetsov
Alexander Potapenko  writes:

> Hi,
>
> While investigating KMSAN's incompatibilities with the default Ubuntu
> config (https://github.com/google/kmsan/issues/89#issuecomment-1310702949),
> I figured out that a kernel won't boot with both CONFIG_KMSAN=y and
> CONFIG_XEN_PV=y.
>
> In particular, it may crash in load_percpu_segment():
>
> __loadsegment_simple(gs, 0);
> wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
>
> Here the value of %gs between __loadsegment_simple() and wrmsrl() is
> zero, so when KMSAN's __msan_get_context_state() instrumentation
> function is called before the actual WRMSR instruction is performed,
> it will attempt to access percpu data and crash.
>
> Unless instructed otherwise (by noinstr or __no_sanitize_memory on the
> source level, or by KMSAN_SANITIZE := n on the Makefile level), KMSAN
> inserts instrumentation at function prologue for every non-inlined
> function, including native_write_msr().
>
> Marking native_write_msr() noinstr actually makes the kernel boot for
> me, but I am not sure if this is enough. In fact we'll need to fix
> every situation in which instrumentation code may be called with
> invalid %gs value. Do you think this is feasible? Overall, should we
> care about KMSAN working with paravirtualization?
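
(For readers unfamiliar with the opt-outs mentioned above, a minimal
illustration; the helper name is made up and this is not the actual fix:)

/*
 * KMSAN does not instrument functions marked noinstr (or
 * __no_sanitize_memory), so a helper like this can run while MSR_GS_BASE
 * still points at a stale %gs base without touching KMSAN's per-cpu
 * shadow state.
 */
noinstr void write_gs_base_uninstrumented(unsigned long base)
{
        native_wrmsrl(MSR_GS_BASE, base);
}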

I think XEN PV is really special, let's Cc: xen-devel@ first.

-- 
Vitaly




Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

2022-10-04 Thread Vitaly Kuznetsov
Nadav Amit  writes:

> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov  wrote:
>
>> Not my but rather the PCI maintainer's call, but IMHO dropping 'const' is
>> better; introducing a new global var is our 'last resort' and should be
>> avoided whenever possible. Alternatively, you can add a
>> raw_pci_ext_ops_preferred() function checking something within 'struct
>> hypervisor_x86', but I'm unsure if it's better.
>> 
>> Also, please check Alex' question/suggestion.
>
> Here is my take (and Ajay knows probably more than me):
>
> Looking briefly at MCFG, I do not see a clean way of using the ACPI table.
> The two options are either to use a reserved field (which who knows, might
> be used one day) or some OEM ID. I am also not familiar with
> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>
> Anyhow, I understand (though I do not relate to) the objection to a new global
> variable. How about explicitly calling this hardware bug a “bug” and using
> the proper infrastructure? Calling it explicitly a bug may even push whoever
> can to resolve it.
>
> IOW, how about doing something along the lines of (not tested):
>

Works for me. Going forward, the intention should be to also clear the
bug on other x86 hypervisors: e.g. we test modern Hyper-V versions and,
if MMIO works well, clear it there; for modern QEMU/KVM setups, if MMIO
works, introduce a feature bit somewhere and also clear the bug in
the guest when the bit is set.
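
(A rough sketch of the bug-bit approach, for illustration only; the bug name,
its number and the helpers below are invented here and are not existing kernel
defines:)

/* Hypothetical bug bit; the name and number are made up for this sketch. */
#define X86_BUG_PIO_PCI_CONFIG  X86_BUG(28)

static void __init pci_arch_quirks(void)
{
        /*
         * Keep the historical "prefer PIO" behaviour on bare metal; a
         * hypervisor-specific init path (VMware today, Hyper-V/KVM once
         * verified) simply would not set the bug.
         */
        if (hypervisor_is_type(X86_HYPER_NATIVE))
                setup_force_cpu_bug(X86_BUG_PIO_PCI_CONFIG);
}

static bool prefer_mmio_config(void)
{
        return !boot_cpu_has_bug(X86_BUG_PIO_PCI_CONFIG);
}

raw_pci_read()/raw_pci_write() would then consult prefer_mmio_config() before
falling back to the port I/O raw_pci_ops.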

-- 
Vitaly




Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

2022-10-03 Thread Vitaly Kuznetsov
Ajay Kaher  writes:

>> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov"  wrote:
>>>
>>> Thanks Vitaly for your response.
>>>
>>> 1. we have multiple objects of struct pci_raw_ops; 2. adding a 'priority'
>>> field to struct pci_raw_ops doesn't seem appropriate, as we need to decide
>>> which object of struct pci_raw_ops has to be used, not something within
>>> struct pci_raw_ops.
>>
>> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
>> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
>> something like (completely untested):
>>
>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>> index 70533fdcbf02..fb8270fa6c78 100644
>> --- a/arch/x86/include/asm/pci_x86.h
>> +++ b/arch/x86/include/asm/pci_x86.h
>> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
>> extern bool mp_should_keep_irq(struct device *dev);
>>
>> struct pci_raw_ops {
>> +   int rating;
>>  int (*read)(unsigned int domain, unsigned int bus, unsigned int 
>> devfn,
>>int reg, int len, u32 *val);
>>  int (*write)(unsigned int domain, unsigned int bus, unsigned int 
>> devfn,
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb798603201..e9965fd11576 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 *val)
>> {
>> -   if (domain == 0 && reg < 256 && raw_pci_ops)
>> +   if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +   (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= 
>> raw_pci_ops->rating))
>> return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>> if (raw_pci_ext_ops)
>> return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, 
>> val);
>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, 
>> unsigned int devfn,
>>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 val)
>> {
>> -   if (domain == 0 && reg < 256 && raw_pci_ops)
>> +   if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +   (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= 
>> raw_pci_ops->rating))
>> return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>  if (raw_pci_ext_ops)
>> return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, 
>> val);
>>
>> and then somewhere in Vmware hypervisor initialization code
>> (arch/x86/kernel/cpu/vmware.c) you do
>>
>>  raw_pci_ext_ops->rating = 100;
>
> Thanks Vitaly, for your review and helping us to improve the code.
>
> I was working to make the changes as you suggested, but before sending v3 I
> would like to discuss the following:
>
> If we add 'rating' within struct pci_raw_ops then we can't have pci_mmcfg as
> const, and the following change is a must in arch/x86/pci/mmconfig_64.c:
>
> -const struct pci_raw_ops pci_mmcfg = {
> +struct pci_raw_ops pci_mmcfg = {
>   .read = pci_mmcfg_read,
>   .write =pci_mmcfg_write,
> };
>
> So to avoid this change, is it fine to have global bool 
> prefer_raw_pci_ext_ops?
>
> And raw_pci_read() will have following change:
>
> - if (domain == 0 && reg < 256 && raw_pci_ops)
> + if (domain == 0 && reg < 256 && raw_pci_ops &&
> +  (!prefer_raw_pci_ext_ops ||  !raw_pci_ext_ops)
>

Not my but rather the PCI maintainer's call, but IMHO dropping 'const' is
better; introducing a new global var is our 'last resort' and should be
avoided whenever possible. Alternatively, you can add a
raw_pci_ext_ops_preferred() function checking something within 'struct
hypervisor_x86', but I'm unsure if it's better.
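
(A sketch of what such a helper could look like; the 'prefer_ext_pci_ops' flag
below is hypothetical and does not exist today:)

/*
 * Sketch only: let the detected hypervisor advertise that ECAM/MMCONFIG
 * should win over port I/O.
 */
static bool raw_pci_ext_ops_preferred(void)
{
        return !hypervisor_is_type(X86_HYPER_NATIVE) &&
               x86_hyper_prefer_ext_pci_ops;     /* hypothetical flag */
}

int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
                 int reg, int len, u32 *val)
{
        if (domain == 0 && reg < 256 && raw_pci_ops &&
            !(raw_pci_ext_ops && raw_pci_ext_ops_preferred()))
                return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
        if (raw_pci_ext_ops)
                return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
        return -EINVAL;
}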

Also, please check Alex' question/suggestion.

...

-- 
Vitaly




Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

2022-09-13 Thread Vitaly Kuznetsov
Ajay Kaher  writes:

> Note: Corrected the Subject.
>
>> On 07/09/22, 8:50 PM, "Vitaly Kuznetsov"  wrote:
>>
>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>> index ddb7986..1e5a8f7 100644
>>> --- a/arch/x86/pci/common.c
>>> +++ b/arch/x86/pci/common.c
>>> @@ -20,6 +20,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>
>>>  unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | 
>>> PCI_PROBE_CONF2 |
>>>   PCI_PROBE_MMCONF;
>>> @@ -57,14 +58,58 @@ int raw_pci_write(unsigned int domain, unsigned int 
>>> bus, unsigned int devfn,
>>>   return -EINVAL;
>>>  }
>>>
>>> +#ifdef CONFIG_HYPERVISOR_GUEST
>>> +static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned 
>>> int devfn,
>>> + int reg, int len, u32 *val)
>>> +{
>>> + if (raw_pci_ext_ops)
>>> + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, 
>>> val);
>>> + if (domain == 0 && reg < 256 && raw_pci_ops)
>>> + return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>> + return -EINVAL;
>>> +}
>>> +
>>> +static int vm_raw_pci_write(unsigned int domain, unsigned int bus, 
>>> unsigned int devfn,
>>> + int reg, int len, u32 val)
>>> +{
>>> + if (raw_pci_ext_ops)
>>> + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, 
>>> val);
>>> + if (domain == 0 && reg < 256 && raw_pci_ops)
>>> + return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>> + return -EINVAL;
>>> +}
>>
>> These look exactly like raw_pci_read()/raw_pci_write() but with inverted
>> priority. We could've added a parameter but to be more flexible, I'd
>> suggest we add a 'priority' field to 'struct pci_raw_ops' and make
>> raw_pci_read()/raw_pci_write() check it before deciding what to use
>> first. To be on the safe side, you can leave raw_pci_ops's priority
>> higher than raw_pci_ext_ops's by default and only tweak it in
>> arch/x86/kernel/cpu/vmware.c
>
> Thanks Vitaly for your response.
>
> 1. we have multiple objects of struct pci_raw_ops; 2. adding a 'priority' field
> to struct pci_raw_ops doesn't seem appropriate, as we need to decide which
> object of struct pci_raw_ops has to be used, not something within
> struct pci_raw_ops.

I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
something like (completely untested):

diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
index 70533fdcbf02..fb8270fa6c78 100644
--- a/arch/x86/include/asm/pci_x86.h
+++ b/arch/x86/include/asm/pci_x86.h
@@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
 extern bool mp_should_keep_irq(struct device *dev);
 
 struct pci_raw_ops {
+   int rating;
int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 *val);
int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index ddb798603201..e9965fd11576 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
 int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 *val)
 {
-   if (domain == 0 && reg < 256 && raw_pci_ops)
+   if (domain == 0 && reg < 256 && raw_pci_ops &&
+   (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= 
raw_pci_ops->rating))
return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
if (raw_pci_ext_ops)
return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
@@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, 
unsigned int devfn,
 int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 val)
 {
-   if (domain == 0 && reg < 256 && raw_pci_ops)
+   if (domain == 0 && reg < 256 && raw_pci_ops &&
+   (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= 
raw_pci_

Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on VMware hypervisor

2022-09-07 Thread Vitaly Kuznetsov
Ajay Kaher  writes:

> During boot there are many PCI config reads; these can be performed either
> using Port I/O instructions (PIO) or memory-mapped I/O (MMIO).
>
> PIO is less efficient than MMIO: it requires twice as many PCI accesses, and
> PIO instructions are serializing. As a result, MMIO should be preferred over
> PIO when possible.
>
> Virtual Machine test result using VMware hypervisor
> 1 hundred thousand reads using raw_pci_read() took:
> PIO: 12.809 seconds
> MMIO: 8.517 seconds (~33.5% faster than PIO)
>
> Currently, when these reads are performed by a virtual machine, they all
> cause a VM-exit, and therefore each one of them induces a considerable
> overhead.
>
> This overhead can be further improved, by mapping MMIO region of virtual
> machine to memory area that holds the values that the “emulated hardware”
> is supposed to return. The memory region is mapped as "read-only” in the
> NPT/EPT, so reads from these regions would be treated as regular memory
> reads. Writes would still be trapped and emulated by the hypervisor.
>
> Virtual Machine test result with above changes in VMware hypervisor
> 1 hundred thousand reads using raw_pci_read() took:
> PIO: 12.809 seconds
> MMIO: 0.010 seconds
>
> This helps to reduce virtual machine PCI scan and initialization time by
> ~65%. In our case it reduced to ~18 mSec from ~55 mSec.
>
> MMIO is also faster than PIO on bare-metal systems, but due to some bugs
> with legacy hardware and the smaller gains on bare-metal, it seems prudent
> not to change bare-metal behavior.

Out of curiosity, are we sure MMIO *always* works for other hypervisors
besides VMware? Various Hyper-V versions can probably be tested (were
they?) but with KVM it's much harder, as PCI is emulated in the VMM and
there's certainly more than one in existence...

>
> Signed-off-by: Ajay Kaher 
> ---
> v1 -> v2:
> Limit changes to apply only to VMs [Matthew W.]
> ---
>  arch/x86/pci/common.c | 45 +
>  1 file changed, 45 insertions(+)
>
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index ddb7986..1e5a8f7 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | PCI_PROBE_CONF2 |
>   PCI_PROBE_MMCONF;
> @@ -57,14 +58,58 @@ int raw_pci_write(unsigned int domain, unsigned int bus, 
> unsigned int devfn,
>   return -EINVAL;
>  }
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned 
> int devfn,
> + int reg, int len, u32 *val)
> +{
> + if (raw_pci_ext_ops)
> + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
> + if (domain == 0 && reg < 256 && raw_pci_ops)
> + return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> + return -EINVAL;
> +}
> +
> +static int vm_raw_pci_write(unsigned int domain, unsigned int bus, unsigned 
> int devfn,
> + int reg, int len, u32 val)
> +{
> + if (raw_pci_ext_ops)
> + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, 
> val);
> + if (domain == 0 && reg < 256 && raw_pci_ops)
> + return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
> + return -EINVAL;
> +}

These look exactly like raw_pci_read()/raw_pci_write() but with inverted
priority. We could've added a parameter but to be more flexible, I'd
suggest we add a 'priority' field to 'struct pci_raw_ops' and make
raw_pci_read()/raw_pci_write() check it before deciding what to use
first. To be on the safe side, you can leave raw_pci_ops's priority
higher than raw_pci_ext_ops's by default and only tweak it in
arch/x86/kernel/cpu/vmware.c 

> +#endif /* CONFIG_HYPERVISOR_GUEST */
> +
>  static int pci_read(struct pci_bus *bus, unsigned int devfn, int where, int 
> size, u32 *value)
>  {
> +#ifdef CONFIG_HYPERVISOR_GUEST
> + /*
> +  * MMIO is faster than PIO, but due to some bugs with legacy
> +  * hardware, it seems prudent to prefer MMIO for VMs and PIO
> +  * for bare-metal.
> +  */
> + if (!hypervisor_is_type(X86_HYPER_NATIVE))
> + return vm_raw_pci_read(pci_domain_nr(bus), bus->number,
> +  devfn, where, size, value);
> +#endif /* CONFIG_HYPERVISOR_GUEST */
> +
>   return raw_pci_read(pci_domain_nr(bus), bus->number,
>devfn, where, size, value);
>  }
>  
>  static int pci_write(struct pci_bus *bus, unsigned int devfn, int where, int 
> size, u32 value)
>  {
> +#ifdef CONFIG_HYPERVISOR_GUEST
> + /*
> +  * MMIO is faster than PIO, but due to some bugs with legacy
> +  * hardware, it seems prudent to prefer MMIO for VMs and PIO
> +  * for bare-metal.
> +  */
> + if 

Re: [BUG report] Deadlock in xen-blkfront upon device hot-unplug

2021-07-15 Thread Vitaly Kuznetsov
Christoph Hellwig  writes:

> On Thu, Jul 15, 2021 at 03:17:37PM +0200, Vitaly Kuznetsov wrote:
>> Christoph Hellwig  writes:
>> 
>> > On Thu, Jul 15, 2021 at 11:16:30AM +0200, Vitaly Kuznetsov wrote:
>> >> I'm observing a deadlock every time I try to unplug a xen-blkfront
>> >> device. With 5.14-rc1+ the deadlock looks like:
>> >
>> > I did actually stumble over this a few days ago just from code
>> > inspection.  Below is what I come up with, can you give it a spin?
>> 
>> This eliminates the deadlock, thanks! Unfortunately, this reveals the
>> same issue I observed when I just dropped taking the mutex from
>> blkfront_closing():
>
> Yeah, this still left too much cruft in blkfront_closing.  Can you
> try this version instead?
>

This one seems to work fine for me! Feel free to throw

Tested-by: Vitaly Kuznetsov 

in. Thanks a lot!

-- 
Vitaly




Re: [BUG report] Deadlock in xen-blkfront upon device hot-unplug

2021-07-15 Thread Vitaly Kuznetsov
Christoph Hellwig  writes:

> On Thu, Jul 15, 2021 at 11:16:30AM +0200, Vitaly Kuznetsov wrote:
>> I'm observing a deadlock every time I try to unplug a xen-blkfront
>> device. With 5.14-rc1+ the deadlock looks like:
>
> I did actually stumble over this a few days ago just from code
> inspection.  Below is what I come up with, can you give it a spin?

This eliminates the deadlock, thanks! Unfortunately, this reveals the
same issue I observed when I just dropped taking the mutex from
blkfront_closing():

[   66.455122] general protection fault, probably for non-canonical address 
0xf1af5e354e6da159:  [#1] SMP PTI
[   66.462802] CPU: 4 PID: 145 Comm: xenwatch Not tainted 5.14.0-rc1+ #370
[   66.467486] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[   66.472817] RIP: 0010:del_timer+0x1f/0x80
[   66.476570] Code: 71 af a3 00 eb c1 31 c0 c3 66 90 0f 1f 44 00 00 41 55 41 
54 45 31 e4 55 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 <48> 8b 47 
08 48 85 c0 74 2d 48 89 e6 48 89 fd e8 dd e8 ff ff 48 89
[   66.493967] RSP: 0018:b5c10426bcd8 EFLAGS: 00010086
[   66.499416] RAX: a49b3c9544841100 RBX: f1af5e354e6da101 RCX: 5ebf
[   66.506386] RDX: 5ec0 RSI: 0001 RDI: f1af5e354e6da151
[   66.512799] RBP: b5c10426bd30 R08: 0001 R09: 0001
[   66.518372] R10: 0001 R11: 0002 R12: 
[   66.523681] R13: 9aba8df63e40 R14:  R15: 9aba86f4
[   66.529365] FS:  () GS:9af60920() 
knlGS:
[   66.536187] CS:  0010 DS:  ES:  CR0: 80050033
[   66.540806] CR2: 7ff024600130 CR3: 00010117a006 CR4: 001706e0
[   66.546345] DR0:  DR1:  DR2: 
[   66.552322] DR3:  DR6: fffe0ff0 DR7: 0400
[   66.558501] Call Trace:
[   66.561449]  try_to_grab_pending+0x13f/0x2e0
[   66.565658]  cancel_delayed_work+0x2e/0xd0
[   66.570012]  blk_mq_stop_hw_queues+0x2d/0x50
[   66.574110]  blkfront_remove+0x30/0x130 [xen_blkfront]
[   66.579049]  xenbus_dev_remove+0x6d/0xf0
[   66.582473]  __device_release_driver+0x180/0x240
[   66.586963]  device_release_driver+0x26/0x40
[   66.591050]  bus_remove_device+0xef/0x160
[   66.594805]  device_del+0x18c/0x3e0
[   66.598570]  ? xenbus_probe_devices+0x120/0x120
[   66.602987]  ? klist_iter_exit+0x14/0x20
[   66.606915]  device_unregister+0x13/0x60
[   66.611135]  xenbus_dev_changed+0x174/0x1e0
[   66.615104]  xenwatch_thread+0x94/0x190
[   66.619028]  ? do_wait_intr_irq+0xb0/0xb0
[   66.623052]  ? xenbus_dev_request_and_reply+0x90/0x90
[   66.628218]  kthread+0x149/0x170
[   66.631509]  ? set_kthread_struct+0x40/0x40
[   66.635355]  ret_from_fork+0x22/0x30
[   66.639162] Modules linked in: vfat fat i2c_piix4 xfs libcrc32c 
crct10dif_pclmul crc32_pclmul crc32c_intel xen_blkfront ghash_clmulni_intel ena
[   66.650868] ---[ end trace 7fa9da1e39697767 ]---
[   66.655490] RIP: 0010:del_timer+0x1f/0x80
[   66.659813] Code: 71 af a3 00 eb c1 31 c0 c3 66 90 0f 1f 44 00 00 41 55 41 
54 45 31 e4 55 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 <48> 8b 47 
08 48 85 c0 74 2d 48 89 e6 48 89 fd e8 dd e8 ff ff 48 89
[   66.681045] RSP: 0018:b5c10426bcd8 EFLAGS: 00010086
[   66.685888] RAX: a49b3c9544841100 RBX: f1af5e354e6da101 RCX: 5ebf
[   66.692153] RDX: 5ec0 RSI: 0001 RDI: f1af5e354e6da151
[   66.698778] RBP: b5c10426bd30 R08: 0001 R09: 0001
[   66.705175] R10: 0001 R11: 0002 R12: 
[   66.711696] R13: 9aba8df63e40 R14:  R15: 9aba86f4
[   66.718035] FS:  () GS:9af60920() 
knlGS:
[   66.725210] CS:  0010 DS:  ES:  CR0: 80050033
[   66.730291] CR2: 7ff024600130 CR3: 00010117a006 CR4: 001706e0
[   66.736235] DR0:  DR1:  DR2: 
[   66.742373] DR3:  DR6: fffe0ff0 DR7: 0400
[   66.749026] BUG: sleeping function called from invalid context at 
include/linux/percpu-rwsem.h:49
[   66.756118] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 145, 
name: xenwatch
[   66.763473] INFO: lockdep is turned off.
[   66.767428] irq event stamp: 24256
[   66.770900] hardirqs last  enabled at (24255): [] 
_raw_spin_unlock_irqrestore+0x4b/0x5d
[   66.779620] hardirqs last disabled at (24256): [] 
try_to_grab_pending+0x15c/0x2e0
[   66.787763] softirqs last  enabled at (24196): [] 
__irq_exit_rcu+0xe1/0x100
[   66.794519] softirqs last disabled at (24191): [] 
__irq_exit_rcu+0xe1/0x100
[   66.801953] CPU: 4 PID: 145 Comm: xenwatch Tainted: G  D   
5.14.0-rc1+ #370
[   66.809315] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[   66.814924] Call Trace:
[   66.817461]  du

[BUG report] Deadlock in xen-blkfront upon device hot-unplug

2021-07-15 Thread Vitaly Kuznetsov
I'm observing a deadlock every time I try to unplug a xen-blkfront
device. With 5.14-rc1+ the deadlock looks like:

[  513.682405] 
[  513.686617] WARNING: possible recursive locking detected
[  513.691020] 5.14.0-rc1+ #370 Not tainted
[  513.694272] 
[  513.698528] xenwatch/144 is trying to acquire lock:
[  513.702424] 96dc4a4c1d28 (&disk->open_mutex){+.+.}-{3:3}, at: 
del_gendisk+0x53/0x210
[  513.708768] 
   but task is already holding lock:
[  513.713320] 96dc4a4c1d28 (&disk->open_mutex){+.+.}-{3:3}, at: 
blkback_changed+0x118/0xeb9 [xen_blkfront]
[  513.720369] 
   other info that might help us debug this:
[  513.724901]  Possible unsafe locking scenario:

[  513.729241]        CPU0
[  513.731326]        ----
[  513.733404]   lock(&disk->open_mutex);
[  513.736679]   lock(&disk->open_mutex);
[  513.739988] 
*** DEADLOCK ***

[  513.745524]  May be due to missing lock nesting notation

[  513.751438] 2 locks held by xenwatch/144:
[  513.755344]  #0: 8c9f3c70 (xenwatch_mutex){+.+.}-{3:3}, at: 
xenwatch_thread+0xe6/0x190
[  513.762137]  #1: 96dc4a4c1d28 (&disk->open_mutex){+.+.}-{3:3}, at: 
blkback_changed+0x118/0xeb9 [xen_blkfront]
[  513.770381] 
   stack backtrace:
[  513.774785] CPU: 1 PID: 144 Comm: xenwatch Not tainted 5.14.0-rc1+ #370
[  513.780131] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[  513.785097] Call Trace:
[  513.787920]  dump_stack_lvl+0x6a/0x9a
[  513.791223]  __lock_acquire.cold+0x14a/0x2ba
[  513.794918]  ? mark_held_locks+0x50/0x80
[  513.798453]  lock_acquire+0xd3/0x2f0
[  513.801819]  ? del_gendisk+0x53/0x210
[  513.805334]  ? kernfs_put.part.0+0xe8/0x1b0
[  513.808905]  ? del_gendisk+0x53/0x210
[  513.812230]  __mutex_lock+0x8d/0x8c0
[  513.815415]  ? del_gendisk+0x53/0x210
[  513.818931]  ? kernfs_put.part.0+0xe8/0x1b0
[  513.822594]  del_gendisk+0x53/0x210
[  513.825782]  xlvbd_release_gendisk+0x6f/0xb0 [xen_blkfront]
[  513.830186]  blkback_changed+0x20e/0xeb9 [xen_blkfront]
[  513.834458]  ? xenbus_read_driver_state+0x39/0x60
[  513.838540]  xenwatch_thread+0x94/0x190
[  513.841936]  ? do_wait_intr_irq+0xb0/0xb0
[  513.845451]  ? xenbus_dev_request_and_reply+0x90/0x90
[  513.849885]  kthread+0x149/0x170
[  513.853039]  ? set_kthread_struct+0x40/0x40
[  513.857027]  ret_from_fork+0x22/0x30

My suspicion is that the problem was introduced by:

commit c76f48eb5c084b1e15c931ae8cc1826cd771d70d
Author: Christoph Hellwig 
Date:   Tue Apr 6 08:22:56 2021 +0200

block: take bd_mutex around delete_partitions in del_gendisk

blkfront_closing() takes 'bdev->bd_disk->open_mutex' around the
xlvbd_release_gendisk() call, which in its turn calls del_gendisk(), which
after the above-mentioned commit tries to take the same mutex. I may be
completely wrong though.

If I try to avoid taking the mutex from blkfront_closing(): 

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 8d49f8fa98bb..9af6831492d4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2145,8 +2145,6 @@ static void blkfront_closing(struct blkfront_info *info)
return;
}
 
-   mutex_lock(&bdev->bd_disk->open_mutex);
-
if (bdev->bd_openers) {
xenbus_dev_error(xbdev, -EBUSY,
 "Device in use; refusing to close");
@@ -2156,7 +2154,6 @@ static void blkfront_closing(struct blkfront_info *info)
xenbus_frontend_closed(xbdev);
}
 
-   mutex_unlock(&bdev->bd_disk->open_mutex);
bdput(bdev);
 }
 
the situation becomes even worse:

[   74.371465] general protection fault, probably for non-canonical address 
0xb0fa8ce8ee8a2234:  [#1] SMP PTI
[   74.381294] CPU: 3 PID: 144 Comm: xenwatch Not tainted 5.14.0-rc1+ #370
[   74.386172] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[   74.390918] RIP: 0010:del_timer+0x1f/0x80
[   74.394282] Code: 71 af a3 00 eb c1 31 c0 c3 66 90 0f 1f 44 00 00 41 55 41 
54 45 31 e4 55 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 <48> 8b 47 
08 48 85 c0 74 2d 48 89 e6 48 89 fd e8 dd e8 ff ff 48 89
[   74.407591] RSP: 0018:bab68423bcc8 EFLAGS: 00010082
[   74.411691] RAX: dd931e09aefb8f00 RBX: b0fa8ce8ee8a21dc RCX: 5e7f
[   74.417041] RDX: 5e80 RSI: 0001 RDI: b0fa8ce8ee8a222c
[   74.422425] RBP: bab68423bd20 R08: 0001 R09: 0001
[   74.427595] R10: 0001 R11: 0002 R12: 
[   74.432886] R13: a0484f3e4000 R14:  R15: a0484691c000
[   74.438784] FS:  () GS:a083c8e0() 
knlGS:
[   74.444592] CS:  0010 DS:  ES:  CR0: 80050033
[   74.448917] CR2: 7ff618903ff8 CR3: 000111e16001 CR4: 001706e0
[   74.454309] DR0:  DR1:  DR2: 
[   74.460128] DR3: 

Re: [RFC PATCH V3 10/11] HV/Netvsc: Add Isolation VM support for netvsc driver

2021-06-10 Thread Vitaly Kuznetsov
Tianyu Lan  writes:

> From: Tianyu Lan 
>
> In Isolation VM, all memory shared with the host needs to be marked visible
> to the host via a hvcall. vmbus_establish_gpadl() has already done this for the
> netvsc rx/tx ring buffers. The page buffers used by vmbus_sendpacket_
> pagebuffer() still need to be handled. Use the DMA API to map/unmap this
> memory when sending/receiving packets, and the Hyper-V DMA ops callback
> will use the swiotlb functions to allocate bounce buffers and copy data
> from/to them.
>
> Signed-off-by: Tianyu Lan 
> ---
>  drivers/net/hyperv/hyperv_net.h   |   6 ++
>  drivers/net/hyperv/netvsc.c   | 125 --
>  drivers/net/hyperv/rndis_filter.c |   3 +
>  include/linux/hyperv.h|   5 ++
>  4 files changed, 133 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index b11aa68b44ec..c2fbb9d4df2c 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>   u32 total_bytes;
>   u32 send_buf_index;
>   u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
>  };
>  
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,7 @@ struct netvsc_device {
>  
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + void *recv_original_buf;
>   u32 recv_buf_size; /* allocated bytes */
>   u32 recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
> @@ -1082,6 +1084,8 @@ struct netvsc_device {
>  
>   /* Send buffer allocated by us */
>   void *send_buf;
> + void *send_original_buf;
> + u32 send_buf_size;
>   u32 send_buf_gpadl_handle;
>   u32 send_section_cnt;
>   u32 send_section_size;
> @@ -1729,4 +1733,6 @@ struct rndis_message {
>  #define RETRY_US_HI  1
>  #define RETRY_MAX2000/* >10 sec */
>  
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +   struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 7bd935412853..a01740c6c6b8 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
>   int i;
>  
>   kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_original_buf) {
> + vunmap(nvdev->recv_buf);
> + vfree(nvdev->recv_original_buf);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if (nvdev->send_original_buf) {
> + vunmap(nvdev->send_buf);
> + vfree(nvdev->send_original_buf);
> + } else {
> + vfree(nvdev->send_buf);
> + }
> +
>   kfree(nvdev->send_section_map);
>  
>   for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
> @@ -338,8 +351,10 @@ static int netvsc_init_buf(struct hv_device *device,
>   struct net_device *ndev = hv_get_drvdata(device);
>   struct nvsp_message *init_packet;
>   unsigned int buf_size;
> + unsigned long *pfns;
>   size_t map_words;
>   int i, ret = 0;
> + void *vaddr;
>  
>   /* Get receive buffer area. */
>   buf_size = device_info->recv_sections * device_info->recv_section_size;
> @@ -375,6 +390,21 @@ static int netvsc_init_buf(struct hv_device *device,
>   goto cleanup;
>   }
>  
> + if (hv_isolation_type_snp()) {
> + pfns = kcalloc(buf_size / HV_HYP_PAGE_SIZE, sizeof(unsigned 
> long),
> +GFP_KERNEL);
> + for (i = 0; i < buf_size / HV_HYP_PAGE_SIZE; i++)
> + pfns[i] = virt_to_hvpfn(net_device->recv_buf + i * 
> HV_HYP_PAGE_SIZE) +
> + (ms_hyperv.shared_gpa_boundary >> 
> HV_HYP_PAGE_SHIFT);
> +
> + vaddr = vmap_pfn(pfns, buf_size / HV_HYP_PAGE_SIZE, 
> PAGE_KERNEL_IO);
> + kfree(pfns);
> + if (!vaddr)
> + goto cleanup;
> + net_device->recv_original_buf = net_device->recv_buf;
> + net_device->recv_buf = vaddr;
> + }
> +
>   /* Notify the NetVsp of the gpadl handle */
>   init_packet = &net_device->channel_init_pkt;
>   memset(init_packet, 0, sizeof(struct nvsp_message));
> @@ -477,6 +507,23 @@ static int netvsc_init_buf(struct hv_device *device,
>   goto cleanup;
>   }
>  
> + if (hv_isolation_type_snp()) {
> + pfns = kcalloc(buf_size / HV_HYP_PAGE_SIZE, sizeof(unsigned 
> long),
> +GFP_KERNEL);
> +
> + for (i = 0; i < buf_size / HV_HYP_PAGE_SIZE; i++)
> + pfns[i] = virt_to_hvpfn(net_device->send_buf + i * 
> HV_HYP_PAGE_SIZE)
> + + (ms_hyperv.shared_gpa_boundary >> 
> HV_HYP_PAGE_SHIFT);
> +
> + vaddr = vmap_pfn(pfns, buf_size / HV_HYP_PAGE_SIZE, 
> PAGE_KERNEL_IO);

Re: [RFC PATCH V3 03/11] x86/Hyper-V: Add new hvcall guest address host visibility support

2021-06-10 Thread Vitaly Kuznetsov
Tianyu Lan  writes:

> From: Tianyu Lan 
>
> Add new hvcall guest address host visibility support. Mark the vmbus
> ring buffer visible to the host when creating the gpadl buffer and mark it
> not visible again when tearing down the gpadl buffer.
>
> Co-developed-by: Sunil Muthuswamy 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/hyperv/Makefile   |   2 +-
>  arch/x86/hyperv/ivm.c  | 106 +
>  arch/x86/include/asm/hyperv-tlfs.h |  24 +++
>  arch/x86/include/asm/mshyperv.h|   4 +-
>  arch/x86/mm/pat/set_memory.c   |  10 ++-
>  drivers/hv/channel.c   |  38 ++-
>  include/asm-generic/hyperv-tlfs.h  |   1 +
>  include/linux/hyperv.h |  10 +++
>  8 files changed, 190 insertions(+), 5 deletions(-)
>  create mode 100644 arch/x86/hyperv/ivm.c
>
> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> index 48e2c51464e8..5d2de10809ae 100644
> --- a/arch/x86/hyperv/Makefile
> +++ b/arch/x86/hyperv/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -obj-y:= hv_init.o mmu.o nested.o irqdomain.o
> +obj-y:= hv_init.o mmu.o nested.o irqdomain.o ivm.o
>  obj-$(CONFIG_X86_64) += hv_apic.o hv_proc.o
>  
>  ifdef CONFIG_X86_64
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> new file mode 100644
> index ..fad1d3024056
> --- /dev/null
> +++ b/arch/x86/hyperv/ivm.c
> @@ -0,0 +1,106 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Hyper-V Isolation VM interface with paravisor and hypervisor
> + *
> + * Author:
> + *  Tianyu Lan 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
> + *
> + * In Isolation VM, all guest memory is encrypted from host and guest
> + * needs to set memory visible to host via hvcall before sharing memory
> + * with host.
> + */
> +int hv_mark_gpa_visibility(u16 count, const u64 pfn[], u32 visibility)
> +{
> + struct hv_gpa_range_for_visibility **input_pcpu, *input;
> + u16 pages_processed;
> + u64 hv_status;
> + unsigned long flags;
> +
> + /* no-op if partition isolation is not enabled */
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
> + pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n", count,
> + HV_MAX_MODIFY_GPA_REP_COUNT);
> + return -EINVAL;
> + }
> +
> + local_irq_save(flags);
> + input_pcpu = (struct hv_gpa_range_for_visibility **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> + input = *input_pcpu;
> + if (unlikely(!input)) {
> + local_irq_restore(flags);
> + return -EINVAL;
> + }
> +
> + input->partition_id = HV_PARTITION_ID_SELF;
> + input->host_visibility = visibility;
> + input->reserved0 = 0;
> + input->reserved1 = 0;
> + memcpy((void *)input->gpa_page_list, pfn, count * sizeof(*pfn));
> + hv_status = hv_do_rep_hypercall(
> + HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
> + 0, input, &pages_processed);
> + local_irq_restore(flags);
> +
> + if (!(hv_status & HV_HYPERCALL_RESULT_MASK))
> + return 0;
> +
> + return hv_status & HV_HYPERCALL_RESULT_MASK;
> +}
> +EXPORT_SYMBOL(hv_mark_gpa_visibility);
> +
> +/*
> + * hv_set_mem_host_visibility - Set specified memory visible to host.
> + *
> + * In Isolation VM, all guest memory is encrypted from host and guest
> + * needs to set memory visible to host via hvcall before sharing memory
> + * with host. This function works as wrap of hv_mark_gpa_visibility()
> + * with memory base and size.
> + */
> +int hv_set_mem_host_visibility(void *kbuffer, size_t size,
> +enum vmbus_page_visibility visibility)
> +{
> + int pagecount = size >> HV_HYP_PAGE_SHIFT;
> + u64 *pfn_array;
> + int ret = 0;
> + int i, pfn;
> +
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + pfn_array = vzalloc(HV_HYP_PAGE_SIZE);
> + if (!pfn_array)
> + return -ENOMEM;
> +
> + for (i = 0, pfn = 0; i < pagecount; i++) {
> + pfn_array[pfn] = virt_to_hvpfn(kbuffer + i * HV_HYP_PAGE_SIZE);
> + pfn++;
> +
> + if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> + ret |= hv_mark_gpa_visibility(pfn, pfn_array, 
> visibility);
> + pfn = 0;
> +
> + if (ret)
> + goto err_free_pfn_array;
> + }
> + }
> +
> + err_free_pfn_array:
> + vfree(pfn_array);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(hv_set_mem_host_visibility);
> +
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index 606f5cc579b2..632281b91b44 100644
> --- 

Re: [Xen-devel] [RFC 6/6] arm64: hyperv: Enable vDSO

2019-12-17 Thread Vitaly Kuznetsov
Boqun Feng  writes:

> Similar to x86, add a new vclock_mode VCLOCK_HVCLOCK, and reuse the
> hv_read_tsc_page() for userspace to read tsc page clocksource.
>
> Signed-off-by: Boqun Feng (Microsoft) 
> ---
>  arch/arm64/include/asm/clocksource.h   |  3 ++-
>  arch/arm64/include/asm/mshyperv.h  |  2 +-
>  arch/arm64/include/asm/vdso/gettimeofday.h | 19 +++
>  3 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/clocksource.h 
> b/arch/arm64/include/asm/clocksource.h
> index fbe80057468c..c6acd45fe748 100644
> --- a/arch/arm64/include/asm/clocksource.h
> +++ b/arch/arm64/include/asm/clocksource.h
> @@ -4,7 +4,8 @@
>  
>  #define VCLOCK_NONE  0   /* No vDSO clock available. */
>  #define VCLOCK_CNTVCT1   /* vDSO should use cntvcnt  
> */
> -#define VCLOCK_MAX   1
> +#define VCLOCK_HVCLOCK   2   /* vDSO should use vread_hvclock()  
> */
> +#define VCLOCK_MAX   2
>  
>  struct arch_clocksource_data {
>   int vclock_mode;
> diff --git a/arch/arm64/include/asm/mshyperv.h 
> b/arch/arm64/include/asm/mshyperv.h
> index 0afb00e3501d..7c85dd816dca 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -90,7 +90,7 @@ extern void hv_get_vpreg_128(u32 reg, struct 
> hv_get_vp_register_output *result);
>  #define hv_set_reference_tsc(val) \
>   hv_set_vpreg(HV_REGISTER_REFERENCE_TSC, val)
>  #define hv_set_clocksource_vdso(val) \
> - ((val).archdata.vclock_mode = VCLOCK_NONE)
> + ((val).archdata.vclock_mode = VCLOCK_HVCLOCK)
>  
>  #if IS_ENABLED(CONFIG_HYPERV)
>  #define hv_enable_stimer0_percpu_irq(irq)enable_percpu_irq(irq, 0)
> diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
> b/arch/arm64/include/asm/vdso/gettimeofday.h
> index e6e3fe0488c7..7e689b903f4d 100644
> --- a/arch/arm64/include/asm/vdso/gettimeofday.h
> +++ b/arch/arm64/include/asm/vdso/gettimeofday.h
> @@ -67,6 +67,20 @@ int clock_getres_fallback(clockid_t _clkid, struct 
> __kernel_timespec *_ts)
>   return ret;
>  }
>  
> +#ifdef CONFIG_HYPERV_TIMER
> +/* This will override the default hv_get_raw_timer() */
> +#define hv_get_raw_timer() __arch_counter_get_cntvct()
> +#include 
> +
> +extern struct ms_hyperv_tsc_page
> +_hvclock_page __attribute__((visibility("hidden")));
> +
> +static u64 vread_hvclock(void)
> +{
> + return hv_read_tsc_page(&_hvclock_page);
> +}
> +#endif

The function is almost the same on x86 (&_hvclock_page -> &hvclock_page),
would it maybe make sense to move this to the arch-neutral
clocksource/hyperv_timer.h?
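
(Something along these lines, as a sketch; whether both architectures can
expose the reference TSC page to the vDSO under the same hidden symbol is the
open question here:)

/*
 * Sketch for an arch-neutral helper in include/clocksource/hyperv_timer.h,
 * assuming every architecture names its vDSO copy of the reference TSC
 * page "hvclock_page". Illustrative only.
 */
extern struct ms_hyperv_tsc_page
hvclock_page __attribute__((visibility("hidden")));

static __always_inline u64 vread_hvclock(void)
{
        return hv_read_tsc_page(&hvclock_page);
}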

> +
>  static __always_inline u64 __arch_get_hw_counter(s32 clock_mode)
>  {
>   u64 res;
> @@ -78,6 +92,11 @@ static __always_inline u64 __arch_get_hw_counter(s32 
> clock_mode)
>   if (clock_mode == VCLOCK_NONE)
>   return __VDSO_USE_SYSCALL;
>  
> +#ifdef CONFIG_HYPERV_TIMER
> + if (likely(clock_mode == VCLOCK_HVCLOCK))
> + return vread_hvclock();

I'm not sure likely() is justified here: it'll make ALL builds which
enable CONFIG_HYPERV_TIMER (e.g. distro kernels) prefer
VCLOCK_HVCLOCK, even if the kernel is not running on Hyper-V.

> +#endif
> +
>   /*
>* This isb() is required to prevent that the counter value
>* is speculated.

-- 
Vitaly



Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
Dave Hansen  writes:

> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>> It is more than just memmaps (e.g. forking udev process doing memory
>> onlining also needs memory) but yes, the main idea is to make the
>> onlining synchronous with hotplug.
>
> That's a good theoretical concern.
>
> But, is it a problem we need to solve in practice?

Yes, unfortunately. It was previously discovered that when we try to
hotplug tons of memory into a low-memory system (this is a common scenario
with VMs) we end up with OOM, because for all new memory blocks we need
to allocate page tables, struct pages, ... and we need memory to do
that. The userspace program doing memory onlining also needs memory to
run, and in case it prefers to fork to handle hundreds of notifications
... well, it may get OOM-killed before it manages to online anything.

Allocating all kernel objects from the newly hotplugged blocks would
definitely help to manage the situation, but as I said this won't solve
the 'forking udev' problem completely (it will likely remain in
'extreme' cases only; we can probably work around it by onlining with a
dedicated process which doesn't do memory allocation).

-- 
Vitaly


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
>> David Hildenbrand  writes:
>> 
>> > On 02/10/2018 15:47, Michal Hocko wrote:
>> ...
>> >> 
>> >> Why do you need a generic hotplug rule in the first place? Why don't you
>> >> simply provide different set of rules for different usecases? Let users
>> >> decide which usecase they prefer rather than try to be clever which
>> >> almost always hits weird corner cases.
>> >> 
>> >
>> > Memory hotplug has to work as reliable as we can out of the box. Letting
>> > the user make simple decisions like "oh, I am on hyper-V, I want to
>> > online memory to the normal zone" does not feel right. But yes, we
>> > should definitely allow to make modifications.
>> 
>> Last time I was thinking about the imperfectness of the auto-online
>> solution we have (and of any other solution we're able to suggest), an idea
>> came to my mind - what if we add an eBPF attach point to the
>> auto-onlining mechanism, effectively offloading decision-making to
>> userspace. We'll of course need to provide all required data (e.g. how
>> memory blocks are aligned with physical DIMMs, as it makes no sense to
>> online part of a DIMM as normal and the rest as movable, since it's going
>> to be impossible to unplug such a DIMM anyway).
>
> And how does that differ from the notification mechanism we have? Just
> by not relying on the process scheduling? If yes then this revolves
> around the implementation detail that you care about time-to-hot-add
> vs. time-to-online. And that is a solveable problem - just allocate
> memmaps from the hot-added memory.

It is more than just memmaps (e.g. forking udev process doing memory
onlining also needs memory) but yes, the main idea is to make the
onlining synchronous with hotplug.

>
> As David said some of the memory cannot be onlined without further steps
> (e.g. when it is standby as David called it) and then I fail to see how
> eBPF help in any way.

And also, we can fight till the end of days here trying to come up with
an onlining solution which would work for everyone; eBPF would move
this decision to the distro level.

-- 
Vitaly


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Vitaly Kuznetsov
David Hildenbrand  writes:

> On 02/10/2018 15:47, Michal Hocko wrote:
...
>> 
>> Why do you need a generic hotplug rule in the first place? Why don't you
>> simply provide different set of rules for different usecases? Let users
>> decide which usecase they prefer rather than try to be clever which
>> almost always hits weird corner cases.
>> 
>
> Memory hotplug has to work as reliable as we can out of the box. Letting
> the user make simple decisions like "oh, I am on hyper-V, I want to
> online memory to the normal zone" does not feel right. But yes, we
> should definitely allow to make modifications.

Last time I was thinking about the imperfectness of the auto-online
solution we have (and of any other solution we're able to suggest), an idea
came to my mind - what if we add an eBPF attach point to the
auto-onlining mechanism, effectively offloading decision-making to
userspace. We'll of course need to provide all required data (e.g. how
memory blocks are aligned with physical DIMMs, as it makes no sense to
online part of a DIMM as normal and the rest as movable, since it's going
to be impossible to unplug such a DIMM anyway).

-- 
Vitaly


Re: [Xen-devel] [PATCH RFC] x86/lapic: remove the PIT usage to calibrate the lapic timer

2018-09-19 Thread Vitaly Kuznetsov
Roger Pau Monne  writes:

> And instead use NOW which is based on the TSC. This was already used
> when running in shim mode, since there's likely no PIT in that
> environment.
>
> Remove printing the CPU frequency, since it's already printed earlier
> at boot, and getting the CPU frequency against the TSC without any
> external reference timer is pointless.
>
> The motivation behind this change is to allow Xen to boot on HyperV
> gen2 instances, which lack a PIT.

When on Hyper-V, the LAPIC frequency can easily be queried from
HV_X64_MSR_APIC_FREQUENCY (and that's what Linux does nowadays). This, of
course, works only if Hyper-V provides the interface, but it always does.
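
(For reference, a simplified sketch of that Linux logic; the exact feature-bit
and variable names differ between kernel versions:)

/* Simplified sketch of what arch/x86/kernel/cpu/mshyperv.c does. */
static void __init hv_lapic_frequency_from_msr(void)
{
        u64 hv_lapic_frequency;

        /* In the real code this is guarded by the frequency-MSRs feature bit. */
        rdmsrl(HV_X64_MSR_APIC_FREQUENCY, hv_lapic_frequency);
        lapic_timer_period = div_u64(hv_lapic_frequency, HZ);
}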

-- 
  Vitaly


[Xen-devel] [PATCH] xen/manage: don't complain about an empty value in control/sysrq node

2018-09-06 Thread Vitaly Kuznetsov
When the guest receives a sysrq request from the host it acknowledges it by
writing '\0' to the control/sysrq xenstore node. This, however, makes the
xenstore watch fire again, but xenbus_scanf() fails to parse the empty value
with the "%c" format string:

 sysrq: SysRq : Emergency Sync
 Emergency Sync complete
 xen:manage: Error -34 reading sysrq code in control/sysrq

Ignore -ERANGE the same way we already ignore -ENOENT; an empty value in
control/sysrq is totally legal.

Signed-off-by: Vitaly Kuznetsov 
---
This is a follow-up to my Xen toolstack patch:
https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg00266.html
without it we're seeing -EPERM on write and the issue I'm trying to address
here stays hidden.
---
 drivers/xen/manage.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c93d8ef8df34..5bb01a62f214 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -280,9 +280,11 @@ static void sysrq_handler(struct xenbus_watch *watch, 
const char *path,
/*
 * The Xenstore watch fires directly after registering it and
 * after a suspend/resume cycle. So ENOENT is no error but
-* might happen in those cases.
+* might happen in those cases. ERANGE is observed when we get
+* an empty value (''), this happens when we acknowledge the
+* request by writing '\0' below.
 */
-   if (err != -ENOENT)
+   if (err != -ENOENT && err != -ERANGE)
pr_err("Error %d reading sysrq code in control/sysrq\n",
   err);
xenbus_transaction_end(xbt, 1);
-- 
2.14.4



Re: [Xen-devel] [PATCH] libxl: create control/sysrq xenstore node

2018-09-05 Thread Vitaly Kuznetsov
Wei Liu  writes:

> Also CC Linux maintainers.
>
> On Tue, Sep 04, 2018 at 07:27:31PM +0200, Vitaly Kuznetsov wrote:
>> Wei Liu  writes:
>> 
>> > On Tue, Sep 04, 2018 at 01:39:29PM +0200, Vitaly Kuznetsov wrote:
>> >> 'xl sysrq' command doesn't work with modern Linux guests with the 
>> >> following
>> >> message in guest's log:
>> >> 
>> >>  xen:manage: sysrq_handler: Error -13 writing sysrq in control/sysrq
>> >> 
>> >> xenstore trace confirms:
>> >> 
>> >>  IN 0x24bd9a0 20180904 04:36:32 WRITE (control/sysrq )
>> >>  OUT 0x24bd9a0 20180904 04:36:32 ERROR (EACCES )
>> >> 
>> >> The problem seems to be in the fact that we don't pre-create control/sysrq
>> >> xenstore node and libxl_send_sysrq() doing libxl__xs_printf() creates it 
>> >> as
>> >> read-only. As we want to allow guests to clean 'control/sysrq' after the
>> >> requested action is performed, we need to make this node writable.
>> >> 
>> >> Signed-off-by: Vitaly Kuznetsov 
>> >
>> > Hmm... this node isn't documented.
>> >
>> 
>> But libxl already writes it, see libxl_send_sysrq().
>
> I think your patch is fine.
>
> What I wanted to do is to document this node. :p
>
>> 
>> > Can you point me to the code in Linux?
>> >
>> 
>> Sure,
>> 
>> drivers/xen/manage.c:
>
> Alright, it appears DomU only expects a single character to be written.
>
> I have written the following diff, do you think it is sensible? I
> believe this is Linux only, but I have CC Roger for correction.
>
> diff --git a/docs/misc/xenstore-paths.markdown 
> b/docs/misc/xenstore-paths.markdown
> index 60c8b3fbe5..33d281915c 100644
> --- a/docs/misc/xenstore-paths.markdown
> +++ b/docs/misc/xenstore-paths.markdown
> @@ -410,6 +410,14 @@ by udev ("0") or will be run by the toolstack directly 
> ("1").
>
>  ### Platform Feature and Control Paths
>
> + ~/control/sysrq = (""|COMMAND) [w]
> +
> +This is the PV SysRq control node. A toolstack can write a single character
> +representing a magic SysRq key understood by the Linux kernel.  The guest
> +acknowledges a request by writing the empty string back to the command node.
> +
> +This protocol is Linux only.
> +
>   ~/control/shutdown = (""|COMMAND) [w]
>

Makes perfect sense to me and sounds correct, thanks!

-- 
  Vitaly


Re: [Xen-devel] [PATCH] libxl: create control/sysrq xenstore node

2018-09-04 Thread Vitaly Kuznetsov
Wei Liu  writes:

> On Tue, Sep 04, 2018 at 01:39:29PM +0200, Vitaly Kuznetsov wrote:
>> 'xl sysrq' command doesn't work with modern Linux guests with the following
>> message in guest's log:
>> 
>>  xen:manage: sysrq_handler: Error -13 writing sysrq in control/sysrq
>> 
>> xenstore trace confirms:
>> 
>>  IN 0x24bd9a0 20180904 04:36:32 WRITE (control/sysrq )
>>  OUT 0x24bd9a0 20180904 04:36:32 ERROR (EACCES )
>> 
>> The problem seems to be in the fact that we don't pre-create control/sysrq
>> xenstore node and libxl_send_sysrq() doing libxl__xs_printf() creates it as
>> read-only. As we want to allow guests to clean 'control/sysrq' after the
>> requested action is performed, we need to make this node writable.
>> 
>> Signed-off-by: Vitaly Kuznetsov 
>
> Hmm... this node isn't documented.
>

But libxl already writes it, see libxl_send_sysrq().

> Can you point me to the code in Linux?
>

Sure,

drivers/xen/manage.c:

static void sysrq_handler(struct xenbus_watch *watch, const char *path,
  const char *token)
{
char sysrq_key = '\0';
struct xenbus_transaction xbt;
int err;

 again:
err = xenbus_transaction_start(&xbt);
if (err)
return;
err = xenbus_scanf(xbt, "control", "sysrq", "%c", &sysrq_key);
if (err < 0) {
/*
 * The Xenstore watch fires directly after registering it and
 * after a suspend/resume cycle. So ENOENT is no error but
 * might happen in those cases.
 */
if (err != -ENOENT)
pr_err("Error %d reading sysrq code in control/sysrq\n",
   err);
xenbus_transaction_end(xbt, 1);
return;
}

if (sysrq_key != '\0') {
err = xenbus_printf(xbt, "control", "sysrq", "%c", '\0');
if (err) {
pr_err("%s: Error %d writing sysrq in control/sysrq\n",
   __func__, err);
xenbus_transaction_end(xbt, 1);
return;
}
}

err = xenbus_transaction_end(xbt, 0);
if (err == -EAGAIN)
goto again;

if (sysrq_key != '\0')
handle_sysrq(sysrq_key);
}

static struct xenbus_watch sysrq_watch = {
.node = "control/sysrq",
.callback = sysrq_handler
};


-- 
  Vitaly


[Xen-devel] [PATCH] libxl: create control/sysrq xenstore node

2018-09-04 Thread Vitaly Kuznetsov
The 'xl sysrq' command doesn't work with modern Linux guests, with the
following message in the guest's log:

 xen:manage: sysrq_handler: Error -13 writing sysrq in control/sysrq

xenstore trace confirms:

 IN 0x24bd9a0 20180904 04:36:32 WRITE (control/sysrq )
 OUT 0x24bd9a0 20180904 04:36:32 ERROR (EACCES )

The problem seems to be the fact that we don't pre-create the control/sysrq
xenstore node, and libxl_send_sysrq() doing libxl__xs_printf() creates it as
read-only. As we want to allow guests to clean 'control/sysrq' after the
requested action is performed, we need to make this node writable.

Signed-off-by: Vitaly Kuznetsov 
---
- Linux code will need to be modified too. With this patch we get something
like

 sysrq: SysRq : Emergency Sync
 xen:manage: Error -34 reading sysrq code in control/sysrq
 Emergency Sync complete

This new ERANGE fault happens because after we clean the control/sysrq node
the xenstore watch in the guest is re-asserted, but the value is empty (""),
so xenbus_scanf() fails "%c" parsing.

It seems the feature has been broken for a long time.
---
 tools/libxl/libxl_create.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index b7b44e280b..d60f952f73 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -692,6 +692,9 @@ retry_transaction:
 GCSPRINTF("%s/control/feature-s4", dom_path),
 rwperm, ARRAY_SIZE(rwperm));
 }
+libxl__xs_mknod(gc, t,
+GCSPRINTF("%s/control/sysrq", dom_path),
+rwperm, ARRAY_SIZE(rwperm));
 libxl__xs_mknod(gc, t,
 GCSPRINTF("%s/device/suspend/event-channel", dom_path),
 rwperm, ARRAY_SIZE(rwperm));
-- 
2.14.4



[Xen-devel] [PATCH net-next] xen-netfront: fix queue name setting

2018-07-20 Thread Vitaly Kuznetsov
Commit f599c64fdf7d ("xen-netfront: Fix race between device setup and
open") changed the initialization order: xennet_create_queues() now
happens before we do register_netdev() so using netdev->name in
xennet_init_queue() is incorrect, we end up with the following in
/proc/interrupts:

 60:139  0   xen-dyn-event eth%d-q0-tx
 61:265  0   xen-dyn-event eth%d-q0-rx
 62:234  0   xen-dyn-event eth%d-q1-tx
 63:  1  0   xen-dyn-event eth%d-q1-rx

and this looks ugly. Actually, using the early netdev name (even when it's
already set) is also not ideal: nowadays we tend to rename eth devices,
and the queue name may end up not corresponding to the netdev name.

Use the nodename from the xenbus device for queue naming: this can't change
during the VM's lifetime. Now /proc/interrupts looks like

 62:202  0   xen-dyn-event device/vif/0-q0-tx
 63:317  0   xen-dyn-event device/vif/0-q0-rx
 64:262  0   xen-dyn-event device/vif/0-q1-tx
 65: 17  0   xen-dyn-event device/vif/0-q1-rx

Fixes: f599c64fdf7d ("xen-netfront: Fix race between device setup and open")
Signed-off-by: Vitaly Kuznetsov 
---
 drivers/net/xen-netfront.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index a57daecf1d57..1b40b648ed5c 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1604,7 +1604,7 @@ static int xennet_init_queue(struct netfront_queue *queue)
timer_setup(&queue->rx_refill_timer, rx_refill_timeout, 0);
 
snprintf(queue->name, sizeof(queue->name), "%s-q%u",
-queue->info->netdev->name, queue->id);
+queue->info->xbdev->nodename, queue->id);
 
/* Initialise tx_skbs as a free chain containing every entry. */
queue->tx_skb_freelist = 0;
-- 
2.14.4

