Re: [Xen-devel] Linux as 32-bit Dom0?

2017-11-22 Thread Juergen Gross
On 22/11/17 15:05, Jan Beulich wrote:
> Jürgen, Boris,
> 
> am I trying something that's not allowed, but selectable via Kconfig?
> On system with multiple IO-APICs (I assume that's what triggers the
> problem) I get
> 
> Kernel panic - not syncing: Max apic_id exceeded!

Generally I don't think 32 bit dom0 is forbidden, but rarely used. I
wouldn't be too sad in case we'd decide to drop that support. ;-)

Can you please be a little bit more specific?

How many IOAPICs? From the code I guess this is an INTEL system with not
too recent IOAPIC versions (<0x14)?

Having a little bit more of the boot log might help, too.


Juergen

> 
> CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.1-2017-11-21-xen0 #6
> Hardware name: ...
> Call Trace:
>  ? show_stack+0x20/0x50
>  ? dump_stack+0x7e/0xc0
>  ? panic+0x99/0x220
>  ? io_apic_get_unique_id+0x207/0x210
>  ? __raw_callee_save_xen_restore_fl+0x6/0x8
>  ? xen_flush_tlb_single+0x6f/0x80
>  ? set_pte_vaddr+0xef/0x110
>  ? xen_io_apic_read+0x36/0x90
>  ? mp_register_ioapic+0x2b7/0x410
>  ? acpi_os_map_iomem+0x14d/0x210
>  ? acpi_parse_ioapic+0x6b/0x6f
>  ? acpi_parse_entries_array+0xa1/0x16d
>  ? acpi_tb_get_table+0x95/0x9d
>  ? acpi_ut_release_mutex+0x109/0x10e
>  ? acpi_table_parse_entries_array+0x98/0xa8
>  ? acpi_table_parse_entries+0x30/0x35
>  ? acpi_boot_init+0x65/0x65
>  ? acpi_table_parse_madt+0x1b/0x1f
>  ? acpi_boot_init+0x65/0x65
>  ? acpi_parse_madt_ioapic_entries+0x4a/0x119
>  ? mutex_lock+0x8/0x30
>  ? acpi_process_madt+0xbe/0x105
>  ? acpi_boot_init+0x3d/0x65
>  ? setup_arch+0x67a/0x83f
>  ? 0xc100
>  ? start_kernel+0x46/0x361
>  ? x86_early_init_platform_quirks+0x4d/0x90
>  ? i386_start_kernel+0x22/0x88
>  ? xen_start_kernel+0x453/0x639
> (XEN) Hardware Dom0 crashed: 'noreboot' set - not rebooting.
> 
> I admit I have a few custom patches in that tree, but I'm reasonably
> certain that none of them comes even close to having such an effect.
> 
> Jan
> 


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 21/30] xen: deprecate pci_get_bus_and_slot()

2017-11-22 Thread Juergen Gross
On 22/11/17 06:31, Sinan Kaya wrote:
> pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
> where a PCI device is present. This restricts the device drivers to be
> reused for other domain numbers.
> 
> Use pci_get_domain_bus_and_slot() with a domain number of 0 where we can't
> extract the domain number. Other places, use the actual domain number from
> the device.
> 
> Signed-off-by: Sinan Kaya <ok...@codeaurora.org>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PV seems to be broken on Linus' tree

2017-11-22 Thread Juergen Gross
On 22/11/17 05:46, Andy Lutomirski wrote:
> On Tue, Nov 21, 2017 at 8:11 PM, Andy Lutomirski  wrote:
>> On Tue, Nov 21, 2017 at 7:33 PM, Andy Lutomirski  wrote:
>>> I'm doing:
>>>
>>> /usr/bin/qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -net none
>>> -nographic -kernel xen-4.8.2 -initrd './arch/x86/boot/bzImage' -m 2G
>>> -smp 2 -append console=com1
>>>
>>> With Linus' commit c8a0739b185d11d6e2ca7ad9f5835841d1cfc765 and the
>>> attached config.
>>>
>>> It dies with a bunch of sensible log lines and then:
>>>
>>> (XEN) d0v0 Unhandled invalid opcode fault/trap [#6, ec=]
>>> (XEN) domain_crash_sync called from entry.S: fault at 82d08023961a
>>> entry.o#create_bounce_frame+0x137/0x146
>>> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>>> (XEN) [ Xen-4.8.2  x86_64  debug=n   Not tainted ]
>>> (XEN) CPU:0
>>> (XEN) RIP:e033:[]
>>> (XEN) RFLAGS: 0296   EM: 1   CONTEXT: pv guest (d0v0)
>>> (XEN) rax: 002f   rbx: 81e65a48   rcx: 81e71288
>>> (XEN) rdx: 81e27500   rsi: 0001   rdi: 81133f88
>>> (XEN) rbp:    rsp: 81e03e78   r8:  
>>> (XEN) r9:  0001   r10:    r11: 
>>> (XEN) r12:    r13: 0001   r14: 0001
>>> (XEN) r15:    cr0: 8005003b   cr4: 003506e0
>>> (XEN) cr3: 7b0b3000   cr2: 
>>> (XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
>>> (XEN) Guest stack trace from rsp=81e03e78:
>>> (XEN)81e71288  811226eb 0001e030
>>> (XEN)00010096 81e03eb8 e02b 811226eb
>>> (XEN)81122c2e 0200  
>>> (XEN)0030 81c69cf5 81080b20 81080560
>>> (XEN) 810d3741 8107b420 81094660
>>>
>>> Is this familiar?
>>>
>>> I'll feel really dumb if it ends up being my fault.
>>
>> Nah, it's broken at least back to v4.13, and I suspect it's config
>> related.  objdump gives me this:
>>
>> 8112b0e1:   e9 e8 fe ff ff  jmpq
>> 8112afce 
>> 8112b0e6:   48 c7 c6 2d f8 c8 81mov
>> $0x81c8f82d,%rsi
>> 8112b0ed:   48 c7 c7 58 b9 c8 81mov
>> $0x81c8b958,%rdi
>> 8112b0f4:   e8 13 2d 01 00  callq  8113de0c 
>> 
>> 8112b0f9:   0f ff   (bad)   <-- crash here
>>
>> That's "ud0", which is used by WARN.  So we're probably hitting an
>> early warning and Xen probably has something busted with early
>> exception handling.
>>
>> Anyone want to debug it and fix it?
> 
> Well, I think I debugged it.  x86_64 has a shiny function
> idt_setup_early_handler(), and Xen doesn't call it.  Fixing the
> problem may be as simple as calling it at an appropriate time and
> doing whatever asm magic is needed to deal with Xen's weird IDT
> calling convention.

Hmm, yes, this should work. I'll have a try.

BTW: I don't think this ever worked.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2] mini-os: add config item for printing via hypervisor

2017-11-22 Thread Juergen Gross
Today Mini-OS will print all console output via the hypervisor, too.

Make this behavior configurable instead and default it to "off".

Signed-off-by: Juergen Gross <jgr...@suse.com>
Acked-by: Samuel Thibault <samuel.thiba...@ens-lyon.org>
---
V2: keep comment (Samuel Thibault)
---
 Config.mk | 4 
 arch/x86/testbuild/all-no | 1 +
 arch/x86/testbuild/all-yes| 1 +
 arch/x86/testbuild/newxen-yes | 1 +
 console/console.c | 7 +--
 5 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Config.mk b/Config.mk
index 0baedd1..f6a2afa 100644
--- a/Config.mk
+++ b/Config.mk
@@ -180,6 +180,9 @@ CONFIG_XENBUS ?= y
 CONFIG_XC ?=y
 CONFIG_LWIP ?= $(lwip)
 CONFIG_BALLOON ?= n
+# Setting CONFIG_USE_XEN_CONSOLE copies all print output to the Xen emergency
+# console apart of standard dom0 handled console.
+CONFIG_USE_XEN_CONSOLE ?= n
 
 # Export config items as compiler directives
 DEFINES-$(CONFIG_PARAVIRT) += -DCONFIG_PARAVIRT
@@ -197,6 +200,7 @@ DEFINES-$(CONFIG_FBFRONT) += -DCONFIG_FBFRONT
 DEFINES-$(CONFIG_CONSFRONT) += -DCONFIG_CONSFRONT
 DEFINES-$(CONFIG_XENBUS) += -DCONFIG_XENBUS
 DEFINES-$(CONFIG_BALLOON) += -DCONFIG_BALLOON
+DEFINES-$(CONFIG_USE_XEN_CONSOLE) += -DCONFIG_USE_XEN_CONSOLE
 
 DEFINES-y += -D__XEN_INTERFACE_VERSION__=$(XEN_INTERFACE_VERSION)
 
diff --git a/arch/x86/testbuild/all-no b/arch/x86/testbuild/all-no
index 78720c3..1c50bba 100644
--- a/arch/x86/testbuild/all-no
+++ b/arch/x86/testbuild/all-no
@@ -16,3 +16,4 @@ CONFIG_XENBUS = n
 CONFIG_XC = n
 CONFIG_LWIP = n
 CONFIG_BALLOON = n
+CONFIG_USE_XEN_CONSOLE = n
diff --git a/arch/x86/testbuild/all-yes b/arch/x86/testbuild/all-yes
index 303c56b..8732e69 100644
--- a/arch/x86/testbuild/all-yes
+++ b/arch/x86/testbuild/all-yes
@@ -17,3 +17,4 @@ CONFIG_XC = y
 # LWIP is special: it needs support from outside
 CONFIG_LWIP = n
 CONFIG_BALLOON = y
+CONFIG_USE_XEN_CONSOLE = y
diff --git a/arch/x86/testbuild/newxen-yes b/arch/x86/testbuild/newxen-yes
index 907a8a0..9c30c00 100644
--- a/arch/x86/testbuild/newxen-yes
+++ b/arch/x86/testbuild/newxen-yes
@@ -17,4 +17,5 @@ CONFIG_XC = y
 # LWIP is special: it needs support from outside
 CONFIG_LWIP = n
 CONFIG_BALLOON = y
+CONFIG_USE_XEN_CONSOLE = y
 XEN_INTERFACE_VERSION=__XEN_LATEST_INTERFACE_VERSION__
diff --git a/console/console.c b/console/console.c
index 2e04552..6a0b923 100644
--- a/console/console.c
+++ b/console/console.c
@@ -45,11 +45,6 @@
 #include 
 
 
-/* Copies all print output to the Xen emergency console apart
-   of standard dom0 handled console */
-#define USE_XEN_CONSOLE
-
-
 /* If console not initialised the printk will be sent to xen serial line 
NOTE: you need to enable verbose in xen/Rules.mk for it to work. */
 static int console_initialised = 0;
@@ -135,7 +130,7 @@ void print(int direct, const char *fmt, va_list args)
 (void)HYPERVISOR_console_io(CONSOLEIO_write, strlen(buf), buf);
 return;
 } else {
-#ifndef USE_XEN_CONSOLE
+#ifndef CONFIG_USE_XEN_CONSOLE
 if(!console_initialised)
 #endif
 (void)HYPERVISOR_console_io(CONSOLEIO_write, strlen(buf), buf);
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] mini-os: add config item for printing via hypervisor

2017-11-21 Thread Juergen Gross
Today Mini-OS will print all console output via the hypervisor, too.

Make this behavior configurable instead and default it to "off".

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 Config.mk | 2 ++
 arch/x86/testbuild/all-no | 1 +
 arch/x86/testbuild/all-yes| 1 +
 arch/x86/testbuild/newxen-yes | 1 +
 console/console.c | 7 +--
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Config.mk b/Config.mk
index 0baedd1..586dce0 100644
--- a/Config.mk
+++ b/Config.mk
@@ -180,6 +180,7 @@ CONFIG_XENBUS ?= y
 CONFIG_XC ?=y
 CONFIG_LWIP ?= $(lwip)
 CONFIG_BALLOON ?= n
+CONFIG_USE_XEN_CONSOLE ?= n
 
 # Export config items as compiler directives
 DEFINES-$(CONFIG_PARAVIRT) += -DCONFIG_PARAVIRT
@@ -197,6 +198,7 @@ DEFINES-$(CONFIG_FBFRONT) += -DCONFIG_FBFRONT
 DEFINES-$(CONFIG_CONSFRONT) += -DCONFIG_CONSFRONT
 DEFINES-$(CONFIG_XENBUS) += -DCONFIG_XENBUS
 DEFINES-$(CONFIG_BALLOON) += -DCONFIG_BALLOON
+DEFINES-$(CONFIG_USE_XEN_CONSOLE) += -DCONFIG_USE_XEN_CONSOLE
 
 DEFINES-y += -D__XEN_INTERFACE_VERSION__=$(XEN_INTERFACE_VERSION)
 
diff --git a/arch/x86/testbuild/all-no b/arch/x86/testbuild/all-no
index 78720c3..1c50bba 100644
--- a/arch/x86/testbuild/all-no
+++ b/arch/x86/testbuild/all-no
@@ -16,3 +16,4 @@ CONFIG_XENBUS = n
 CONFIG_XC = n
 CONFIG_LWIP = n
 CONFIG_BALLOON = n
+CONFIG_USE_XEN_CONSOLE = n
diff --git a/arch/x86/testbuild/all-yes b/arch/x86/testbuild/all-yes
index 303c56b..8732e69 100644
--- a/arch/x86/testbuild/all-yes
+++ b/arch/x86/testbuild/all-yes
@@ -17,3 +17,4 @@ CONFIG_XC = y
 # LWIP is special: it needs support from outside
 CONFIG_LWIP = n
 CONFIG_BALLOON = y
+CONFIG_USE_XEN_CONSOLE = y
diff --git a/arch/x86/testbuild/newxen-yes b/arch/x86/testbuild/newxen-yes
index 907a8a0..9c30c00 100644
--- a/arch/x86/testbuild/newxen-yes
+++ b/arch/x86/testbuild/newxen-yes
@@ -17,4 +17,5 @@ CONFIG_XC = y
 # LWIP is special: it needs support from outside
 CONFIG_LWIP = n
 CONFIG_BALLOON = y
+CONFIG_USE_XEN_CONSOLE = y
 XEN_INTERFACE_VERSION=__XEN_LATEST_INTERFACE_VERSION__
diff --git a/console/console.c b/console/console.c
index 2e04552..6a0b923 100644
--- a/console/console.c
+++ b/console/console.c
@@ -45,11 +45,6 @@
 #include 
 
 
-/* Copies all print output to the Xen emergency console apart
-   of standard dom0 handled console */
-#define USE_XEN_CONSOLE
-
-
 /* If console not initialised the printk will be sent to xen serial line 
NOTE: you need to enable verbose in xen/Rules.mk for it to work. */
 static int console_initialised = 0;
@@ -135,7 +130,7 @@ void print(int direct, const char *fmt, va_list args)
 (void)HYPERVISOR_console_io(CONSOLEIO_write, strlen(buf), buf);
 return;
 } else {
-#ifndef USE_XEN_CONSOLE
+#ifndef CONFIG_USE_XEN_CONSOLE
 if(!console_initialised)
 #endif
 (void)HYPERVISOR_console_io(CONSOLEIO_write, strlen(buf), buf);
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] tools/libxl: mark special pages as reserved in e820 map for PVH

2017-11-21 Thread Juergen Gross
On 21/11/17 12:27, Jan Beulich wrote:
 On 21.11.17 at 12:06,  wrote:
>> The "special pages" for PVH guests include the frames for console and
>> Xenstore ring buffers. Those have to be marked as "Reserved" in the
>> guest's E820 map, as otherwise conflicts might arise later e.g. when
>> hotplugging memory into the guest.
> 
> Afaict this differs from v1 only in no longer adding the extra entry
> for HVM. How does this address the concerns raised on v1 wrt spec
> compliance? v1 having caused problems with hvmloader should not
> have resulted in simply excluding HVM here. That's even more so
> because we mean HVM and PVH to converge in the long run - I'd
> expect that to mean that no clear type distinction would exist
> anymore on libxl.

The difference is for HVM the HVMloader is creating the additional
"Reserved" entry.

> If you want to reserve Xenstore ring buffer and console page,
> why don't you reserve just the two (provided of course they
> live outside of any [fake] PCI device BAR), which then ought to
> also be compatible with plain HVM?

For PVH the "mmio" area is starting with the LAPIC and extends up
to 4GB. And all of this area conflicts with the HVMloader:

(d11) HVM Loader
(d11) Detected Xen v4.10-unstable
(d11) Fail to setup memory map due to conflict on dynamic reserved
memory range.
(d11) *** HVMLoader bug at e820.c:52
(d11) *** HVMLoader crashed.

I guess this should be fixed, but I wanted the patch to be as small
as possible to minimze the risk for 4.10.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-21 Thread Juergen Gross
On 21/11/17 11:42, Andrew Cooper wrote:
> On 21/11/17 07:44, Jan Beulich wrote:
> On 20.11.17 at 17:59,  wrote:
>>> On 11/20/2017 11:43 AM, Jan Beulich wrote:
>>> On 20.11.17 at 17:28,  wrote:
> On 11/20/2017 11:26 AM, Jan Beulich wrote:
> On 20.11.17 at 17:14,  wrote:
>>> What could cause grub2 to fail to find space for the pointer in the
>>> first page? Will we ever have anything in EBDA (which is one of the
>>> possible RSDP locations)?
>> Well, the EBDA (see the B in its name) is again something that's
>> meaningless without there being a BIOS.
> Exactly. So it should always be available for grub to copy the pointer
> there.
 But what use would it be if grub copied it there? It just shouldn't
 be there, neither before nor after grub (just like grub doesn't
 introduce firmware into the system).
>>> So that the guest can find it using standard methods. If Xen can't
>>> guarantee ACPI-compliant placement of the pointer then someone has to
>>> help the guest find it in the expected place. We can do it with a
>>> dedicated entry point by setting the pointer explicitly (although
>>> admittedly this is not done correctly now) or we need to have firmware
>>> (grub2) place it in the "right" location.
>>>
>>> (It does look a bit hacky though)
>> Indeed. Of course ACPI without any actual firmware is sort of odd,
>> too. As to dedicated entry point and its alternatives: Xen itself
>> tells grub (aiui we're talking about a flavor of it running PVH itself)
>> where the RSDP is. Why can't grub forward that information in a
>> suitable way (e.g. via a new tag, or - for Linux - as a new entry
>> in the Linux boot header)?
> 
> Or if the worst comes to the worst, fabricate an acpi_rsdp= command line
> parameter?

This would be easy: just replace the #ifdef CONFIG_KEXEC in
drivers/acpi/osl.c of the Linux kernel with:

#if defined(CONFIG_KEXEC) || defined(CONFIG_XEN_PVH)

and the parameter is usable and active.

Another possibility would be to let grub copy the RSDP to below 1MB and
add the E820 entry for it.

In any case it seems save to let Xen place the RSDP just below 4GB,
together with console and Xenstore pages (so this area just expands).
grub can handle this either on its own or together with the kernel.

Lets see how Roger's plans with BSD look like.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2] tools/libxl: mark special pages as reserved in e820 map for PVH

2017-11-21 Thread Juergen Gross
The "special pages" for PVH guests include the frames for console and
Xenstore ring buffers. Those have to be marked as "Reserved" in the
guest's E820 map, as otherwise conflicts might arise later e.g. when
hotplugging memory into the guest.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
This is a bugfix for PVH guests. Please consider for 4.10.
---
 tools/libxl/libxl_x86.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 5f91fe4f92..d82013f6ed 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -530,6 +530,9 @@ int libxl__arch_domain_construct_memmap(libxl__gc *gc,
 if (d_config->rdms[i].policy != LIBXL_RDM_RESERVE_POLICY_INVALID)
 e820_entries++;
 
+/* Add mmio entry for PVH. */
+if (dom->mmio_size && d_config->b_info.type == LIBXL_DOMAIN_TYPE_PVH)
+e820_entries++;
 
 /* If we should have a highmem range. */
 if (highmem_size)
@@ -564,6 +567,14 @@ int libxl__arch_domain_construct_memmap(libxl__gc *gc,
 nr++;
 }
 
+/* mmio area */
+if (dom->mmio_size && d_config->b_info.type == LIBXL_DOMAIN_TYPE_PVH) {
+e820[nr].addr = dom->mmio_start;
+e820[nr].size = dom->mmio_size;
+e820[nr].type = E820_RESERVED;
+nr++;
+}
+
 for (i = 0; i < MAX_ACPI_MODULES; i++) {
 if (dom->acpi_modules[i].length) {
 e820[nr].addr = dom->acpi_modules[i].guest_addr_out & ~(page_size 
- 1);
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-21 Thread Juergen Gross
On 21/11/17 09:46, Jan Beulich wrote:
 On 21.11.17 at 09:13,  wrote:
>> On 21/11/17 08:50, Jan Beulich wrote:
>> On 20.11.17 at 19:28,  wrote:
 On 20/11/17 17:14, Jan Beulich wrote:
 On 20.11.17 at 16:24,  wrote:
>> So without my patch the RSDP table is loaded e.g. at about 6.5MB when
>> I'm using grub2 (the loaded grub image is about 5.5MB in size and it
>> is being loaded at 1MB).
>>
>> When I'm using the PVH Linux kernel directly RSDP is just below 1MB
>> due to pure luck (the bzImage loader is still using the PV specific
>> ELF notes and this results in the loader believing RSDP is loadable
>> at this address, which is true, but the tests used to come to this
>> conclusion are just not applicable for PVH).
>>
>> So in your opinion we should revoke the PVH support from Xen 4.10,
>> Linux and maybe BSD because RSDP is loaded in middle of RAM of the
>> guest?
>
> So what's wrong with it being put wherever the next free memory
> location is being determined to be by the loader, just like is being
> done for other information, including modules (if any)?

 The RSDP table is marked as "Reserved" in the memory map. So putting it
 somewhere in the middle of the guest's memory will force the guest to
 use 4kB pages instead of 2MB or even 1GB pages. I'd really like to avoid
 this problem, as we've been hit by the very same in HVM guests before
 causing quite measurable performance drops.
>>>
>>> This is a valid point.
>>>
 So I'd rather put it in the first MB as most kernels have to deal with
 small pages at beginning of RAM today. An alternative would be to put
 it just below 4GB where e.g. the console and Xenstore page are located.
>>>
>>> Putting it in the first Mb implies that mappings there will continue to
>>> be 4k ones. I can't, however, see why for PVH that should be
>>> necessary: There's no BIOS and nothing legacy that needs to live
>>> there, so other than HVM it could benefit from using a 1Gb mapping
>>> even at address zero (even if this might be something that can't
>>> be achieved right away). So yes, if anything, the allocation should
>>> be made top down starting from 4Gb. Otoh, I don't see a strict
>>> need for this area to live below 4Gb in the first place.
>>
>> The physical RSDP address in the PVH start info block is 32 bits
>> only. So it can't be above 4GB.
> 
> struct hvm_start_info {
> uint32_t magic; /* Contains the magic value 0x336ec578   
> */
> /* ("xEn3" with the 0x80 bit of the "E" 
> set).*/
> uint32_t version;   /* Version of this structure.
> */
> uint32_t flags; /* SIF_xxx flags.
> */
> uint32_t nr_modules;/* Number of modules passed to the kernel.   
> */
> uint64_t modlist_paddr; /* Physical address of an array of   
> */
> /* hvm_modlist_entry.
> */
> uint64_t cmdline_paddr; /* Physical address of the command line. 
> */
> uint64_t rsdp_paddr;/* Physical address of the RSDP ACPI data
> */
> /* structure.
> */
> };

Oh, it seems I have been looking into an outdated document. Thanks for
the correction.

> Granted a comment a few lines up in the public header says "NB: Xen
> on x86 will always try to place all the data below the 4GiB boundary."

Okay.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-21 Thread Juergen Gross
On 21/11/17 08:50, Jan Beulich wrote:
 On 20.11.17 at 19:28,  wrote:
>> On 20/11/17 17:14, Jan Beulich wrote:
>> On 20.11.17 at 16:24,  wrote:
 On 20/11/17 15:20, Jan Beulich wrote:
 On 20.11.17 at 15:14,  wrote:
>> On 20/11/17 14:56, Boris Ostrovsky wrote:
>>> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>> On 20.11.17 at 12:20,  wrote:
> Which restriction? I'm loading the RSDP table to its architectural
> correct addres if possible, otherwise it will be loaded to the same
> address as without my patch. So I'm not adding a restriction, but
> removing one.
 What is "architecturally correct" in PVH can't be read out of
 specs other than what we write down. When there's no BIOS,
 placing anything right below the 1Mb boundary is at least
 bogus.
>>>
>>> Unless it's a UEFI boot -- where else would you put it? Aren't these two
>>> (UEFI and non-UEFI) the only two options that the ACPI spec provides?
>>
>> I think Jan is right: for PVH its _our_ job to define the correct
>> placement. Which still can be the same as in the BIOS case, making
>> it easier to adapt any guest systems.
>>
>> So I'd say: The RSDP address in PVH case is passed in the PVH start
>> info block to the guest. In case there is no conflict with the
>> physical load address of the guest kernel the preferred address of
>> the RSDP is right below the 1MB boundary.
>>
>> Would this wording be okay?
>
> To be honest (and in case it wasn't sufficiently clear form my
> earlier replies) - I'm pretty much opposed to this below-1Mb thing.
> There ought to be just plain RAM there for PVH.

 So without my patch the RSDP table is loaded e.g. at about 6.5MB when
 I'm using grub2 (the loaded grub image is about 5.5MB in size and it
 is being loaded at 1MB).

 When I'm using the PVH Linux kernel directly RSDP is just below 1MB
 due to pure luck (the bzImage loader is still using the PV specific
 ELF notes and this results in the loader believing RSDP is loadable
 at this address, which is true, but the tests used to come to this
 conclusion are just not applicable for PVH).

 So in your opinion we should revoke the PVH support from Xen 4.10,
 Linux and maybe BSD because RSDP is loaded in middle of RAM of the
 guest?
>>>
>>> So what's wrong with it being put wherever the next free memory
>>> location is being determined to be by the loader, just like is being
>>> done for other information, including modules (if any)?
>>
>> The RSDP table is marked as "Reserved" in the memory map. So putting it
>> somewhere in the middle of the guest's memory will force the guest to
>> use 4kB pages instead of 2MB or even 1GB pages. I'd really like to avoid
>> this problem, as we've been hit by the very same in HVM guests before
>> causing quite measurable performance drops.
> 
> This is a valid point.
> 
>> So I'd rather put it in the first MB as most kernels have to deal with
>> small pages at beginning of RAM today. An alternative would be to put
>> it just below 4GB where e.g. the console and Xenstore page are located.
> 
> Putting it in the first Mb implies that mappings there will continue to
> be 4k ones. I can't, however, see why for PVH that should be
> necessary: There's no BIOS and nothing legacy that needs to live
> there, so other than HVM it could benefit from using a 1Gb mapping
> even at address zero (even if this might be something that can't
> be achieved right away). So yes, if anything, the allocation should
> be made top down starting from 4Gb. Otoh, I don't see a strict
> need for this area to live below 4Gb in the first place.

The physical RSDP address in the PVH start info block is 32 bits
only. So it can't be above 4GB.

 Doing it in a proper way you are outlining above would render
 current PVH guests unusable.
>>>
>>> I'm afraid I don't understand which outline of mine you refer to.
>>> Iirc all I said was that placing it below 1Mb is bogus. Top of RAM
>>> (as I think I saw being mentioned elsewhere) probably isn't much
>>> better. Special locations should be used only if there's really no
>>> other way to convey information.
>>
>> I believe it is better to define where reserved memory areas are located
>> in order to make it easier for a kernel to deal with the consequences. A
>> location like "somewhere in memory, either just after the loaded kernel,
>> or after the boot loader which was loaded previously" seems not to be a
>> proper solution. Especially as a change in size of the boot loader could
>> make it impossible to load the kernel at the desired location, as this
>> would contradict the information returned by the hypervisor in the
>> memory map.
> 
> Well, you imply here that the kernel has to search for the structure.
> In that case having 

Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 17:14, Boris Ostrovsky wrote:
> On 11/20/2017 10:27 AM, Juergen Gross wrote:
>> On 20/11/17 15:25, Boris Ostrovsky wrote:
>>> On 11/20/2017 09:14 AM, Juergen Gross wrote:
>>>> On 20/11/17 14:56, Boris Ostrovsky wrote:
>>>>> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>>>>>>>> On 20.11.17 at 12:20, <jgr...@suse.com> wrote:
>>>>>>> Which restriction? I'm loading the RSDP table to its architectural
>>>>>>> correct addres if possible, otherwise it will be loaded to the same
>>>>>>> address as without my patch. So I'm not adding a restriction, but
>>>>>>> removing one.
>>>>>> What is "architecturally correct" in PVH can't be read out of
>>>>>> specs other than what we write down. When there's no BIOS,
>>>>>> placing anything right below the 1Mb boundary is at least
>>>>>> bogus.
>>>>> Unless it's a UEFI boot -- where else would you put it? Aren't these two
>>>>> (UEFI and non-UEFI) the only two options that the ACPI spec provides?
>>>> I think Jan is right: for PVH its _our_ job to define the correct
>>>> placement. 
>>> Yes, and if it is placed in a non-standard location then the guest will
>>> have to deal with it in a non-standard way. Which we can in Linux by
>>> setting acpi_rsdp pointer in the special PVH entry point, before jumping
>>> to Linux "standard" entry --- startup_{32|64}().
>>>
>>> But if your goal is to avoid that special entry point (and thus not set
>>> acpi_rsdp) then how do you expect kernel to find RSDP?
>>>
>>>> Which still can be the same as in the BIOS case, making
>>>> it easier to adapt any guest systems.
>>>>
>>>> So I'd say: The RSDP address in PVH case is passed in the PVH start
>>>> info block to the guest. In case there is no conflict with the
>>>> physical load address of the guest kernel the preferred address of
>>>> the RSDP is right below the 1MB boundary.
>>> And what do we do if there *is* a conflict?
>> Either as without my patch: use the first available free memory page.
>>
>> Or: add a domain config parameter for specifying the RSDP address
>> (e.g. default: as today, top: end of RAM, legacy: just below 1MB, or
>> a specific value) and fail to load in case of a conflict.
> 
> This feels like a band-aid to work around a problem that we want to fix
> in the long term anyway.
> 
> What could cause grub2 to fail to find space for the pointer in the
> first page? Will we ever have anything in EBDA (which is one of the
> possible RSDP locations)?

This isn't something grub2 has to deal with. The RSDP is in a reserved
area of the memory, so it can't be relocated by grub2.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 17:14, Jan Beulich wrote:
 On 20.11.17 at 16:24,  wrote:
>> On 20/11/17 15:20, Jan Beulich wrote:
>> On 20.11.17 at 15:14,  wrote:
 On 20/11/17 14:56, Boris Ostrovsky wrote:
> On 11/20/2017 06:50 AM, Jan Beulich wrote:
> On 20.11.17 at 12:20,  wrote:
>>> Which restriction? I'm loading the RSDP table to its architectural
>>> correct addres if possible, otherwise it will be loaded to the same
>>> address as without my patch. So I'm not adding a restriction, but
>>> removing one.
>> What is "architecturally correct" in PVH can't be read out of
>> specs other than what we write down. When there's no BIOS,
>> placing anything right below the 1Mb boundary is at least
>> bogus.
>
> Unless it's a UEFI boot -- where else would you put it? Aren't these two
> (UEFI and non-UEFI) the only two options that the ACPI spec provides?

 I think Jan is right: for PVH its _our_ job to define the correct
 placement. Which still can be the same as in the BIOS case, making
 it easier to adapt any guest systems.

 So I'd say: The RSDP address in PVH case is passed in the PVH start
 info block to the guest. In case there is no conflict with the
 physical load address of the guest kernel the preferred address of
 the RSDP is right below the 1MB boundary.

 Would this wording be okay?
>>>
>>> To be honest (and in case it wasn't sufficiently clear form my
>>> earlier replies) - I'm pretty much opposed to this below-1Mb thing.
>>> There ought to be just plain RAM there for PVH.
>>
>> So without my patch the RSDP table is loaded e.g. at about 6.5MB when
>> I'm using grub2 (the loaded grub image is about 5.5MB in size and it
>> is being loaded at 1MB).
>>
>> When I'm using the PVH Linux kernel directly RSDP is just below 1MB
>> due to pure luck (the bzImage loader is still using the PV specific
>> ELF notes and this results in the loader believing RSDP is loadable
>> at this address, which is true, but the tests used to come to this
>> conclusion are just not applicable for PVH).
>>
>> So in your opinion we should revoke the PVH support from Xen 4.10,
>> Linux and maybe BSD because RSDP is loaded in middle of RAM of the
>> guest?
> 
> So what's wrong with it being put wherever the next free memory
> location is being determined to be by the loader, just like is being
> done for other information, including modules (if any)?

The RSDP table is marked as "Reserved" in the memory map. So putting it
somewhere in the middle of the guest's memory will force the guest to
use 4kB pages instead of 2MB or even 1GB pages. I'd really like to avoid
this problem, as we've been hit by the very same in HVM guests before
causing quite measurable performance drops.

So I'd rather put it in the first MB as most kernels have to deal with
small pages at beginning of RAM today. An alternative would be to put
it just below 4GB where e.g. the console and Xenstore page are located.

> 
>> Doing it in a proper way you are outlining above would render
>> current PVH guests unusable.
> 
> I'm afraid I don't understand which outline of mine you refer to.
> Iirc all I said was that placing it below 1Mb is bogus. Top of RAM
> (as I think I saw being mentioned elsewhere) probably isn't much
> better. Special locations should be used only if there's really no
> other way to convey information.

I believe it is better to define where reserved memory areas are located
in order to make it easier for a kernel to deal with the consequences. A
location like "somewhere in memory, either just after the loaded kernel,
or after the boot loader which was loaded previously" seems not to be a
proper solution. Especially as a change in size of the boot loader could
make it impossible to load the kernel at the desired location, as this
would contradict the information returned by the hypervisor in the
memory map.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 15:25, Boris Ostrovsky wrote:
> On 11/20/2017 09:14 AM, Juergen Gross wrote:
>> On 20/11/17 14:56, Boris Ostrovsky wrote:
>>> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>>>>>> On 20.11.17 at 12:20, <jgr...@suse.com> wrote:
>>>>> Which restriction? I'm loading the RSDP table to its architectural
>>>>> correct addres if possible, otherwise it will be loaded to the same
>>>>> address as without my patch. So I'm not adding a restriction, but
>>>>> removing one.
>>>> What is "architecturally correct" in PVH can't be read out of
>>>> specs other than what we write down. When there's no BIOS,
>>>> placing anything right below the 1Mb boundary is at least
>>>> bogus.
>>> Unless it's a UEFI boot -- where else would you put it? Aren't these two
>>> (UEFI and non-UEFI) the only two options that the ACPI spec provides?
>> I think Jan is right: for PVH its _our_ job to define the correct
>> placement. 
> 
> Yes, and if it is placed in a non-standard location then the guest will
> have to deal with it in a non-standard way. Which we can in Linux by
> setting acpi_rsdp pointer in the special PVH entry point, before jumping
> to Linux "standard" entry --- startup_{32|64}().
> 
> But if your goal is to avoid that special entry point (and thus not set
> acpi_rsdp) then how do you expect kernel to find RSDP?
> 
>> Which still can be the same as in the BIOS case, making
>> it easier to adapt any guest systems.
>>
>> So I'd say: The RSDP address in PVH case is passed in the PVH start
>> info block to the guest. In case there is no conflict with the
>> physical load address of the guest kernel the preferred address of
>> the RSDP is right below the 1MB boundary.
> 
> And what do we do if there *is* a conflict?

Either as without my patch: use the first available free memory page.

Or: add a domain config parameter for specifying the RSDP address
(e.g. default: as today, top: end of RAM, legacy: just below 1MB, or
a specific value) and fail to load in case of a conflict.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 15:20, Jan Beulich wrote:
 On 20.11.17 at 15:14,  wrote:
>> On 20/11/17 14:56, Boris Ostrovsky wrote:
>>> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>> On 20.11.17 at 12:20,  wrote:
> Which restriction? I'm loading the RSDP table to its architectural
> correct addres if possible, otherwise it will be loaded to the same
> address as without my patch. So I'm not adding a restriction, but
> removing one.
 What is "architecturally correct" in PVH can't be read out of
 specs other than what we write down. When there's no BIOS,
 placing anything right below the 1Mb boundary is at least
 bogus.
>>>
>>> Unless it's a UEFI boot -- where else would you put it? Aren't these two
>>> (UEFI and non-UEFI) the only two options that the ACPI spec provides?
>>
>> I think Jan is right: for PVH its _our_ job to define the correct
>> placement. Which still can be the same as in the BIOS case, making
>> it easier to adapt any guest systems.
>>
>> So I'd say: The RSDP address in PVH case is passed in the PVH start
>> info block to the guest. In case there is no conflict with the
>> physical load address of the guest kernel the preferred address of
>> the RSDP is right below the 1MB boundary.
>>
>> Would this wording be okay?
> 
> To be honest (and in case it wasn't sufficiently clear form my
> earlier replies) - I'm pretty much opposed to this below-1Mb thing.
> There ought to be just plain RAM there for PVH.

So without my patch the RSDP table is loaded e.g. at about 6.5MB when
I'm using grub2 (the loaded grub image is about 5.5MB in size and it
is being loaded at 1MB).

When I'm using the PVH Linux kernel directly RSDP is just below 1MB
due to pure luck (the bzImage loader is still using the PV specific
ELF notes and this results in the loader believing RSDP is loadable
at this address, which is true, but the tests used to come to this
conclusion are just not applicable for PVH).

So in your opinion we should revoke the PVH support from Xen 4.10,
Linux and maybe BSD because RSDP is loaded in middle of RAM of the
guest? Doing it in a proper way you are outlining above would render
current PVH guests unusable.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 14:56, Boris Ostrovsky wrote:
> On 11/20/2017 06:50 AM, Jan Beulich wrote:
> On 20.11.17 at 12:20,  wrote:
>>> Which restriction? I'm loading the RSDP table to its architectural
>>> correct addres if possible, otherwise it will be loaded to the same
>>> address as without my patch. So I'm not adding a restriction, but
>>> removing one.
>> What is "architecturally correct" in PVH can't be read out of
>> specs other than what we write down. When there's no BIOS,
>> placing anything right below the 1Mb boundary is at least
>> bogus.
> 
> Unless it's a UEFI boot -- where else would you put it? Aren't these two
> (UEFI and non-UEFI) the only two options that the ACPI spec provides?

I think Jan is right: for PVH its _our_ job to define the correct
placement. Which still can be the same as in the BIOS case, making
it easier to adapt any guest systems.

So I'd say: The RSDP address in PVH case is passed in the PVH start
info block to the guest. In case there is no conflict with the
physical load address of the guest kernel the preferred address of
the RSDP is right below the 1MB boundary.

Would this wording be okay?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 11:57, Andrew Cooper wrote:
> On 20/11/17 10:43, Juergen Gross wrote:
>> On 20/11/17 11:21, Andrew Cooper wrote:
>>> On 20/11/17 10:04, Juergen Gross wrote:
>>>> On 20/11/17 10:58, Andrew Cooper wrote:
>>>>> On 20/11/2017 09:55, Juergen Gross wrote:
>>>>>> On 20/11/17 10:51, Roger Pau Monné wrote:
>>>>>>> Adding xen-devel, dropped it on my reply. 
>>>>>>>
>>>>>>> Replying from my phone, sorry for the formatting. 
>>>>>>>
>>>>>>>
>>>>>>> El 20 nov. 2017 9:35, "Juergen Gross" <jgr...@suse.com
>>>>>>> <mailto:jgr...@suse.com>> escribió:
>>>>>>>
>>>>>>> For PVH domains loading of the ACPI RSDP table is done via
>>>>>>> allocating
>>>>>>> a domain loader segment after having loaded the kernel. This
>>>>>>> leads to
>>>>>>> the RSDP table being loaded at an arbitrary guest address 
>>>>>>> instead of
>>>>>>> the architectural correct address just below 1MB. 
>>>>>>>
>>>>>>>
>>>>>>> AFAIK this is only true for legacy BIOS boot, when using UEFI the
>>>>>>> RSDP can be anywhere in memory, hence grub2 must already have an
>>>>>>> alternative way of finding the RSDP apart from scanning the low 1MB.
>>>>>> The problem isn't grub2, but the loaded linux kernel. Without this
>>>>>> patch Linux won't find the RSDP when booted in a PVH domain via grub2.
>>>>>>
>>>>>> I could modify grub2 even further to move the RSDP to the correct
>>>>>> address, but I think doing it correctly on Xen side is the better
>>>>>> option.
>>>>> Why?  The PVH info block contains a pointer directly to the RSDP, and
>>>>> Linux should be following this rather than scanning for it using the
>>>>> legacy method.
>>>> Oh no, please not this discussion again.
>>>>
>>>> We already had a very long discussion how to do PVH support in grub2,
>>>> and the outcome was to try to use the standard boot entry of the kernel
>>>> instead the PVH sepcific one.
>>>>
>>>> The Linux kernel right now doesn't make use of the RSDP pointer in the
>>>> PVH info block, so I think we shouldn't change this when using grub2.
>>> I clearly missed the previous discussion, and I don't advocate using yet
>>> another PVH-specific entry point, but how does Linux cope in other
>>> non-BIOS environments?  Does it genuinely rely exclusively on the legacy
>>> mechanism?
>> Looking at the code I think so, yes. Maybe there are cases where no RSDP
>> is needed, but in the grub2/PVH case we need it to distinguish PVH from
>> HVM.
> 
> In which case, being a Linux limitation, I think it is wrong to
> unilaterally apply this restriction to all other PVH guests.

Which restriction? I'm loading the RSDP table to its architectural
correct addres if possible, otherwise it will be loaded to the same
address as without my patch. So I'm not adding a restriction, but
removing one.

> Doing this in grub seems like the more appropriate place IMO.

I don't think so.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen-netfront: remove warning when unloading module

2017-11-20 Thread Juergen Gross
On 20/11/17 11:49, Wei Liu wrote:
> CC netfront maintainers.
> 
> On Mon, Nov 20, 2017 at 11:41:09AM +0100, Eduardo Otubo wrote:
>> When unloading module xen_netfront from guest, dmesg would output
>> warning messages like below:
>>
>>   [  105.236836] xen:grant_table: WARNING: g.e. 0x903 still in use!
>>   [  105.236839] deferring g.e. 0x903 (pfn 0x35805)
>>
>> This problem relies on netfront and netback being out of sync. By the time
>> netfront revokes the g.e.'s netback didn't have enough time to free all of
>> them, hence displaying the warnings on dmesg.
>>
>> The trick here is to make netfront to wait until netback frees all the g.e.'s
>> and only then continue to cleanup for the module removal, and this is done by
>> manipulating both device states.
>>
>> Signed-off-by: Eduardo Otubo 
>> ---
>>  drivers/net/xen-netfront.c | 11 +++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index 8b8689c6d887..b948e2a1ce40 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -2130,6 +2130,17 @@ static int xennet_remove(struct xenbus_device *dev)
>>  
>>  dev_dbg(>dev, "%s\n", dev->nodename);
>>  
>> +xenbus_switch_state(dev, XenbusStateClosing);
>> +while (xenbus_read_driver_state(dev->otherend) != XenbusStateClosing){
>> +cpu_relax();
>> +schedule();
>> +}
>> +xenbus_switch_state(dev, XenbusStateClosed);
>> +while (dev->xenbus_state != XenbusStateClosed){
>> +cpu_relax();
>> +schedule();
>> +}

I really don't like the busy waits.

Can't you use e.g. a wait queue and wait_event_interruptible() instead?

BTW: what happens if the device is already in closed state if you enter
xennet_remove()? In case this is impossible, please add a comment to
indicate you've thought about that case.

Other than that: you should run ./scripts/checkpatch.p1 against your
patch to avoid common style problems.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 11:21, Andrew Cooper wrote:
> On 20/11/17 10:04, Juergen Gross wrote:
>> On 20/11/17 10:58, Andrew Cooper wrote:
>>> On 20/11/2017 09:55, Juergen Gross wrote:
>>>> On 20/11/17 10:51, Roger Pau Monné wrote:
>>>>> Adding xen-devel, dropped it on my reply. 
>>>>>
>>>>> Replying from my phone, sorry for the formatting. 
>>>>>
>>>>>
>>>>> El 20 nov. 2017 9:35, "Juergen Gross" <jgr...@suse.com
>>>>> <mailto:jgr...@suse.com>> escribió:
>>>>>
>>>>> For PVH domains loading of the ACPI RSDP table is done via
>>>>> allocating
>>>>> a domain loader segment after having loaded the kernel. This
>>>>> leads to
>>>>> the RSDP table being loaded at an arbitrary guest address instead 
>>>>> of
>>>>> the architectural correct address just below 1MB. 
>>>>>
>>>>>
>>>>> AFAIK this is only true for legacy BIOS boot, when using UEFI the
>>>>> RSDP can be anywhere in memory, hence grub2 must already have an
>>>>> alternative way of finding the RSDP apart from scanning the low 1MB.
>>>> The problem isn't grub2, but the loaded linux kernel. Without this
>>>> patch Linux won't find the RSDP when booted in a PVH domain via grub2.
>>>>
>>>> I could modify grub2 even further to move the RSDP to the correct
>>>> address, but I think doing it correctly on Xen side is the better
>>>> option.
>>> Why?  The PVH info block contains a pointer directly to the RSDP, and
>>> Linux should be following this rather than scanning for it using the
>>> legacy method.
>> Oh no, please not this discussion again.
>>
>> We already had a very long discussion how to do PVH support in grub2,
>> and the outcome was to try to use the standard boot entry of the kernel
>> instead the PVH sepcific one.
>>
>> The Linux kernel right now doesn't make use of the RSDP pointer in the
>> PVH info block, so I think we shouldn't change this when using grub2.
> 
> I clearly missed the previous discussion, and I don't advocate using yet
> another PVH-specific entry point, but how does Linux cope in other
> non-BIOS environments?  Does it genuinely rely exclusively on the legacy
> mechanism?

Looking at the code I think so, yes. Maybe there are cases where no RSDP
is needed, but in the grub2/PVH case we need it to distinguish PVH from
HVM.

Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 10:58, Andrew Cooper wrote:
> On 20/11/2017 09:55, Juergen Gross wrote:
>> On 20/11/17 10:51, Roger Pau Monné wrote:
>>> Adding xen-devel, dropped it on my reply. 
>>>
>>> Replying from my phone, sorry for the formatting. 
>>>
>>>
>>> El 20 nov. 2017 9:35, "Juergen Gross" <jgr...@suse.com
>>> <mailto:jgr...@suse.com>> escribió:
>>>
>>> For PVH domains loading of the ACPI RSDP table is done via
>>> allocating
>>> a domain loader segment after having loaded the kernel. This
>>> leads to
>>> the RSDP table being loaded at an arbitrary guest address instead of
>>> the architectural correct address just below 1MB. 
>>>
>>>
>>> AFAIK this is only true for legacy BIOS boot, when using UEFI the
>>> RSDP can be anywhere in memory, hence grub2 must already have an
>>> alternative way of finding the RSDP apart from scanning the low 1MB.
>> The problem isn't grub2, but the loaded linux kernel. Without this
>> patch Linux won't find the RSDP when booted in a PVH domain via grub2.
>>
>> I could modify grub2 even further to move the RSDP to the correct
>> address, but I think doing it correctly on Xen side is the better
>> option.
> 
> Why?  The PVH info block contains a pointer directly to the RSDP, and
> Linux should be following this rather than scanning for it using the
> legacy method.

Oh no, please not this discussion again.

We already had a very long discussion how to do PVH support in grub2,
and the outcome was to try to use the standard boot entry of the kernel
instead the PVH sepcific one.

The Linux kernel right now doesn't make use of the RSDP pointer in the
PVH info block, so I think we shouldn't change this when using grub2.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
On 20/11/17 10:51, Roger Pau Monné wrote:
> Adding xen-devel, dropped it on my reply. 
> 
> Replying from my phone, sorry for the formatting. 
> 
> 
> El 20 nov. 2017 9:35, "Juergen Gross" <jgr...@suse.com
> <mailto:jgr...@suse.com>> escribió:
> 
> For PVH domains loading of the ACPI RSDP table is done via
> allocating
> a domain loader segment after having loaded the kernel. This
> leads to
> the RSDP table being loaded at an arbitrary guest address instead of
> the architectural correct address just below 1MB. 
> 
> 
> AFAIK this is only true for legacy BIOS boot, when using UEFI the
> RSDP can be anywhere in memory, hence grub2 must already have an
> alternative way of finding the RSDP apart from scanning the low 1MB.

The problem isn't grub2, but the loaded linux kernel. Without this
patch Linux won't find the RSDP when booted in a PVH domain via grub2.

I could modify grub2 even further to move the RSDP to the correct
address, but I think doing it correctly on Xen side is the better
option.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Juergen Gross
For PVH domains loading of the ACPI RSDP table is done via allocating
a domain loader segment after having loaded the kernel. This leads to
the RSDP table being loaded at an arbitrary guest address instead of
the architectural correct address just below 1MB.

When using the Linux kernel this is currently no problem as the
bzImage loader is being used, which is loading ACPI tables via an
alternative method.

Using grub2 however exposes this problem leading to the selected
kernel no longer being able to find the RSDP table.

To solve this issue allow to load the RSDP table below the already
loaded kernel if this space hasn't been used before.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
Not sure if this is acceptable for 4.10, but I think PVH guest support
should include the possibility to use grub2 as a boot loader.

So please consider this patch for 4.10.
---
 tools/libxc/xc_dom_hvmloader.c | 39 +++
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/tools/libxc/xc_dom_hvmloader.c b/tools/libxc/xc_dom_hvmloader.c
index 59f94e51e5..2284c7f9df 100644
--- a/tools/libxc/xc_dom_hvmloader.c
+++ b/tools/libxc/xc_dom_hvmloader.c
@@ -135,20 +135,43 @@ static int module_init_one(struct xc_dom_image *dom,
 {
 struct xc_dom_seg seg;
 void *dest;
+xen_pfn_t start, end;
+unsigned int page_size = XC_DOM_PAGE_SIZE(dom);
 
 if ( module->length )
 {
-if ( xc_dom_alloc_segment(dom, , name, 0, module->length) )
-goto err;
-dest = xc_dom_seg_to_ptr(dom, );
-if ( dest == NULL )
+/*
+ * Check for module located below kernel.
+ * Make sure not to be fooled by a kernel based on virtual address.
+ */
+if ( module->guest_addr_out && !(dom->kernel_seg.vstart >> 32) &&
+ module->guest_addr_out + module->length <= dom->kernel_seg.vstart 
)
 {
-DOMPRINTF("%s: xc_dom_seg_to_ptr(dom, ) => NULL",
-  __FUNCTION__);
-goto err;
+start = module->guest_addr_out / page_size;
+end = (module->guest_addr_out + module->length + page_size - 1) /
+  page_size;
+dest = xc_dom_pfn_to_ptr(dom, start, end - start);
+if ( dest == NULL )
+{
+DOMPRINTF("%s: xc_dom_pfn_to_ptr() => NULL", __FUNCTION__);
+goto err;
+}
+dest += module->guest_addr_out - start * page_size;
+}
+else
+{
+if ( xc_dom_alloc_segment(dom, , name, 0, module->length) )
+goto err;
+dest = xc_dom_seg_to_ptr(dom, );
+if ( dest == NULL )
+{
+DOMPRINTF("%s: xc_dom_seg_to_ptr(dom, ) => NULL",
+  __FUNCTION__);
+goto err;
+}
+module->guest_addr_out = seg.vstart;
 }
 memcpy(dest, module->data, module->length);
-module->guest_addr_out = seg.vstart;
 
 assert(dom->mmio_start > 0 && dom->mmio_start < UINT32_MAX);
 if ( module->guest_addr_out > dom->mmio_start ||
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 03/13] x86/paravirt: Convert native patch assembly code strings to macros

2017-11-18 Thread Juergen Gross
On 17/11/17 20:42, Josh Poimboeuf wrote:
> On Fri, Nov 17, 2017 at 08:10:13PM +0100, Juergen Gross wrote:
>> On 17/11/17 19:07, Borislav Petkov wrote:
>>> On Wed, Oct 04, 2017 at 10:58:24AM -0500, Josh Poimboeuf wrote:
>>>> Convert the hard-coded native patch assembly code strings to macros to
>>>> facilitate sharing common code between 32-bit and 64-bit.
>>>>
>>>> These macros will also be used by a future patch which requires the GCC
>>>> extended asm syntax of two '%' characters instead of one when specifying
>>>> a register name.
>>>>
>>>> Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
>>>> ---
>>>>  arch/x86/include/asm/special_insns.h | 24 
>>>>  arch/x86/kernel/paravirt_patch_32.c  | 21 +++--
>>>>  arch/x86/kernel/paravirt_patch_64.c  | 29 +++--
>>>>  3 files changed, 50 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/special_insns.h 
>>>> b/arch/x86/include/asm/special_insns.h
>>>> index ac402c6fc24b..0549c5f2c1b3 100644
>>>> --- a/arch/x86/include/asm/special_insns.h
>>>> +++ b/arch/x86/include/asm/special_insns.h
>>>> @@ -6,6 +6,30 @@
>>>>  
>>>>  #include 
>>>>  
>>>> +#ifdef CONFIG_X86_64
>>>> +# define _REG_ARG1"%rdi"
>>>> +# define NATIVE_IDENTITY_32   "mov %edi, %eax"
>>>
>>> Yeah, that "identity" looks strange. How about NATIVE_NOOP and
>>> NATIVE_NOOP_32 ?
>>
>> Those are not NOPs. They return the identical value which was passed to
>> them. So identity isn't a bad name after all.
> 
> Right, like the math identity function:
> 
>   https://en.wikipedia.org/wiki/Identity_function
> 
>>>> +# define NATIVE_USERGS_SYSRET64   "swapgs; sysretq"
>>>> +#else
>>>> +# define _REG_ARG1"%eax"
>>>> +#endif
>>>> +
>>>> +#define _REG_RET  "%" _ASM_AX
>>>> +
>>>> +#define NATIVE_ZERO   "xor " _REG_ARG1 ", " _REG_ARG1
>>>
>>> NATIVE_ZERO_OUT
>>>
>>> I guess. NATIVE_ZERO reads like the native representation of 0 :-)
>>
>> NATIVE_ZERO_ARG1?
> 
> On a slight tangent, does anybody know why it zeros the arg?

Why are _you_ asking? You've introduced it.

> The only place it's used is here:
> 
> #if defined(CONFIG_PARAVIRT_SPINLOCKS)
> DEF_NATIVE(pv_lock_ops,   queued_spin_unlock, 
> NATIVE_QUEUED_SPIN_UNLOCK);
> DEF_NATIVE(pv_lock_ops,   vcpu_is_preempted,  NATIVE_ZERO);
> #endif
> 
> Isn't that a bug?  Seems like it should _return_ zero.  Zeroing the arg
> shouldn't have any effect.

Right. Before that patch it _did_ return zero instead of zeroing arg1.

> If I'm right, we could call it NATIVE_FALSE.

I'd prefer NATIVE_ZERO, as it will be usable for non-boolean cases, too.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 03/13] x86/paravirt: Convert native patch assembly code strings to macros

2017-11-17 Thread Juergen Gross
On 17/11/17 19:07, Borislav Petkov wrote:
> On Wed, Oct 04, 2017 at 10:58:24AM -0500, Josh Poimboeuf wrote:
>> Convert the hard-coded native patch assembly code strings to macros to
>> facilitate sharing common code between 32-bit and 64-bit.
>>
>> These macros will also be used by a future patch which requires the GCC
>> extended asm syntax of two '%' characters instead of one when specifying
>> a register name.
>>
>> Signed-off-by: Josh Poimboeuf 
>> ---
>>  arch/x86/include/asm/special_insns.h | 24 
>>  arch/x86/kernel/paravirt_patch_32.c  | 21 +++--
>>  arch/x86/kernel/paravirt_patch_64.c  | 29 +++--
>>  3 files changed, 50 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/special_insns.h 
>> b/arch/x86/include/asm/special_insns.h
>> index ac402c6fc24b..0549c5f2c1b3 100644
>> --- a/arch/x86/include/asm/special_insns.h
>> +++ b/arch/x86/include/asm/special_insns.h
>> @@ -6,6 +6,30 @@
>>  
>>  #include 
>>  
>> +#ifdef CONFIG_X86_64
>> +# define _REG_ARG1  "%rdi"
>> +# define NATIVE_IDENTITY_32 "mov %edi, %eax"
> 
> Yeah, that "identity" looks strange. How about NATIVE_NOOP and
> NATIVE_NOOP_32 ?

Those are not NOPs. They return the identical value which was passed to
them. So identity isn't a bad name after all.

> 
>> +# define NATIVE_USERGS_SYSRET64 "swapgs; sysretq"
>> +#else
>> +# define _REG_ARG1  "%eax"
>> +#endif
>> +
>> +#define _REG_RET"%" _ASM_AX
>> +
>> +#define NATIVE_ZERO "xor " _REG_ARG1 ", " _REG_ARG1
> 
> NATIVE_ZERO_OUT
> 
> I guess. NATIVE_ZERO reads like the native representation of 0 :-)

NATIVE_ZERO_ARG1?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] tools/libxl: mark hvm mmio area as reserved in e820 map

2017-11-17 Thread Juergen Gross
On 17/11/17 12:47, Juergen Gross wrote:
> Make sure the HVM mmio area (especially console and Xenstore pages) is
> marked as "reserved" in the guest's E820 map, as otherwise conflicts
> might arise later, e.g. when hotplugging memory into the guest.
> 
> Signed-off-by: Juergen Gross <jgr...@suse.com>
> ---
> This is a bugfix for PVH and HVM guests. Please consider for 4.10.

Please ignore this patch, it upsets HVMloader.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] tools/libxl: mark hvm mmio area as reserved in e820 map

2017-11-17 Thread Juergen Gross
On 17/11/17 13:26, Jan Beulich wrote:
 On 17.11.17 at 12:47,  wrote:
>> Make sure the HVM mmio area (especially console and Xenstore pages) is
>> marked as "reserved" in the guest's E820 map, as otherwise conflicts
>> might arise later, e.g. when hotplugging memory into the guest.
> 
> This is very certainly wrong. Have you looked at a couple of physical
> machines? Have you found an E820_RESERVED area on any of them for
> the MMIO hole? Two examples I can present right away:
> 
> <6>BIOS-e820: [mem 0xc93f-0xc9f8cfff] reserved
> <6>BIOS-e820: [mem 0xc9f8d000-0xc9fdefff] ACPI data
> <6>BIOS-e820: [mem 0xc9fdf000-0xcac82fff] ACPI NVS
> <6>BIOS-e820: [mem 0xcac83000-0xcb172fff] reserved
> <6>BIOS-e820: [mem 0xcb173000-0xcb173fff] usable
> <6>BIOS-e820: [mem 0xcb174000-0xcb181fff] reserved
> <6>BIOS-e820: [mem 0xcb182000-0xccff] usable
> <6>BIOS-e820: [mem 0xcd00-0xcdff] reserved
> <6>BIOS-e820: [mem 0xd000-0xdfff] reserved
> <6>BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
> <6>BIOS-e820: [mem 0xff00-0x] reserved
> 
> and
> 
> (XEN)  cf4bd000 - cf4bf000 (reserved)
> (XEN)  cf4bf000 - cf636000 (usable)
> (XEN)  cf636000 - cf7bf000 (ACPI NVS)
> (XEN)  cf7bf000 - cf7df000 (usable)
> (XEN)  cf7df000 - cf7ff000 (ACPI data)
> (XEN)  cf7ff000 - cf80 (usable)
> (XEN)  cf80 - d000 (reserved)
> (XEN)  f800 - fd00 (reserved)
> (XEN)  ffe0 - 0001 (reserved)
> 
> Things covered by E820_RESERVED include the MCFG area, yes, but
> not most other parts. The OS has to either be careful or consult
> ACPI for further resource usage details. In particular, the ACPI spec
> says
> 
> "The platform boot firmware does not return a range description for
>  the memory mapping of PCI devices, ISA Option ROMs, and ISA Plug
>  and Play cards because the OS has mechanisms available to detect
>  them."
> 
> See the section "E820 Assumptions and Limitations" for further details.

So is it _wrong_ to return the mmio area as reserved? We at least want
the shared console and Xenstore page to be marked as reserved, and those
are part of the mmio area.

We could, of course, just report the HVM special pages as reserved, but
this would IMO be more hacky than reporting just the mmio area.

Oh yes, and the LAPIC, of course. Again part of mmio area.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH for-4.10] tools/libxl: mark hvm mmio area as reserved in e820 map

2017-11-17 Thread Juergen Gross
Make sure the HVM mmio area (especially console and Xenstore pages) is
marked as "reserved" in the guest's E820 map, as otherwise conflicts
might arise later, e.g. when hotplugging memory into the guest.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
This is a bugfix for PVH and HVM guests. Please consider for 4.10.
---
 tools/libxl/libxl_x86.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 5f91fe4f92..664bf8bd64 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -530,6 +530,9 @@ int libxl__arch_domain_construct_memmap(libxl__gc *gc,
 if (d_config->rdms[i].policy != LIBXL_RDM_RESERVE_POLICY_INVALID)
 e820_entries++;
 
+/* Add mmio entry. */
+if (dom->mmio_size)
+e820_entries++;
 
 /* If we should have a highmem range. */
 if (highmem_size)
@@ -564,6 +567,14 @@ int libxl__arch_domain_construct_memmap(libxl__gc *gc,
 nr++;
 }
 
+/* mmio area */
+if (dom->mmio_size) {
+e820[nr].addr = dom->mmio_start;
+e820[nr].size = dom->mmio_size;
+e820[nr].type = E820_RESERVED;
+nr++;
+}
+
 for (i = 0; i < MAX_ACPI_MODULES; i++) {
 if (dom->acpi_modules[i].length) {
 e820[nr].addr = dom->acpi_modules[i].guest_addr_out & ~(page_size 
- 1);
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 2/3] credit2: libxl related changes to add support for runqueue per cpupool.

2017-11-16 Thread Juergen Gross
On 16/11/17 22:10, Anshul Makkar wrote:
> [Trimming the Cc-list a bit]
> 
> 
> On 9/14/17 7:37 AM, Juergen Gross wrote:
>> On 12/09/17 02:45, anshulmakkar wrote:
>>> Introduces scheduler specific parameter at libxl level which are
>>> passed on to libxc. eg runqueue for credit2
>>>
>>> Signed-off-by: Anshul Makkar <anshulmak...@gmail.com>
>>>
>>>   int libxl_cpupool_destroy(libxl_ctx *ctx, uint32_t poolid);
>>>   int libxl_cpupool_rename(libxl_ctx *ctx, const char *name, uint32_t
>>> poolid);
>>>   int libxl_cpupool_cpuadd(libxl_ctx *ctx, uint32_t poolid, int cpu);
>>> diff --git a/tools/libxl/libxl_cpupool.c b/tools/libxl/libxl_cpupool.c
>>> index 85b0688..e3ce7b3 100644
>>> --- a/tools/libxl/libxl_cpupool.c
>>> +++ b/tools/libxl/libxl_cpupool.c
>>> @@ -130,7 +130,7 @@ int libxl_get_freecpus(libxl_ctx *ctx,
>>> libxl_bitmap *cpumap)
>>>   int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
>>>    libxl_scheduler sched,
>>>    libxl_bitmap cpumap, libxl_uuid *uuid,
>>> - uint32_t *poolid)
>>> + uint32_t *poolid, const
>>> libxl_scheduler_params *sched_params)
>>>   {
>>>   GC_INIT(ctx);
>>>   int rc;
>>> @@ -138,6 +138,7 @@ int libxl_cpupool_create(libxl_ctx *ctx, const
>>> char *name,
>>>   xs_transaction_t t;
>>>   char *uuid_string;
>>>   uint32_t xcpoolid;
>>> +    xc_schedparam_t xc_sched_param;
>>>     /* Accept '0' as 'any poolid' for backwards compatibility */
>>>   if ( *poolid == LIBXL_CPUPOOL_POOLID_ANY
>>> @@ -151,8 +152,18 @@ int libxl_cpupool_create(libxl_ctx *ctx, const
>>> char *name,
>>>   GC_FREE;
>>>   return ERROR_NOMEM;
>>>   }
>>> +    if (sched_params)
>>> +    {
>>> +    xc_sched_param.u.sched_credit2.ratelimit_us =
>>> +   
>>> sched_params->u.credit2.ratelimit_us;
>>> +    xc_sched_param.u.sched_credit2.runq =
>>> sched_params->u.credit2.runqueue;
>>> +    xc_sched_param.u.sched_credit.tslice_ms =
>>> sched_params->u.credit.tslice_ms;
>>> +    xc_sched_param.u.sched_credit.ratelimit_us =
>>> sched_params->u.credit.ratelimit_us;
>> Don't you need some input parameter validation here?
> Agree. Will perform validation.
>>> +    }
>>> +    else
>>> +    xc_sched_param.u.sched_credit2.runq =
>>> LIBXL_CREDIT2_RUNQUEUE_DEFAULT;
>> So you are passing the LIBXL defines down to the hypervisor expecting
>> they match. I think this is a major layering violation.
> I need to pass the DEFAULT runq arrangement if the user has not selected
> any option and I want to do it near to the top level (libxc) so that
> consistency
> can be maintained at the lower scheduler layer.
> Please can you suggest alternative that will maintain layering consistency.

So either have some glue code translating the LIBXL defines to the
hypervisor ones, or add some statements triggering a build failure in
case they don't match.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 02/13] x86/paravirt: Fix output constraint macro names

2017-11-16 Thread Juergen Gross
On 16/11/17 21:50, Josh Poimboeuf wrote:
> On Wed, Oct 25, 2017 at 11:33:43AM +0200, Juergen Gross wrote:
>> On 04/10/17 17:58, Josh Poimboeuf wrote:
>>> Some of the paravirt '*_CLOBBERS' macros refer to output constraints
>>> instead of clobbers, which makes the code extra confusing.  Rename the
>>> output constraint related macros to '*_OUTPUTS'.
>>>
>>> Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
>>
>> I'm fine with the changes, but you might want to rename the "call_clbr"
>> parameter of PVOP_[V]CALL, too, e.g. to "outputs".
> 
> Yeah, good catch.
> 
>> You could then drop the "CALL_" from the macros, too.
> 
> Hm, which macros are you referring to, and why?

Good question. I think I didn't take the *CALLEE* macros into account.
So please ignore this remark.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 10/13] x86/alternative: Support indirect call replacement

2017-11-16 Thread Juergen Gross
On 16/11/17 22:19, Josh Poimboeuf wrote:
> On Wed, Oct 25, 2017 at 01:25:02PM +0200, Juergen Gross wrote:
>> On 04/10/17 17:58, Josh Poimboeuf wrote:
>>> Add alternative patching support for replacing an instruction with an
>>> indirect call.  This will be needed for the paravirt alternatives.
>>>
>>> Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
>>> ---
>>>  arch/x86/kernel/alternative.c | 22 +++---
>>>  1 file changed, 15 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>>> index 3344d3382e91..81c577c7deba 100644
>>> --- a/arch/x86/kernel/alternative.c
>>> +++ b/arch/x86/kernel/alternative.c
>>> @@ -410,20 +410,28 @@ void __init_or_module noinline 
>>> apply_alternatives(struct alt_instr *start,
>>> insnbuf_sz = a->replacementlen;
>>>  
>>> /*
>>> -* 0xe8 is a relative jump; fix the offset.
>>> -*
>>> -* Instruction length is checked before the opcode to avoid
>>> -* accessing uninitialized bytes for zero-length replacements.
>>> +* Fix the address offsets for call and jump instructions which
>>> +* use PC-relative addressing.
>>>  */
>>> if (a->replacementlen == 5 && *insnbuf == 0xe8) {
>>> +   /* direct call */
>>> *(s32 *)(insnbuf + 1) += replacement - instr;
>>> -   DPRINTK("Fix CALL offset: 0x%x, CALL 0x%lx",
>>> +   DPRINTK("Fix direct CALL offset: 0x%x, CALL 0x%lx",
>>> *(s32 *)(insnbuf + 1),
>>> (unsigned long)instr + *(s32 *)(insnbuf + 1) + 
>>> 5);
>>> -   }
>>>  
>>> -   if (a->replacementlen && is_jmp(replacement[0]))
>>> +   } else if (a->replacementlen == 6 && *insnbuf == 0xff &&
>>> +  *(insnbuf+1) == 0x15) {
>>> +   /* indirect call */
>>> +   *(s32 *)(insnbuf + 2) += replacement - instr;
>>> +   DPRINTK("Fix indirect CALL offset: 0x%x, CALL *0x%lx",
>>> +   *(s32 *)(insnbuf + 2),
>>> +   (unsigned long)instr + *(s32 *)(insnbuf + 2) + 
>>> 6);
>>> +
>>> +   } else if (a->replacementlen && is_jmp(replacement[0])) {
>>
>> Is this correct? Without your patch this was:
>>
>> if (*insnbuf == 0xe8) ...
>> if (is_jmp(replacement[0])) ...
>>
>> Now you have:
>>
>> if (*insnbuf == 0xe8) ...
>> else if (*insnbuf == 0xff15) ...
>> else if (is_jmp(replacement[0])) ...
>>
>> So only one or none of the three variants will be executed. In the past
>> it could be none, one or both.
> 
> It can't be a call *and* a jump.  It's either one or the other.
> 
> Maybe it's a little confusing that the jump check uses replacement[0]
> while the other checks use *insnbuf?  They have the same value, so the

Right, I was fooled by that.

> same variable should probably be used everywhere for consistency.  I can
> make them more consistent.
> 

I'd appreciate that. :-)


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [GIT PULL] xen: features and fixes for v4.15-rc1

2017-11-16 Thread Juergen Gross
Linus,

Please git pull the following tag:

 git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git 
for-linus-4.15-rc1-tag

xen: features and fixes for v4.15-rc1

Apart from several small fixes it contains the following features:

- a series by Joao Martins to add vdso support of the pv clock interface
- a series by Juergen Gross to add support for Xen pv guests to be able
  to run on 5 level paging hosts
- a series by Stefano Stabellini adding the Xen pvcalls frontend driver
  using a paravirtualized socket interface


Thanks.

Juergen

 MAINTAINERS|2 +
 arch/arm/xen/grant-table.c |9 +-
 arch/x86/entry/vdso/vma.c  |2 +-
 arch/x86/include/asm/pvclock.h |   19 +-
 arch/x86/include/asm/xen/cpuid.h   |   42 +-
 arch/x86/include/asm/xen/page.h|   11 +-
 arch/x86/kernel/kvmclock.c |7 +-
 arch/x86/kernel/pvclock.c  |   14 +
 arch/x86/xen/grant-table.c |   60 +-
 arch/x86/xen/mmu.c |   14 +-
 arch/x86/xen/mmu_pv.c  |4 +-
 arch/x86/xen/suspend.c |4 +
 arch/x86/xen/time.c|   99 ++-
 arch/x86/xen/xen-ops.h |2 +
 drivers/ptp/ptp_kvm.c  |5 +-
 drivers/xen/Kconfig|   11 +
 drivers/xen/Makefile   |1 +
 drivers/xen/grant-table.c  |  244 +-
 drivers/xen/manage.c   |7 +-
 drivers/xen/privcmd.c  |3 -
 drivers/xen/pvcalls-back.c |4 +
 drivers/xen/pvcalls-front.c| 1278 
 drivers/xen/pvcalls-front.h|   28 +
 drivers/xen/time.c |   72 +-
 drivers/xen/xenbus/xenbus_probe_frontend.c |2 +
 include/xen/grant_table.h  |5 +-
 include/xen/interface/vcpu.h   |   42 +
 include/xen/xen-ops.h  |   25 +
 28 files changed, 1941 insertions(+), 75 deletions(-)

Boris Ostrovsky (2):
  xen/time: Return -ENODEV from xen_get_wallclock()
  xen/pvcalls: Add MODULE_LICENSE()

Colin Ian King (3):
  xen/pvcalls: fix unsigned less than zero error check
  xen/pvcalls: remove redundant check for irq >= 0
  xen/privcmd: remove unused variable pageidx

Dongli Zhang (1):
  xen/time: do not decrease steal time after live migration on xen

Gustavo A. R. Silva (2):
  xen: xenbus_probe_frontend: mark expected switch fall-throughs
  xen/pvcalls-front: mark expected switch fall-through

Joao Martins (5):
  ptp_kvm: probe for kvm guest availability
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: set pvclock flags on xen_time_init()
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

Juergen Gross (6):
  xen: support 52 bit physical addresses in pv guests
  xen: re-introduce support for grant v2 interface
  xen: limit grant v2 interface to the v1 functionality
  xen: add grant interface version dependent constants to gnttab_ops
  xen: update arch/x86/include/asm/xen/cpuid.h
  xen: select grant interface version

Paul Durrant (1):
  xen: support priv-mapping in an HVM tools domain

Stefano Stabellini (14):
  xen/pvcalls: introduce the pvcalls xenbus frontend
  xen/pvcalls: implement frontend disconnect
  xen/pvcalls: connect to the backend
  xen/pvcalls: implement socket command and handle events
  xen/pvcalls: implement connect command
  xen/pvcalls: implement bind command
  xen/pvcalls: implement listen command
  xen/pvcalls: implement accept command
  xen/pvcalls: implement sendmsg
  xen/pvcalls: implement recvmsg
  xen/pvcalls: implement poll command
  xen/pvcalls: implement release command
  xen: introduce a Kconfig option to enable the pvcalls frontend
  xen/pvcalls: fix potential endless loop in pvcalls-front.c

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: fix potential endless loop in pvcalls-front.c

2017-11-15 Thread Juergen Gross
On 15/11/17 22:20, Stefano Stabellini wrote:
> On Wed, 15 Nov 2017, Boris Ostrovsky wrote:
>> On 11/15/2017 02:09 PM, Stefano Stabellini wrote:
>>> On Wed, 15 Nov 2017, Juergen Gross wrote:
>>>>>>> while(mutex_is_locked(>active.in_mutex.owner) ||
>>>>>>>   mutex_is_locked(>active.out_mutex.owner))
>>>>>>> cpu_relax();
>>>>>>>
>>>>>>> ?
>>>>>> I'm not convinced there isn't a race.
>>>>>>
>>>>>> In pvcalls_front_recvmsg() sock->sk->sk_send_head is being read and only
>>>>>> then in_mutex is taken. What happens if pvcalls_front_release() resets
>>>>>> sk_send_head and manages to test the mutex before the mutex is locked?
>>>>>>
>>>>>> Even in case this is impossible: the whole construct seems to be rather
>>>>>> fragile.
>>> I agree it looks fragile, and I agree that it might be best to avoid the
>>> usage of in_mutex and out_mutex as refcounts. More comments on this
>>> below.
>>>
>>>  
>>>>> I think we can wait until pvcalls_refcount is 1 (i.e. it's only us) and
>>>>> not rely on mutex state.
>>>> Yes, this would work.
>>> Yes, I agree it would work and for the sake of getting something in
>>> shape for the merge window I am attaching a patch for it. Please go
>>> ahead with it. Let me know if you need anything else immediately, and
>>> I'll work on it ASAP.
>>>
>>>
>>>
>>> However, I should note that this is a pretty big hammer we are using:
>>> the refcount is global, while we only need to wait until it's only us
>>> _on this specific socket_.
>>
>> Can you explain why socket is important?
> 
> Yes, of course: there are going to be many open sockets on a given
> pvcalls connection. pvcalls_refcount is global: waiting on
> pvcalls_refcount means waiting until any operations on any unrelated
> sockets stop. While we only need to wait until the operations on the one
> socket we want to close stop.
> 
> 
>>>
>>> We really need a per socket refcount. If we don't want to use the mutex
>>> internal counters, then we need another one.
>>>
>>> See the appended patch that introduces a per socket refcount. However,
>>> for the merge window, also using pvcalls_refcount is fine.
>>>
>>> The race Juergen is concerned about is only theoretically possible:
>>>
>>> recvmsg: release:
>>>   
>>>   test sk_send_head  clear sk_send_head
>>>   
>>>   grab in_mutex  
>>>  
>>>  test in_mutex
>>>
>>> Without kernel preemption is not possible for release to clear
>>> sk_send_head and test in_mutex after recvmsg tests sk_send_head and
>>> before recvmsg grabs in_mutex.
>>
>> Sorry, I don't follow --- what does preemption have to do with this? If
>> recvmsg and release happen on different processors the order of
>> operations can be
>>
>> CPU0   CPU1
>>
>> test sk_send_head
>> 
>> clear sk_send_head
>> 
>> 
>> test in_mutex
>> free everything
>> grab in_mutex
>>
>> I actually think RCU should take care of all of this.
> 
> Preemption could cause something very similar to happen, but your
> example is very good too, even better, because it could trigger the
> issue even with preemption disabled. I'll think more about this and
> submit a separate patch on top of the simple pvcalls_refcount patch
> below.

We are running as a guest. Even with interrupts off the vcpu could be
off the pcpu for several milliseconds!

Don't count on code length to avoid races!


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: Add MODULE_LICENSE()

2017-11-15 Thread Juergen Gross
On 15/11/17 17:37, Boris Ostrovsky wrote:
> Since commit ba1029c9cbc5 ("modpost: detect modules without a
> MODULE_LICENSE") modules without said macro will generate
> 
> WARNING: modpost: missing MODULE_LICENSE() in 
> 
> While at it, also add module description and attribution.
> 
> Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [OSSTEST PATCH] ts-xen-build-prep: Install libelf-dev for benefit of linux.git

2017-11-15 Thread Juergen Gross
On 15/11/17 12:11, Ian Jackson wrote:
> Linux upstream has started needing libelf-dev.  Without it, recent tip
> fails (in our configuration) like this:
> 
>  Makefile:938: *** "Cannot generate ORC metadata for CONFIG_UNWINDER_ORC=y, 
> please install libelf-dev, libelf-devel or elfutils-libelf-devel".  Stop.

The kernel now is using objtool to create unwind information. This needs
libelf to work. Advantage is that this approach no longer depends on
assembler sources being heavily annotated with unwind hints.

> It is not clear exactly when this requirement was introduced.  Our
> bisector said:
>   Bug introduced:  91a6a6cfee8ad34ea4cc10a54c0765edfe437cdb
>   Bug not present: 1c9dbd4615fd751e5e0b99807a3c7c8612e28e20
> but the "introduced" commit is a merge of a large branch, so it's not
> blaming a specific commit.  None of the commits in that range mention
> libelf so the most likely reason is a consequence of a change to some
> configuration interactions (ie, probably, an expansion of the scope of
> an existing dependency).
> 
> CC: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
> CC: Stefano Stabellini <sstabell...@kernel.org>
> CC: Boris Ostrovsky <boris.ostrov...@oracle.com>
> CC: Juergen Gross <jgr...@suse.com>
> CC: Paul Durrant <paul.durr...@citrix.com>
> CC: Wei Liu <wei.l...@citrix.com>
> Signed-off-by: Ian Jackson <ian.jack...@eu.citrix.com>

Acked-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: fix potential endless loop in pvcalls-front.c

2017-11-15 Thread Juergen Gross
On 14/11/17 22:46, Boris Ostrovsky wrote:
> On 11/14/2017 04:11 AM, Juergen Gross wrote:
>> On 13/11/17 19:33, Stefano Stabellini wrote:
>>> On Mon, 13 Nov 2017, Juergen Gross wrote:
>>>> On 11/11/17 00:57, Stefano Stabellini wrote:
>>>>> On Tue, 7 Nov 2017, Juergen Gross wrote:
>>>>>> On 06/11/17 23:17, Stefano Stabellini wrote:
>>>>>>> mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
>>>>>>> take in_mutex on the first try, but you can't take out_mutex. Next times
>>>>>>> you call mutex_trylock() in_mutex is going to fail. It's an endless
>>>>>>> loop.
>>>>>>>
>>>>>>> Solve the problem by moving the two mutex_trylock calls to two separate
>>>>>>> loops.
>>>>>>>
>>>>>>> Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
>>>>>>> Signed-off-by: Stefano Stabellini <sstabell...@kernel.org>
>>>>>>> CC: boris.ostrov...@oracle.com
>>>>>>> CC: jgr...@suse.com
>>>>>>> ---
>>>>>>>  drivers/xen/pvcalls-front.c | 5 +++--
>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
>>>>>>> index 0c1ec68..047dce7 100644
>>>>>>> --- a/drivers/xen/pvcalls-front.c
>>>>>>> +++ b/drivers/xen/pvcalls-front.c
>>>>>>> @@ -1048,8 +1048,9 @@ int pvcalls_front_release(struct socket *sock)
>>>>>>>  * is set to NULL -- we only need to wait for the 
>>>>>>> existing
>>>>>>>  * waiters to return.
>>>>>>>  */
>>>>>>> -   while (!mutex_trylock(>active.in_mutex) ||
>>>>>>> -  !mutex_trylock(>active.out_mutex))
>>>>>>> +   while (!mutex_trylock(>active.in_mutex))
>>>>>>> +   cpu_relax();
>>>>>>> +   while (!mutex_trylock(>active.out_mutex))
>>>>>>> cpu_relax();
>>>>>> Any reason you don't just use mutex_lock()?
>>>>> Hi Juergen, sorry for the late reply.
>>>>>
>>>>> Yes, you are right. Given the patch, it would be just the same to use
>>>>> mutex_lock.
>>>>>
>>>>> This is where I realized that actually we have a problem: no matter if
>>>>> we use mutex_lock or mutex_trylock, there are no guarantees that we'll
>>>>> be the last to take the in/out_mutex. Other waiters could be still
>>>>> outstanding.
>>>>>
>>>>> We solved the same problem using a refcount in pvcalls_front_remove. In
>>>>> this case, I was thinking of reusing the mutex internal counter for
>>>>> efficiency, instead of adding one more refcount.
>>>>>
>>>>> For using the mutex as a refcount, there is really no need to call
>>>>> mutex_trylock or mutex_lock. I suggest checking on the mutex counter
>>>>> directly:
>>>>>
>>>>>
>>>>>   while (atomic_long_read(>active.in_mutex.owner) != 0UL ||
>>>>>  atomic_long_read(>active.out_mutex.owner) != 0UL)
>>>>>   cpu_relax();
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Stefano
>>>>>
>>>>>
>>>>> ---
>>>>>
>>>>> xen/pvcalls: fix potential endless loop in pvcalls-front.c
>>>>>
>>>>> mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
>>>>> take in_mutex on the first try, but you can't take out_mutex. Next time
>>>>> you call mutex_trylock() in_mutex is going to fail. It's an endless
>>>>> loop.
>>>>>
>>>>> Actually, we don't want to use mutex_trylock at all: we don't need to
>>>>> take the mutex, we only need to wait until the last mutex waiter/holder
>>>>> releases it.
>>>>>
>>>>> Instead of calling mutex_trylock or mutex_lock, just check on the mutex
>>>>> refcount instead.
>>>>>
>>>>> Reported-by: Dan Carpenter <da

Re: [Xen-devel] [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops

2017-11-14 Thread Juergen Gross
On 14/11/17 12:43, Quan Xu wrote:
> 
> 
> On 2017/11/14 18:27, Juergen Gross wrote:
>> On 14/11/17 10:38, Quan Xu wrote:
>>>
>>> On 2017/11/14 15:30, Juergen Gross wrote:
>>>> On 14/11/17 08:02, Quan Xu wrote:
>>>>> On 2017/11/13 18:53, Juergen Gross wrote:
>>>>>> On 13/11/17 11:06, Quan Xu wrote:
>>>>>>> From: Quan Xu <quan@gmail.com>
>>>>>>>
>>>>>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is
>>>>>>> called
>>>>>>> in idle path which will poll for a while before we enter the real
>>>>>>> idle
>>>>>>> state.
>>>>>>>
>>>>>>> In virtualization, idle path includes several heavy operations
>>>>>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>>>>>> hurt performance especially for latency intensive workload like
>>>>>>> message
>>>>>>> passing task. The cost is mainly from the vmexit which is a hardware
>>>>>>> context switch between virtual machine and hypervisor. Our
>>>>>>> solution is
>>>>>>> to poll for a while and do not enter real idle path if we can get
>>>>>>> the
>>>>>>> schedule event during polling.
>>>>>>>
>>>>>>> Poll may cause the CPU waste so we adopt a smart polling
>>>>>>> mechanism to
>>>>>>> reduce the useless poll.
>>>>>>>
>>>>>>> Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
>>>>>>> Signed-off-by: Quan Xu <quan@gmail.com>
>>>>>>> Cc: Juergen Gross <jgr...@suse.com>
>>>>>>> Cc: Alok Kataria <akata...@vmware.com>
>>>>>>> Cc: Rusty Russell <ru...@rustcorp.com.au>
>>>>>>> Cc: Thomas Gleixner <t...@linutronix.de>
>>>>>>> Cc: Ingo Molnar <mi...@redhat.com>
>>>>>>> Cc: "H. Peter Anvin" <h...@zytor.com>
>>>>>>> Cc: x...@kernel.org
>>>>>>> Cc: virtualizat...@lists.linux-foundation.org
>>>>>>> Cc: linux-ker...@vger.kernel.org
>>>>>>> Cc: xen-de...@lists.xenproject.org
>>>>>> Hmm, is the idle entry path really so critical to performance that a
>>>>>> new
>>>>>> pvops function is necessary?
>>>>> Juergen, Here is the data we get when running benchmark netperf:
>>>>>    1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>>   29031.6 bit/s -- 76.1 %CPU
>>>>>
>>>>>    2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>>   35787.7 bit/s -- 129.4 %CPU
>>>>>
>>>>>    3. w/ kvm dynamic poll:
>>>>>   35735.6 bit/s -- 200.0 %CPU
>>>>>
>>>>>    4. w/patch and w/ kvm dynamic poll:
>>>>>   42225.3 bit/s -- 198.7 %CPU
>>>>>
>>>>>    5. idle=poll
>>>>>   37081.7 bit/s -- 998.1 %CPU
>>>>>
>>>>>
>>>>>
>>>>>    w/ this patch, we will improve performance by 23%.. even we could
>>>>> improve
>>>>>    performance by 45.4%, if we use w/patch and w/ kvm dynamic poll.
>>>>> also the
>>>>>    cost of CPU is much lower than 'idle=poll' case..
>>>> I don't question the general idea. I just think pvops isn't the best
>>>> way
>>>> to implement it.
>>>>
>>>>>> Wouldn't a function pointer, maybe guarded
>>>>>> by a static key, be enough? A further advantage would be that this
>>>>>> would
>>>>>> work on other architectures, too.
>>>>> I assume this feature will be ported to other archs.. a new pvops
>>>>> makes
>>>    sorry, a typo.. /other archs/other hypervisors/
>>>    it refers hypervisor like Xen, HyperV and VMware)..
>>>
>>>>> code
>>>>> clean and easy to maintain. also I tried to add it into existed pvops,
>>>>> but it
>>>>> doesn't match.
>>>> You are aware that pvops is x86 only?
>>> yes, I'm aware..
>>>
>>>> I really don't see the big difference in maintainability compared

Re: [Xen-devel] [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops

2017-11-14 Thread Juergen Gross
On 14/11/17 10:38, Quan Xu wrote:
> 
> 
> On 2017/11/14 15:30, Juergen Gross wrote:
>> On 14/11/17 08:02, Quan Xu wrote:
>>>
>>> On 2017/11/13 18:53, Juergen Gross wrote:
>>>> On 13/11/17 11:06, Quan Xu wrote:
>>>>> From: Quan Xu <quan@gmail.com>
>>>>>
>>>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>>>> in idle path which will poll for a while before we enter the real idle
>>>>> state.
>>>>>
>>>>> In virtualization, idle path includes several heavy operations
>>>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>>>> hurt performance especially for latency intensive workload like
>>>>> message
>>>>> passing task. The cost is mainly from the vmexit which is a hardware
>>>>> context switch between virtual machine and hypervisor. Our solution is
>>>>> to poll for a while and do not enter real idle path if we can get the
>>>>> schedule event during polling.
>>>>>
>>>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>>>> reduce the useless poll.
>>>>>
>>>>> Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
>>>>> Signed-off-by: Quan Xu <quan@gmail.com>
>>>>> Cc: Juergen Gross <jgr...@suse.com>
>>>>> Cc: Alok Kataria <akata...@vmware.com>
>>>>> Cc: Rusty Russell <ru...@rustcorp.com.au>
>>>>> Cc: Thomas Gleixner <t...@linutronix.de>
>>>>> Cc: Ingo Molnar <mi...@redhat.com>
>>>>> Cc: "H. Peter Anvin" <h...@zytor.com>
>>>>> Cc: x...@kernel.org
>>>>> Cc: virtualizat...@lists.linux-foundation.org
>>>>> Cc: linux-ker...@vger.kernel.org
>>>>> Cc: xen-de...@lists.xenproject.org
>>>> Hmm, is the idle entry path really so critical to performance that a
>>>> new
>>>> pvops function is necessary?
>>> Juergen, Here is the data we get when running benchmark netperf:
>>>   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>  29031.6 bit/s -- 76.1 %CPU
>>>
>>>   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>  35787.7 bit/s -- 129.4 %CPU
>>>
>>>   3. w/ kvm dynamic poll:
>>>  35735.6 bit/s -- 200.0 %CPU
>>>
>>>   4. w/patch and w/ kvm dynamic poll:
>>>  42225.3 bit/s -- 198.7 %CPU
>>>
>>>   5. idle=poll
>>>  37081.7 bit/s -- 998.1 %CPU
>>>
>>>
>>>
>>>   w/ this patch, we will improve performance by 23%.. even we could
>>> improve
>>>   performance by 45.4%, if we use w/patch and w/ kvm dynamic poll.
>>> also the
>>>   cost of CPU is much lower than 'idle=poll' case..
>> I don't question the general idea. I just think pvops isn't the best way
>> to implement it.
>>
>>>> Wouldn't a function pointer, maybe guarded
>>>> by a static key, be enough? A further advantage would be that this
>>>> would
>>>> work on other architectures, too.
>>> I assume this feature will be ported to other archs.. a new pvops makes
> 
>   sorry, a typo.. /other archs/other hypervisors/
>   it refers hypervisor like Xen, HyperV and VMware)..
> 
>>> code
>>> clean and easy to maintain. also I tried to add it into existed pvops,
>>> but it
>>> doesn't match.
>> You are aware that pvops is x86 only?
> 
> yes, I'm aware..
> 
>> I really don't see the big difference in maintainability compared to the
>> static key / function pointer variant:
>>
>> void (*guest_idle_poll_func)(void);
>> struct static_key guest_idle_poll_key __read_mostly;
>>
>> static inline void guest_idle_poll(void)
>> {
>> if (static_key_false(_idle_poll_key))
>>     guest_idle_poll_func();
>> }
> 
> 
> 
> thank you for your sample code :)
> I agree there is no big difference.. I think we are discussion for two
> things:
>  1) x86 VM on different hypervisors
>  2) different archs VM on kvm hypervisor
> 
> What I want to do is x86 VM on different hypervisors, such as kvm / xen
> / hyperv ..

Why limit the solution to x86 if the more general solution isn't
harder?

As you didn't give any reason why the pvops approach is better other
than you don't care for non-x86 platforms you won't get an "Ack" from
me for this patch.

> 
>> And KVM would just need to set guest_idle_poll_func and enable the
>> static key. Works on non-x86 architectures, too.
>>
> 
> .. referred to 'pv_mmu_ops', HyperV and Xen can implement their own
> functions for 'pv_mmu_ops'.
> I think it is the same to pv_idle_ops.
> 
> with above explaination, do you still think I need to define the static
> key/function pointer variant?
> 
> btw, any interest to port it to Xen HVM guest? :)

Maybe. But this should work for Xen on ARM, too.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: fix potential endless loop in pvcalls-front.c

2017-11-14 Thread Juergen Gross
On 13/11/17 19:33, Stefano Stabellini wrote:
> On Mon, 13 Nov 2017, Juergen Gross wrote:
>> On 11/11/17 00:57, Stefano Stabellini wrote:
>>> On Tue, 7 Nov 2017, Juergen Gross wrote:
>>>> On 06/11/17 23:17, Stefano Stabellini wrote:
>>>>> mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
>>>>> take in_mutex on the first try, but you can't take out_mutex. Next times
>>>>> you call mutex_trylock() in_mutex is going to fail. It's an endless
>>>>> loop.
>>>>>
>>>>> Solve the problem by moving the two mutex_trylock calls to two separate
>>>>> loops.
>>>>>
>>>>> Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
>>>>> Signed-off-by: Stefano Stabellini <sstabell...@kernel.org>
>>>>> CC: boris.ostrov...@oracle.com
>>>>> CC: jgr...@suse.com
>>>>> ---
>>>>>  drivers/xen/pvcalls-front.c | 5 +++--
>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
>>>>> index 0c1ec68..047dce7 100644
>>>>> --- a/drivers/xen/pvcalls-front.c
>>>>> +++ b/drivers/xen/pvcalls-front.c
>>>>> @@ -1048,8 +1048,9 @@ int pvcalls_front_release(struct socket *sock)
>>>>>* is set to NULL -- we only need to wait for the existing
>>>>>* waiters to return.
>>>>>*/
>>>>> - while (!mutex_trylock(>active.in_mutex) ||
>>>>> -!mutex_trylock(>active.out_mutex))
>>>>> + while (!mutex_trylock(>active.in_mutex))
>>>>> + cpu_relax();
>>>>> + while (!mutex_trylock(>active.out_mutex))
>>>>>   cpu_relax();
>>>>
>>>> Any reason you don't just use mutex_lock()?
>>>
>>> Hi Juergen, sorry for the late reply.
>>>
>>> Yes, you are right. Given the patch, it would be just the same to use
>>> mutex_lock.
>>>
>>> This is where I realized that actually we have a problem: no matter if
>>> we use mutex_lock or mutex_trylock, there are no guarantees that we'll
>>> be the last to take the in/out_mutex. Other waiters could be still
>>> outstanding.
>>>
>>> We solved the same problem using a refcount in pvcalls_front_remove. In
>>> this case, I was thinking of reusing the mutex internal counter for
>>> efficiency, instead of adding one more refcount.
>>>
>>> For using the mutex as a refcount, there is really no need to call
>>> mutex_trylock or mutex_lock. I suggest checking on the mutex counter
>>> directly:
>>>
>>>
>>> while (atomic_long_read(>active.in_mutex.owner) != 0UL ||
>>>atomic_long_read(>active.out_mutex.owner) != 0UL)
>>> cpu_relax();
>>>
>>> Cheers,
>>>
>>> Stefano
>>>
>>>
>>> ---
>>>
>>> xen/pvcalls: fix potential endless loop in pvcalls-front.c
>>>
>>> mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
>>> take in_mutex on the first try, but you can't take out_mutex. Next time
>>> you call mutex_trylock() in_mutex is going to fail. It's an endless
>>> loop.
>>>
>>> Actually, we don't want to use mutex_trylock at all: we don't need to
>>> take the mutex, we only need to wait until the last mutex waiter/holder
>>> releases it.
>>>
>>> Instead of calling mutex_trylock or mutex_lock, just check on the mutex
>>> refcount instead.
>>>
>>> Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
>>> Signed-off-by: Stefano Stabellini <sstabell...@kernel.org>
>>> CC: boris.ostrov...@oracle.com
>>> CC: jgr...@suse.com
>>>
>>> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
>>> index 0c1ec68..9f33cb8 100644
>>> --- a/drivers/xen/pvcalls-front.c
>>> +++ b/drivers/xen/pvcalls-front.c
>>> @@ -1048,8 +1048,8 @@ int pvcalls_front_release(struct socket *sock)
>>>  * is set to NULL -- we only need to wait for the existing
>>>  * waiters to return.
>>>  */
>>> -   while (!mutex_trylock(>active.in_mutex) ||
>>> -  !mutex_trylock(>acti

Re: [Xen-devel] [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops

2017-11-13 Thread Juergen Gross
On 14/11/17 08:02, Quan Xu wrote:
> 
> 
> On 2017/11/13 18:53, Juergen Gross wrote:
>> On 13/11/17 11:06, Quan Xu wrote:
>>> From: Quan Xu <quan@gmail.com>
>>>
>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>> in idle path which will poll for a while before we enter the real idle
>>> state.
>>>
>>> In virtualization, idle path includes several heavy operations
>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>> hurt performance especially for latency intensive workload like message
>>> passing task. The cost is mainly from the vmexit which is a hardware
>>> context switch between virtual machine and hypervisor. Our solution is
>>> to poll for a while and do not enter real idle path if we can get the
>>> schedule event during polling.
>>>
>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>> reduce the useless poll.
>>>
>>> Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
>>> Signed-off-by: Quan Xu <quan@gmail.com>
>>> Cc: Juergen Gross <jgr...@suse.com>
>>> Cc: Alok Kataria <akata...@vmware.com>
>>> Cc: Rusty Russell <ru...@rustcorp.com.au>
>>> Cc: Thomas Gleixner <t...@linutronix.de>
>>> Cc: Ingo Molnar <mi...@redhat.com>
>>> Cc: "H. Peter Anvin" <h...@zytor.com>
>>> Cc: x...@kernel.org
>>> Cc: virtualizat...@lists.linux-foundation.org
>>> Cc: linux-ker...@vger.kernel.org
>>> Cc: xen-de...@lists.xenproject.org
>> Hmm, is the idle entry path really so critical to performance that a new
>> pvops function is necessary?
> Juergen, Here is the data we get when running benchmark netperf:
>  1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>     29031.6 bit/s -- 76.1 %CPU
> 
>  2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>     35787.7 bit/s -- 129.4 %CPU
> 
>  3. w/ kvm dynamic poll:
>     35735.6 bit/s -- 200.0 %CPU
> 
>  4. w/patch and w/ kvm dynamic poll:
>     42225.3 bit/s -- 198.7 %CPU
> 
>  5. idle=poll
>     37081.7 bit/s -- 998.1 %CPU
> 
> 
> 
>  w/ this patch, we will improve performance by 23%.. even we could improve
>  performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the
>  cost of CPU is much lower than 'idle=poll' case..

I don't question the general idea. I just think pvops isn't the best way
to implement it.

>> Wouldn't a function pointer, maybe guarded
>> by a static key, be enough? A further advantage would be that this would
>> work on other architectures, too.
> 
> I assume this feature will be ported to other archs.. a new pvops makes
> code
> clean and easy to maintain. also I tried to add it into existed pvops,
> but it
> doesn't match.

You are aware that pvops is x86 only?

I really don't see the big difference in maintainability compared to the
static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
if (static_key_false(_idle_poll_key))
guest_idle_poll_func();
}

And KVM would just need to set guest_idle_poll_func and enable the
static key. Works on non-x86 architectures, too.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: fix potential endless loop in pvcalls-front.c

2017-11-13 Thread Juergen Gross
On 11/11/17 00:57, Stefano Stabellini wrote:
> On Tue, 7 Nov 2017, Juergen Gross wrote:
>> On 06/11/17 23:17, Stefano Stabellini wrote:
>>> mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
>>> take in_mutex on the first try, but you can't take out_mutex. Next times
>>> you call mutex_trylock() in_mutex is going to fail. It's an endless
>>> loop.
>>>
>>> Solve the problem by moving the two mutex_trylock calls to two separate
>>> loops.
>>>
>>> Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
>>> Signed-off-by: Stefano Stabellini <sstabell...@kernel.org>
>>> CC: boris.ostrov...@oracle.com
>>> CC: jgr...@suse.com
>>> ---
>>>  drivers/xen/pvcalls-front.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
>>> index 0c1ec68..047dce7 100644
>>> --- a/drivers/xen/pvcalls-front.c
>>> +++ b/drivers/xen/pvcalls-front.c
>>> @@ -1048,8 +1048,9 @@ int pvcalls_front_release(struct socket *sock)
>>>  * is set to NULL -- we only need to wait for the existing
>>>  * waiters to return.
>>>  */
>>> -   while (!mutex_trylock(>active.in_mutex) ||
>>> -  !mutex_trylock(>active.out_mutex))
>>> +   while (!mutex_trylock(>active.in_mutex))
>>> +   cpu_relax();
>>> +   while (!mutex_trylock(>active.out_mutex))
>>> cpu_relax();
>>
>> Any reason you don't just use mutex_lock()?
> 
> Hi Juergen, sorry for the late reply.
> 
> Yes, you are right. Given the patch, it would be just the same to use
> mutex_lock.
> 
> This is where I realized that actually we have a problem: no matter if
> we use mutex_lock or mutex_trylock, there are no guarantees that we'll
> be the last to take the in/out_mutex. Other waiters could be still
> outstanding.
> 
> We solved the same problem using a refcount in pvcalls_front_remove. In
> this case, I was thinking of reusing the mutex internal counter for
> efficiency, instead of adding one more refcount.
> 
> For using the mutex as a refcount, there is really no need to call
> mutex_trylock or mutex_lock. I suggest checking on the mutex counter
> directly:
> 
> 
>   while (atomic_long_read(>active.in_mutex.owner) != 0UL ||
>  atomic_long_read(>active.out_mutex.owner) != 0UL)
>   cpu_relax();
> 
> Cheers,
> 
> Stefano
> 
> 
> ---
> 
> xen/pvcalls: fix potential endless loop in pvcalls-front.c
> 
> mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
> take in_mutex on the first try, but you can't take out_mutex. Next time
> you call mutex_trylock() in_mutex is going to fail. It's an endless
> loop.
> 
> Actually, we don't want to use mutex_trylock at all: we don't need to
> take the mutex, we only need to wait until the last mutex waiter/holder
> releases it.
> 
> Instead of calling mutex_trylock or mutex_lock, just check on the mutex
> refcount instead.
> 
> Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
> Signed-off-by: Stefano Stabellini <sstabell...@kernel.org>
> CC: boris.ostrov...@oracle.com
> CC: jgr...@suse.com
> 
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 0c1ec68..9f33cb8 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -1048,8 +1048,8 @@ int pvcalls_front_release(struct socket *sock)
>* is set to NULL -- we only need to wait for the existing
>* waiters to return.
>*/
> - while (!mutex_trylock(>active.in_mutex) ||
> -!mutex_trylock(>active.out_mutex))
> + while (atomic_long_read(>active.in_mutex.owner) != 0UL ||
> +atomic_long_read(>active.out_mutex.owner) != 0UL)

I don't like this.

Can't you use a kref here? Even if it looks like more overhead it is
much cleaner. There will be no questions regarding possible races,
while with an approach like yours will always smell racy (can't there
be someone taking the mutex just after above test?).

In no case you should make use of the mutex internals.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops

2017-11-13 Thread Juergen Gross
On 13/11/17 11:06, Quan Xu wrote:
> From: Quan Xu <quan@gmail.com>
> 
> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
> in idle path which will poll for a while before we enter the real idle
> state.
> 
> In virtualization, idle path includes several heavy operations
> includes timer access(LAPIC timer or TSC deadline timer) which will
> hurt performance especially for latency intensive workload like message
> passing task. The cost is mainly from the vmexit which is a hardware
> context switch between virtual machine and hypervisor. Our solution is
> to poll for a while and do not enter real idle path if we can get the
> schedule event during polling.
> 
> Poll may cause the CPU waste so we adopt a smart polling mechanism to
> reduce the useless poll.
> 
> Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
> Signed-off-by: Quan Xu <quan@gmail.com>
> Cc: Juergen Gross <jgr...@suse.com>
> Cc: Alok Kataria <akata...@vmware.com>
> Cc: Rusty Russell <ru...@rustcorp.com.au>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: "H. Peter Anvin" <h...@zytor.com>
> Cc: x...@kernel.org
> Cc: virtualizat...@lists.linux-foundation.org
> Cc: linux-ker...@vger.kernel.org
> Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new
pvops function is necessary? Wouldn't a function pointer, maybe guarded
by a static key, be enough? A further advantage would be that this would
work on other architectures, too.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [xen-unstable test] 115555: regressions - FAIL

2017-11-10 Thread Juergen Gross
On 10/11/17 10:33, Roger Pau Monné wrote:
> On Sat, Nov 04, 2017 at 11:14:35PM +, osstest service owner wrote:
>> flight 11 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/11/
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-amd64-amd64-i386-pvgrub 11 guest-start  fail REGR. vs. 
>> 115526
>>  test-amd64-amd64-amd64-pvgrub 11 guest-start fail REGR. vs. 
>> 115526
> 
> Both of the above trigger an assertion, it seems to be in pvgrub code
> (because I doubt minios has anything kexec related):
> 
> root  (hd0,0)
> 
>  Filesystem type is ext2fs, partition type 0x83
> 
> kernel  /boot/vmlinuz-3.16.0-4-686-pae 
> root=UUID=482efee7-e0f0-453e-bb8e-000bfc
> 
> d5a822 ro 
> 
> initrd  /boot/initrd.img-3.16.0-4-686-pae
> 
> = Init TPM Front 
> Tpmfront:Error Unable to read device/vtpm/0/backend-id during tpmfront 
> initialization! error = ENOENT
> Tpmfront:Info Shutting down tpmfront
> close blk: backend=/local/domain/0/backend/vbd/2/51712 node=device/vbd/51712
> ASSERTION FAILED: 0 at kexec.c:418.

That's a failure of HYPERVISOR_mmu_update() just before the target
kernel is being executed (see stubdom/grub/kexec.c).


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 2/5] x86: add enum for hypervisors to replace x86_hyper

2017-11-09 Thread Juergen Gross
The x86_hyper pointer is only used for checking whether a virtual
device is supporting the hypervisor the system is running on.

Use an enum for that purpose instead and drop the x86_hyper pointer.

Cc: k...@microsoft.com
Cc: haiya...@microsoft.com
Cc: sthem...@microsoft.com
Cc: akata...@vmware.com
Cc: pbonz...@redhat.com
Cc: rkrc...@redhat.com
Cc: boris.ostrov...@oracle.com
Cc: de...@linuxdriverproject.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: linux-graphics-maintai...@vmware.com
Cc: pv-driv...@vmware.com
Cc: dmitry.torok...@gmail.com
Cc: xdeguill...@vmware.com
Cc: moltm...@vmware.com
Cc: a...@arndb.de
Cc: gre...@linuxfoundation.org
Cc: linux-in...@vger.kernel.org

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/x86/hyperv/hv_init.c |  2 +-
 arch/x86/include/asm/hypervisor.h | 23 ++-
 arch/x86/kernel/cpu/hypervisor.c  | 12 +---
 arch/x86/kernel/cpu/mshyperv.c|  4 ++--
 arch/x86/kernel/cpu/vmware.c  |  4 ++--
 arch/x86/kernel/kvm.c |  4 ++--
 arch/x86/xen/enlighten_hvm.c  |  4 ++--
 arch/x86/xen/enlighten_pv.c   |  4 ++--
 drivers/hv/vmbus_drv.c|  2 +-
 drivers/input/mouse/vmmouse.c | 10 --
 drivers/misc/vmw_balloon.c|  2 +-
 11 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index a5db63f728a2..a0b86cf486e0 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -113,7 +113,7 @@ void hyperv_init(void)
u64 guest_id;
union hv_x64_msr_hypercall_contents hypercall_msr;
 
-   if (x86_hyper != _hyper_ms_hyperv)
+   if (x86_hyper_type != X86_HYPER_MS_HYPERV)
return;
 
/* Allocate percpu VP index */
diff --git a/arch/x86/include/asm/hypervisor.h 
b/arch/x86/include/asm/hypervisor.h
index 0eca7239a7aa..1b0a5abcd8ae 100644
--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -29,6 +29,16 @@
 /*
  * x86 hypervisor information
  */
+
+enum x86_hypervisor_type {
+   X86_HYPER_NATIVE = 0,
+   X86_HYPER_VMWARE,
+   X86_HYPER_MS_HYPERV,
+   X86_HYPER_XEN_PV,
+   X86_HYPER_XEN_HVM,
+   X86_HYPER_KVM,
+};
+
 struct hypervisor_x86 {
/* Hypervisor name */
const char  *name;
@@ -36,6 +46,9 @@ struct hypervisor_x86 {
/* Detection routine */
uint32_t(*detect)(void);
 
+   /* Hypervisor type */
+   enum x86_hypervisor_type type;
+
/* init time callbacks */
struct x86_hyper_init init;
 
@@ -43,15 +56,7 @@ struct hypervisor_x86 {
struct x86_hyper_runtime runtime;
 };
 
-extern const struct hypervisor_x86 *x86_hyper;
-
-/* Recognized hypervisors */
-extern const struct hypervisor_x86 x86_hyper_vmware;
-extern const struct hypervisor_x86 x86_hyper_ms_hyperv;
-extern const struct hypervisor_x86 x86_hyper_xen_pv;
-extern const struct hypervisor_x86 x86_hyper_xen_hvm;
-extern const struct hypervisor_x86 x86_hyper_kvm;
-
+extern enum x86_hypervisor_type x86_hyper_type;
 extern void init_hypervisor_platform(void);
 #else
 static inline void init_hypervisor_platform(void) { }
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 6c1bf092..bea8d3e24f50 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -26,6 +26,12 @@
 #include 
 #include 
 
+extern const struct hypervisor_x86 x86_hyper_vmware;
+extern const struct hypervisor_x86 x86_hyper_ms_hyperv;
+extern const struct hypervisor_x86 x86_hyper_xen_pv;
+extern const struct hypervisor_x86 x86_hyper_xen_hvm;
+extern const struct hypervisor_x86 x86_hyper_kvm;
+
 static const __initconst struct hypervisor_x86 * const hypervisors[] =
 {
 #ifdef CONFIG_XEN_PV
@@ -41,8 +47,8 @@ static const __initconst struct hypervisor_x86 * const 
hypervisors[] =
 #endif
 };
 
-const struct hypervisor_x86 *x86_hyper;
-EXPORT_SYMBOL(x86_hyper);
+enum x86_hypervisor_type x86_hyper_type;
+EXPORT_SYMBOL(x86_hyper_type);
 
 static inline const struct hypervisor_x86 * __init
 detect_hypervisor_vendor(void)
@@ -87,6 +93,6 @@ void __init init_hypervisor_platform(void)
copy_array(>init, _init.hyper, sizeof(h->init));
copy_array(>runtime, _platform.hyper, sizeof(h->runtime));
 
-   x86_hyper = h;
+   x86_hyper_type = h->type;
x86_init.hyper.init_platform();
 }
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 6bb84d655e4b..85eb5fc180c8 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -254,9 +254,9 @@ static void __init ms_hyperv_init_platform(void)
 #endif
 }
 
-const __refconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
+const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
.name   = "Microsoft Hyper-V",
.detect = ms

[Xen-devel] [PATCH v2 5/5] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-09 Thread Juergen Gross
In case we are booted via the default boot entry by a generic loader
like grub or OVMF it is necessary to distinguish between a HVM guest
with a device model supporting legacy devices and a PVH guest without
device model.

PVH guests will always have x86_platform.legacy.no_vga set and
x86_platform.legacy.rtc cleared, while both won't be true for HVM
guests.

Test for both conditions in the guest_late_init hook and set xen_pvh
to true if they are met.

Move some of the early PVH initializations to the new hook in order
to avoid duplicated code.

Cc: boris.ostrov...@oracle.com
Cc: xen-de...@lists.xenproject.org

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/x86/xen/enlighten_hvm.c | 24 ++--
 arch/x86/xen/enlighten_pvh.c |  9 -
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 754d5391d9fa..826898701045 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -1,3 +1,4 @@
+#include 
 #include 
 #include 
 #include 
@@ -188,8 +189,6 @@ static void __init xen_hvm_guest_init(void)
xen_hvm_init_time_ops();
xen_hvm_init_mmu_ops();
 
-   if (xen_pvh_domain())
-   machine_ops.emergency_restart = xen_emergency_restart;
 #ifdef CONFIG_KEXEC_CORE
machine_ops.shutdown = xen_hvm_shutdown;
machine_ops.crash_shutdown = xen_hvm_crash_shutdown;
@@ -226,6 +225,26 @@ static uint32_t __init xen_platform_hvm(void)
return xen_cpuid_base();
 }
 
+static __init void xen_hvm_guest_late_init(void)
+{
+#ifdef CONFIG_XEN_PVH
+   /* Test for PVH domain (PVH boot path taken overrides ACPI flags). */
+   if (!xen_pvh &&
+   (x86_platform.legacy.rtc || !x86_platform.legacy.no_vga))
+   return;
+
+   /* PVH detected. */
+   xen_pvh = true;
+
+   /* Make sure we don't fall back to (default) ACPI_IRQ_MODEL_PIC. */
+   if (!nr_ioapics && acpi_irq_model == ACPI_IRQ_MODEL_PIC)
+   acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM;
+
+   machine_ops.emergency_restart = xen_emergency_restart;
+   pv_info.name = "Xen PVH";
+#endif
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_xen_hvm = {
.name   = "Xen HVM",
.detect = xen_platform_hvm,
@@ -233,5 +252,6 @@ const __initconst struct hypervisor_x86 x86_hyper_xen_hvm = 
{
.init.init_platform = xen_hvm_guest_init,
.init.x2apic_available  = xen_x2apic_para_available,
.init.init_mem_mapping  = xen_hvm_init_mem_mapping,
+   .init.guest_late_init   = xen_hvm_guest_late_init,
.runtime.pin_vcpu   = xen_pin_vcpu,
 };
diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index 7bd3ee08393e..436c4f003e17 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -25,13 +25,6 @@ struct boot_params pvh_bootparams 
__attribute__((section(".data")));
 struct hvm_start_info pvh_start_info;
 unsigned int pvh_start_info_sz = sizeof(pvh_start_info);
 
-static void xen_pvh_arch_setup(void)
-{
-   /* Make sure we don't fall back to (default) ACPI_IRQ_MODEL_PIC. */
-   if (nr_ioapics == 0)
-   acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM;
-}
-
 static void __init init_pvh_bootparams(void)
 {
struct xen_memory_map memmap;
@@ -102,6 +95,4 @@ void __init xen_prepare_pvh(void)
wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
 
init_pvh_bootparams();
-
-   x86_init.oem.arch_setup = xen_pvh_arch_setup;
 }
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 1/5] x86: merge x86_hyper into x86_platform and x86_init

2017-11-09 Thread Juergen Gross
Instead of x86_hyper being either NULL on bare metal or a pointer to a
struct hypervisor_x86 in case of the kernel running as a guest merge
the struct into x86_platform and x86_init.

This will remove the need for wrappers making it hard to find out what
is being called. With dummy functions added for all callbacks testing
for a NULL function pointer can be removed, too.

Cc: k...@microsoft.com
Cc: haiya...@microsoft.com
Cc: sthem...@microsoft.com
Cc: akata...@vmware.com
Cc: pbonz...@redhat.com
Cc: rkrc...@redhat.com
Cc: boris.ostrov...@oracle.com
Cc: ru...@rustcorp.com.au
Cc: de...@linuxdriverproject.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Suggested-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/x86/include/asm/hypervisor.h | 25 --
 arch/x86/include/asm/x86_init.h   | 24 +
 arch/x86/kernel/apic/apic.c   |  2 +-
 arch/x86/kernel/cpu/hypervisor.c  | 54 +++
 arch/x86/kernel/cpu/mshyperv.c|  2 +-
 arch/x86/kernel/cpu/vmware.c  |  4 +--
 arch/x86/kernel/kvm.c |  2 +-
 arch/x86/kernel/x86_init.c|  9 +++
 arch/x86/mm/init.c|  2 +-
 arch/x86/xen/enlighten_hvm.c  |  8 +++---
 arch/x86/xen/enlighten_pv.c   |  2 +-
 include/linux/hypervisor.h|  8 --
 12 files changed, 81 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/hypervisor.h 
b/arch/x86/include/asm/hypervisor.h
index 0ead9dbb9130..0eca7239a7aa 100644
--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -23,6 +23,7 @@
 #ifdef CONFIG_HYPERVISOR_GUEST
 
 #include 
+#include 
 #include 
 
 /*
@@ -35,17 +36,11 @@ struct hypervisor_x86 {
/* Detection routine */
uint32_t(*detect)(void);
 
-   /* Platform setup (run once per boot) */
-   void(*init_platform)(void);
+   /* init time callbacks */
+   struct x86_hyper_init init;
 
-   /* X2APIC detection (run once per boot) */
-   bool(*x2apic_available)(void);
-
-   /* pin current vcpu to specified physical cpu (run rarely) */
-   void(*pin_vcpu)(int);
-
-   /* called during init_mem_mapping() to setup early mappings. */
-   void(*init_mem_mapping)(void);
+   /* runtime callbacks */
+   struct x86_hyper_runtime runtime;
 };
 
 extern const struct hypervisor_x86 *x86_hyper;
@@ -58,17 +53,7 @@ extern const struct hypervisor_x86 x86_hyper_xen_hvm;
 extern const struct hypervisor_x86 x86_hyper_kvm;
 
 extern void init_hypervisor_platform(void);
-extern bool hypervisor_x2apic_available(void);
-extern void hypervisor_pin_vcpu(int cpu);
-
-static inline void hypervisor_init_mem_mapping(void)
-{
-   if (x86_hyper && x86_hyper->init_mem_mapping)
-   x86_hyper->init_mem_mapping();
-}
 #else
 static inline void init_hypervisor_platform(void) { }
-static inline bool hypervisor_x2apic_available(void) { return false; }
-static inline void hypervisor_init_mem_mapping(void) { }
 #endif /* CONFIG_HYPERVISOR_GUEST */
 #endif /* _ASM_X86_HYPERVISOR_H */
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 8a1ebf9540dd..ad15a0fda917 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -115,6 +115,18 @@ struct x86_init_pci {
 };
 
 /**
+ * struct x86_hyper_init - x86 hypervisor init functions
+ * @init_platform: platform setup
+ * @x2apic_available:  X2APIC detection
+ * @init_mem_mapping:  setup early mappings during init_mem_mapping()
+ */
+struct x86_hyper_init {
+   void (*init_platform)(void);
+   bool (*x2apic_available)(void);
+   void (*init_mem_mapping)(void);
+};
+
+/**
  * struct x86_init_ops - functions for platform specific setup
  *
  */
@@ -127,6 +139,7 @@ struct x86_init_ops {
struct x86_init_timers  timers;
struct x86_init_iommu   iommu;
struct x86_init_pci pci;
+   struct x86_hyper_init   hyper;
 };
 
 /**
@@ -200,6 +213,15 @@ struct x86_legacy_features {
 };
 
 /**
+ * struct x86_hyper_runtime - x86 hypervisor specific runtime callbacks
+ *
+ * @pin_vcpu:  pin current vcpu to specified physical cpu (run rarely)
+ */
+struct x86_hyper_runtime {
+   void (*pin_vcpu)(int cpu);
+};
+
+/**
  * struct x86_platform_ops - platform specific runtime functions
  * @calibrate_cpu: calibrate CPU
  * @calibrate_tsc: calibrate TSC, if different from CPU
@@ -218,6 +240,7 @@ struct x86_legacy_features {
  * possible in x86_early_init_platform_quirks() by
  * only using the current x86_hardware_subarch
  * semantics.
+ * @hyper: x86 hypervisor specific runtime callbacks
  *

[Xen-devel] [PATCH v2 0/5] x86/xen: support booting PVH guest via standard boot path

2017-11-09 Thread Juergen Gross
Booting a Xen PVH guest requires a special boot entry as it is
mandatory to setup some Xen-specific interfaces rather early. When grub
or OVMF are used as boot loaders, however, those will fill the boot
parameters in zeropage and there is no longer a need to do something
PVH specific in the early boot path.

This patch series adds support for that scenario by identifying PVH
environment and doing the required init steps via Xen hooks instead of
using a dedicated boot entry.

The dedicated entry is still mandatory for support of Dom0 running in
PVH mode as in this case there is no grub or OVMF involved for filling
in the boot parameters.

Changes in V2:
- added new patches 1 and 2

Cc: k...@microsoft.com
Cc: haiya...@microsoft.com
Cc: sthem...@microsoft.com
Cc: akata...@vmware.com
Cc: pbonz...@redhat.com
Cc: rkrc...@redhat.com
Cc: boris.ostrov...@oracle.com
Cc: ru...@rustcorp.com.au
Cc: de...@linuxdriverproject.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: linux-graphics-maintai...@vmware.com
Cc: pv-driv...@vmware.com
Cc: dmitry.torok...@gmail.com
Cc: xdeguill...@vmware.com
Cc: moltm...@vmware.com
Cc: a...@arndb.de
Cc: gre...@linuxfoundation.org
Cc: linux-in...@vger.kernel.org
Cc: r...@rjwysocki.net
Cc: len.br...@intel.com
Cc: pa...@ucw.cz
Cc: linux...@vger.kernel.org

Juergen Gross (5):
  x86: merge x86_hyper into x86_platform and x86_init
  x86: add enum for hypervisors to replace x86_hyper
  x86/acpi: add test for ACPI_FADT_NO_VGA
  x86: add guest_late_init hook to hypervisor_x86 structure
  x86/xen: use guest_late_init to detect Xen PVH guest

 arch/x86/hyperv/hv_init.c |  2 +-
 arch/x86/include/asm/hypervisor.h | 46 +++-
 arch/x86/include/asm/kvm_para.h   |  2 --
 arch/x86/include/asm/x86_init.h   | 27 +
 arch/x86/kernel/acpi/boot.c   |  5 +++
 arch/x86/kernel/apic/apic.c   |  2 +-
 arch/x86/kernel/cpu/hypervisor.c  | 64 +--
 arch/x86/kernel/cpu/mshyperv.c|  6 ++--
 arch/x86/kernel/cpu/vmware.c  |  8 ++---
 arch/x86/kernel/kvm.c |  9 +++---
 arch/x86/kernel/setup.c   |  2 +-
 arch/x86/kernel/x86_init.c| 10 ++
 arch/x86/mm/init.c|  2 +-
 arch/x86/xen/enlighten_hvm.c  | 36 +-
 arch/x86/xen/enlighten_pv.c   |  6 ++--
 arch/x86/xen/enlighten_pvh.c  |  9 --
 drivers/hv/vmbus_drv.c|  2 +-
 drivers/input/mouse/vmmouse.c | 10 +++---
 drivers/misc/vmw_balloon.c|  2 +-
 include/linux/hypervisor.h|  8 +++--
 20 files changed, 153 insertions(+), 105 deletions(-)

-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] mini-os: add a coding style file

2017-11-09 Thread Juergen Gross
On 09/11/17 13:31, Wei Liu wrote:
> On Thu, Nov 09, 2017 at 01:10:12PM +0100, Juergen Gross wrote:
>> Since carving out Mini-OS from the Xen repository there hasn't been a
>> description of the preferred coding style. Copy the Xen CODING_STYLE
>> file.
>>
> 
> I welcome such addition. I have no opinion in actual style used though.
> I just want consistency. :-)

Is this an Ack?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] mini-os: add a coding style file

2017-11-09 Thread Juergen Gross
Since carving out Mini-OS from the Xen repository there hasn't been a
description of the preferred coding style. Copy the Xen CODING_STYLE
file.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 CODING_STYLE | 109 +++
 1 file changed, 109 insertions(+)
 create mode 100644 CODING_STYLE

diff --git a/CODING_STYLE b/CODING_STYLE
new file mode 100644
index 000..539aa60
--- /dev/null
+++ b/CODING_STYLE
@@ -0,0 +1,109 @@
+Coding Style for Mini-OS
+
+
+Indentation
+---
+
+Indenting uses spaces, not tabs - in contrast to Linux.  An indent
+level consists of four spaces.  Code within blocks is indented by one
+extra indent level.  The enclosing braces of a block are indented the
+same as the code _outside_ the block.  e.g.
+
+void fun(void)
+{
+/* One level of indent. */
+
+{
+/* A second level of indent. */
+}
+}
+
+White space
+---
+
+Space characters are used to spread out logical statements, such as in
+the condition of an if or while.  Spaces are placed between the
+keyword and the brackets surrounding the condition, between the
+brackets and the condition itself, and around binary operators (except
+the structure access operators, '.' and '->'). e.g.
+
+if ( (wibble & wombat) == 42 )
+{
+...
+
+There should be no trailing white space at the end of lines (including
+after the opening /* of a comment block).
+
+Line Length
+---
+
+Lines should be less than 80 characters in length.  Long lines should
+be split at sensible places and the trailing portions indented.
+
+User visible strings (e.g., printk() messages) should not be split so
+they can searched for more easily.
+
+Bracing
+---
+
+Braces ('{' and '}') are usually placed on a line of their own, except
+for the do/while loop.  This is unlike the Linux coding style and
+unlike K  do/while loops are an exception. e.g.:
+
+if ( condition )
+{
+/* Do stuff. */
+}
+else
+{
+/* Other stuff. */
+}
+
+while ( condition )
+{
+/* Do stuff. */
+}
+
+do {
+/* Do stuff. */
+} while ( condition );
+
+etc.
+
+Braces should be omitted for blocks with a single statement. e.g.,
+
+if ( condition )
+single_statement();
+
+Comments
+
+
+Only C style /* ... */ comments are to be used.  C++ style // comments
+should not be used.  Multi-word comments should begin with a capital
+letter.  Comments containing a single sentence may end with a full
+stop; comments containing several sentences must have a full stop
+after each sentence.
+
+Multi-line comment blocks should start and end with comment markers on
+separate lines and each line should begin with a leading '*'.
+
+/*
+ * Example, multi-line comment block.
+ *
+ * Note beginning and end markers on separate lines and leading '*'.
+ */
+
+Emacs local variables
+-
+
+A comment block containing local variables for emacs is permitted at
+the end of files.  It should be:
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] Xen/pciback: Implement PCI slot or bus reset with 'do_flr' SysFS attribute

2017-11-08 Thread Juergen Gross
On 08/11/17 16:00, Govinda Tatti wrote:
> Thanks Roger for your review comments. Please see below for my comments.
> 
> On 11/7/2017 5:21 AM, Roger Pau Monné wrote:
>> On Mon, Nov 06, 2017 at 12:48:42PM -0500, Govinda Tatti wrote:
>>> +out:
>>> +    if (!err)
>>> +    err = count;
>>> +
>>> +    return err;
>> What's the purpose of returning count here?
> Not sure. Will check with Konrad and address this comment.

You are telling sysfs that you have consumed all input characters.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Juergen Gross
On 08/11/17 15:10, Boris Ostrovsky wrote:
> On 11/08/2017 08:36 AM, Juergen Gross wrote:
>>
>> Regarding ACPI tables: current PVH implementation in Linux kernel
>> seems not to make use of the special information presented in the boot
>> information block.
> 
> It will need to do so for dom0 (and, then, for simplicity, for all PVH
> guests).

What about: for all PVH guests booted via the PVH specific boot entry?
A guest booted via the default boot entry won't know it is PVH until
ACPI tables have been scanned.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/3] x86/xen: support booting PVH guest via standard boot path

2017-11-08 Thread Juergen Gross
On 08/11/17 14:37, Boris Ostrovsky wrote:
> On 11/08/2017 04:07 AM, Juergen Gross wrote:
>> Booting a Xen PVH guest requires a special boot entry as it is
>> mandatory to setup some Xen-specific interfaces rather early. When grub
>> or OVMF are used as boot loaders, however, those will fill the boot
>> parameters in zeropage and there is no longer a need to do something
>> PVH specific in the early boot path.
>>
>> This patch series adds support for that scenario by identifying PVH
>> environment and doing the required init steps via Xen hooks instead of
>> using a dedicated boot entry.
>>
>> The dedicated entry is still needed for support of Dom0 running in PVH
>> mode as in this case there is no grub or OVMF involved for filling in
>> the boot parameters.
> 
> We are going to continue supporting direct boot of unprivileged guests
> too so this entry point will be needed not for dom0 only.

Sure, but using e.g. grub in this case would be an alternative. For Dom0
this alternative isn't existing. So this entry is mandatory, not a "nice
to have".


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Juergen Gross
On 08/11/17 13:58, Jan Beulich wrote:
 On 08.11.17 at 13:45,  wrote:
>> On 08/11/17 13:31, Jan Beulich wrote:
>> On 08.11.17 at 12:55,  wrote:
 On 08/11/17 12:18, Jan Beulich wrote:
 On 08.11.17 at 10:07,  wrote:
>> In case we are booted via the default boot entry by a generic loader
>> like grub or OVMF it is necessary to distinguish between a HVM guest
>> with a device model supporting legacy devices and a PVH guest without
>> device model.
>>
>> PVH guests will always have x86_platform.legacy.no_vga set and
>> x86_platform.legacy.rtc cleared, while both won't be true for HVM
>> guests.
>>
>> Test for both conditions in the guest_late_init hook and set xen_pvh
>> to true if they are met.
>
> This sounds pretty fragile to me: I can't see a reason why a proper
> HVM guest couldn't come without VGA and RTC. That's not possible
> today, agreed, but certainly an option down the road if virtualization
> follows bare metal's road towards being legacy free.

 From guest's perspective: what is the difference between a legacy free
 HVM domain and PVH? In the end the need for differentiating is to avoid
 access to legacy features in PVH as those would require a device model.
>>>
>>> My point is that "legacy free" would likely be reached over time (and
>>> even once fully reached, hybrid configurations would be possible).
>>> I.e. there could be a setup with PIC, but with neither VGA nor RTC.
>>> That's still not PVH then. Nor do all legacy features require a device
>>> model in the first place - some of them are being emulated entirely
>>> in the hypervisor.
>>>
>>> Furthermore, PVH absolutely requires guest awareness afaict, while
>>> legacy-free pure HVM guests (with an OS only aware of the possible
>>> absence of legacy devices) would still be possible.
>>
>> Hmm, where else do you expect PVH awareness to be required? Maybe for
>> vcpu hotplugging, but this could easily be solved by adding a Xenstore
>> entry containing the required information. Is there any other problem to
>> be expected before Xenstore access is possible?
> 
> Let me ask the question the other way around: What's all the PVH
> specific code for under arch/x86/xen/ if there's no difference? One

Most of it is for early boot when coming through the PVH specific
boot entry.

> thing I seem to remember is that getting hold of the ACPI tables
> is different between PVH and HVM. Iirc the distinct PVH entry point
> is (in part) for that purpose. In the end - with that separate entry
> point - it is not really clear to me why any "detection" needs to be
> done in the first place: You'd know which mode you're in by knowing
> which entry point path you've taken.

Its all in the commit message: I am trying to enable a boot loader to
use the default kernel boot entry for PVH. This will reduce the needed
modifications in the loader.

Regarding ACPI tables: current PVH implementation in Linux kernel
seems not to make use of the special information presented in the boot
information block.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/privcmd: remove unused variable pageidx

2017-11-08 Thread Juergen Gross
On 08/11/17 14:00, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> Variable pageidx is assigned a value but it is never read, hence it
> is redundant and can be removed. Cleans up clang warning:
> 
> drivers/xen/privcmd.c:199:2: warning: Value stored to 'pageidx'
> is never read
> 
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Juergen Gross
On 08/11/17 13:31, Jan Beulich wrote:
 On 08.11.17 at 12:55,  wrote:
>> On 08/11/17 12:18, Jan Beulich wrote:
>> On 08.11.17 at 10:07,  wrote:
 In case we are booted via the default boot entry by a generic loader
 like grub or OVMF it is necessary to distinguish between a HVM guest
 with a device model supporting legacy devices and a PVH guest without
 device model.

 PVH guests will always have x86_platform.legacy.no_vga set and
 x86_platform.legacy.rtc cleared, while both won't be true for HVM
 guests.

 Test for both conditions in the guest_late_init hook and set xen_pvh
 to true if they are met.
>>>
>>> This sounds pretty fragile to me: I can't see a reason why a proper
>>> HVM guest couldn't come without VGA and RTC. That's not possible
>>> today, agreed, but certainly an option down the road if virtualization
>>> follows bare metal's road towards being legacy free.
>>
>> From guest's perspective: what is the difference between a legacy free
>> HVM domain and PVH? In the end the need for differentiating is to avoid
>> access to legacy features in PVH as those would require a device model.
> 
> My point is that "legacy free" would likely be reached over time (and
> even once fully reached, hybrid configurations would be possible).
> I.e. there could be a setup with PIC, but with neither VGA nor RTC.
> That's still not PVH then. Nor do all legacy features require a device
> model in the first place - some of them are being emulated entirely
> in the hypervisor.
> 
> Furthermore, PVH absolutely requires guest awareness afaict, while
> legacy-free pure HVM guests (with an OS only aware of the possible
> absence of legacy devices) would still be possible.

Hmm, where else do you expect PVH awareness to be required? Maybe for
vcpu hotplugging, but this could easily be solved by adding a Xenstore
entry containing the required information. Is there any other problem to
be expected before Xenstore access is possible?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Juergen Gross
On 08/11/17 13:26, Paolo Bonzini wrote:
> On 08/11/2017 13:24, Juergen Gross wrote:
>>> My understanding of Xen is very rusty at this point, but I think a
>>> "completely" legacy-free HVM domain will still have a PCI bus and the
>>> Xen platform device on that bus.
>>>
>>> A PVH domain just knows how to access the Xen PV features.
>>
>> A HVM domain does so, too. Today maybe only partially, but e.g. event
>> channels work in a HVM domain even without the Xen platform device.
>> Grant tables can be made working without the platform device, too,
>> and I'm already preparing a patch to do exactly that.
> 
> What about assigned PCI devices?  I think they are not PV pcifront for
> HVM.  So the main difference in the end is the PCI bus.

Sure, but this is easily detectable, isn't it?


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Juergen Gross
On 08/11/17 13:03, Paolo Bonzini wrote:
> On 08/11/2017 12:55, Juergen Gross wrote:
>> On 08/11/17 12:18, Jan Beulich wrote:
>>>>>> On 08.11.17 at 10:07, <jgr...@suse.com> wrote:
>>>> In case we are booted via the default boot entry by a generic loader
>>>> like grub or OVMF it is necessary to distinguish between a HVM guest
>>>> with a device model supporting legacy devices and a PVH guest without
>>>> device model.
>>>>
>>>> PVH guests will always have x86_platform.legacy.no_vga set and
>>>> x86_platform.legacy.rtc cleared, while both won't be true for HVM
>>>> guests.
>>>>
>>>> Test for both conditions in the guest_late_init hook and set xen_pvh
>>>> to true if they are met.
>>>
>>> This sounds pretty fragile to me: I can't see a reason why a proper
>>> HVM guest couldn't come without VGA and RTC. That's not possible
>>> today, agreed, but certainly an option down the road if virtualization
>>> follows bare metal's road towards being legacy free.
>>
>> From guest's perspective: what is the difference between a legacy free
>> HVM domain and PVH? In the end the need for differentiating is to avoid
>> access to legacy features in PVH as those would require a device model.
> 
> My understanding of Xen is very rusty at this point, but I think a
> "completely" legacy-free HVM domain will still have a PCI bus and the
> Xen platform device on that bus.
> 
> A PVH domain just knows how to access the Xen PV features.

A HVM domain does so, too. Today maybe only partially, but e.g. event
channels work in a HVM domain even without the Xen platform device.
Grant tables can be made working without the platform device, too,
and I'm already preparing a patch to do exactly that.

The only need for the platform device will then be to have an
interface for unplugging emulated boot devices in favor of the pv
devices. And without the platform device we can just skip that
step without doing any harm, as this can happen only for PVH where
we have no need to do the unplug, or for HVM explicitly configured
to work without platform device needing to continue using the
emulated devices as it is doing today in this case.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Juergen Gross
On 08/11/17 12:18, Jan Beulich wrote:
 On 08.11.17 at 10:07,  wrote:
>> In case we are booted via the default boot entry by a generic loader
>> like grub or OVMF it is necessary to distinguish between a HVM guest
>> with a device model supporting legacy devices and a PVH guest without
>> device model.
>>
>> PVH guests will always have x86_platform.legacy.no_vga set and
>> x86_platform.legacy.rtc cleared, while both won't be true for HVM
>> guests.
>>
>> Test for both conditions in the guest_late_init hook and set xen_pvh
>> to true if they are met.
> 
> This sounds pretty fragile to me: I can't see a reason why a proper
> HVM guest couldn't come without VGA and RTC. That's not possible
> today, agreed, but certainly an option down the road if virtualization
> follows bare metal's road towards being legacy free.

From guest's perspective: what is the difference between a legacy free
HVM domain and PVH? In the end the need for differentiating is to avoid
access to legacy features in PVH as those would require a device model.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 2/3] x86: add guest_late_init hook to hypervisor_x86 structure

2017-11-08 Thread Juergen Gross
On 08/11/17 10:40, Ingo Molnar wrote:
> 
> * Juergen Gross <jgr...@suse.com> wrote:
> 
>>> Plus add a default empty function (which hypervisors can override). This 
>>> avoids 
>>> all the hidden conditions and wrappery.
>>
>> Hmm, x86_hyper is just a pointer being NULL on bare metal. So we would
>> have to add a pre-filled struct with dummy functions and in case a
>> hypervisor is detected we'd need to copy all non-NULL pointers of the
>> hypervisor specific struct to the pre-filled one.
> 
> Ok, I missed that detail - but yeah, since this is mostly about boot code,
> where readability is king, I still think it would be an overall improvement.
> 
> This is the pattern we are trying to use with x86_platform ops for example, 
> and 
> doing:
> 
>   git grep -w x86_platform arch/x86
> 
> gives a pretty clear idea about how it's used - while if it was all within 
> wrappers it would be a lot more difficult to discover all the callsites.
> 
> Admittedly it's not done totally consistently:
> 
>  arch/x86/kernel/apic/probe_32.c:if (x86_platform.apic_post_init)
>  arch/x86/kernel/apic/probe_64.c:if (x86_platform.apic_post_init)
>  arch/x86/kernel/ebda.c: if (!x86_platform.legacy.reserve_bios_regions)
>  arch/x86/kernel/platform-quirks.c:  if (x86_platform.set_legacy_features)
>  arch/x86/platform/intel-mid/device_libs/platform_mrfld_rtc.c:   if 
> (!x86_platform.legacy.rtc)
> 
> ... but the _idea_ behind it is consistent ;-)
> 
>> In case there are no objections I can add a patch to modify the current
>> way x86_hyper is used to the pre-filled variant.
> 
> Yeah, sounds good to me!

Okay. With you mentioning x86_platform: should I make x86_hyper a member
of x86_platform (e.g. x86_platform.hyper.) ?

I think this would fit quite nice.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 2/3] x86: add guest_late_init hook to hypervisor_x86 structure

2017-11-08 Thread Juergen Gross
On 08/11/17 10:13, Ingo Molnar wrote:
> 
> * Juergen Gross <jgr...@suse.com> wrote:
> 
>> Add a new guest_late_init hook to the hypervisor_x86 structure. It
>> will replace the current kvm_guest_init() call which is changed to
>> make use of the new hook.
>>
>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>> ---
>>  arch/x86/include/asm/hypervisor.h | 11 +++
>>  arch/x86/include/asm/kvm_para.h   |  2 --
>>  arch/x86/kernel/kvm.c |  3 ++-
>>  arch/x86/kernel/setup.c   |  2 +-
>>  4 files changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/hypervisor.h 
>> b/arch/x86/include/asm/hypervisor.h
>> index 0ead9dbb9130..37320687b8cb 100644
>> --- a/arch/x86/include/asm/hypervisor.h
>> +++ b/arch/x86/include/asm/hypervisor.h
>> @@ -38,6 +38,9 @@ struct hypervisor_x86 {
>>  /* Platform setup (run once per boot) */
>>  void(*init_platform)(void);
>>  
>> +/* Guest late init */
>> +void(*guest_late_init)(void);
>> +
>>  /* X2APIC detection (run once per boot) */
>>  bool(*x2apic_available)(void);
>>  
>> @@ -66,9 +69,17 @@ static inline void hypervisor_init_mem_mapping(void)
>>  if (x86_hyper && x86_hyper->init_mem_mapping)
>>  x86_hyper->init_mem_mapping();
>>  }
>> +
>> +static inline void hypervisor_guest_late_init(void)
>> +{
>> +if (x86_hyper && x86_hyper->guest_late_init)
>> +x86_hyper->guest_late_init();
>> +}
>> +
>>  #else
>>  static inline void init_hypervisor_platform(void) { }
>>  static inline bool hypervisor_x2apic_available(void) { return false; }
>>  static inline void hypervisor_init_mem_mapping(void) { }
>> +static inline void hypervisor_guest_late_init(void) { }
>>  #endif /* CONFIG_HYPERVISOR_GUEST */
>>  #endif /* _ASM_X86_HYPERVISOR_H */
>> diff --git a/arch/x86/include/asm/kvm_para.h 
>> b/arch/x86/include/asm/kvm_para.h
>> index c373e44049b1..7b407dda2bd7 100644
>> --- a/arch/x86/include/asm/kvm_para.h
>> +++ b/arch/x86/include/asm/kvm_para.h
>> @@ -88,7 +88,6 @@ static inline long kvm_hypercall4(unsigned int nr, 
>> unsigned long p1,
>>  #ifdef CONFIG_KVM_GUEST
>>  bool kvm_para_available(void);
>>  unsigned int kvm_arch_para_features(void);
>> -void __init kvm_guest_init(void);
>>  void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
>>  void kvm_async_pf_task_wake(u32 token);
>>  u32 kvm_read_and_reset_pf_reason(void);
>> @@ -103,7 +102,6 @@ static inline void kvm_spinlock_init(void)
>>  #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>>  
>>  #else /* CONFIG_KVM_GUEST */
>> -#define kvm_guest_init() do {} while (0)
>>  #define kvm_async_pf_task_wait(T, I) do {} while(0)
>>  #define kvm_async_pf_task_wake(T) do {} while(0)
>>  
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 8bb9594d0761..d331b5060aa9 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -465,7 +465,7 @@ static void __init kvm_apf_trap_init(void)
>>  update_intr_gate(X86_TRAP_PF, async_page_fault);
>>  }
>>  
>> -void __init kvm_guest_init(void)
>> +static void __init kvm_guest_init(void)
>>  {
>>  int i;
>>  
>> @@ -548,6 +548,7 @@ const struct hypervisor_x86 x86_hyper_kvm __refconst = {
>>  .name   = "KVM",
>>  .detect = kvm_detect,
>>  .x2apic_available   = kvm_para_available,
>> +.guest_late_init= kvm_guest_init,
>>  };
>>  EXPORT_SYMBOL_GPL(x86_hyper_kvm);
>>  
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 0957dd73d127..578569481d87 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -1294,7 +1294,7 @@ void __init setup_arch(char **cmdline_p)
>>  
>>  io_apic_init_mappings();
>>  
>> -kvm_guest_init();
>> +hypervisor_guest_late_init();
> 
> No principal objections, but could we please use a more obvious pattern 
> showing 
> that this is a callback, by calling it directly:
>   
>   x86_hyper->guest_late_init();
> 
> Plus add a default empty function (which hypervisors can override). This 
> avoids 
> all the hidden conditions and wrappery.

Hmm, x86_hyper is just a pointer being NULL on bare metal. So we would
have to add a pre-filled struct with dummy functions and in case a
hypervisor is detected we'd need to copy all non-NULL pointers of the
hypervisor specific struct to the pre-filled one.

In case there are no objections I can add a patch to modify the current
way x86_hyper is used to the pre-filled variant.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 1/3] x86/acpi: add test for ACPI_FADT_NO_VGA

2017-11-08 Thread Juergen Gross
Add a test for ACPI_FADT_NO_VGA when scanning the FADT and set the new
flag x86_platform.legacy.no_vga accordingly.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/x86/include/asm/x86_init.h | 1 +
 arch/x86/kernel/acpi/boot.c | 5 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 8a1ebf9540dd..75a60fa52a8a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -195,6 +195,7 @@ enum x86_legacy_i8042_state {
 struct x86_legacy_features {
enum x86_legacy_i8042_state i8042;
int rtc;
+   int no_vga;
int reserve_bios_regions;
struct x86_legacy_devices devices;
 };
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 079535e53e2a..ef9e02e614d0 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -961,6 +961,11 @@ static int __init acpi_parse_fadt(struct acpi_table_header 
*table)
x86_platform.legacy.rtc = 0;
}
 
+   if (acpi_gbl_FADT.boot_flags & ACPI_FADT_NO_VGA) {
+   pr_debug("ACPI: probing for VGA not safe\n");
+   x86_platform.legacy.no_vga = 1;
+   }
+
 #ifdef CONFIG_X86_PM_TIMER
/* detect the location of the ACPI PM Timer */
if (acpi_gbl_FADT.header.revision >= FADT2_REVISION_ID) {
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Juergen Gross
In case we are booted via the default boot entry by a generic loader
like grub or OVMF it is necessary to distinguish between a HVM guest
with a device model supporting legacy devices and a PVH guest without
device model.

PVH guests will always have x86_platform.legacy.no_vga set and
x86_platform.legacy.rtc cleared, while both won't be true for HVM
guests.

Test for both conditions in the guest_late_init hook and set xen_pvh
to true if they are met.

Move some of the early PVH initializations to the new hook in order
to avoid duplicated code.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/x86/xen/enlighten_hvm.c | 24 ++--
 arch/x86/xen/enlighten_pvh.c |  9 -
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index de503c225ae1..d7d68a39073a 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -1,3 +1,4 @@
+#include 
 #include 
 #include 
 #include 
@@ -188,8 +189,6 @@ static void __init xen_hvm_guest_init(void)
xen_hvm_init_time_ops();
xen_hvm_init_mmu_ops();
 
-   if (xen_pvh_domain())
-   machine_ops.emergency_restart = xen_emergency_restart;
 #ifdef CONFIG_KEXEC_CORE
machine_ops.shutdown = xen_hvm_shutdown;
machine_ops.crash_shutdown = xen_hvm_crash_shutdown;
@@ -226,6 +225,26 @@ static uint32_t __init xen_platform_hvm(void)
return xen_cpuid_base();
 }
 
+static __init void xen_hvm_guest_late_init(void)
+{
+#ifdef CONFIG_XEN_PVH
+   /* Test for PVH domain (PVH boot path taken overrides ACPI flags). */
+   if (!xen_pvh &&
+   (x86_platform.legacy.rtc || !x86_platform.legacy.no_vga))
+   return;
+
+   /* PVH detected. */
+   xen_pvh = true;
+
+   /* Make sure we don't fall back to (default) ACPI_IRQ_MODEL_PIC. */
+   if (!nr_ioapics && acpi_irq_model == ACPI_IRQ_MODEL_PIC)
+   acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM;
+
+   machine_ops.emergency_restart = xen_emergency_restart;
+   pv_info.name = "Xen PVH";
+#endif
+}
+
 const struct hypervisor_x86 x86_hyper_xen_hvm = {
.name   = "Xen HVM",
.detect = xen_platform_hvm,
@@ -233,5 +252,6 @@ const struct hypervisor_x86 x86_hyper_xen_hvm = {
.pin_vcpu   = xen_pin_vcpu,
.x2apic_available   = xen_x2apic_para_available,
.init_mem_mapping   = xen_hvm_init_mem_mapping,
+   .guest_late_init= xen_hvm_guest_late_init,
 };
 EXPORT_SYMBOL(x86_hyper_xen_hvm);
diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index 7bd3ee08393e..436c4f003e17 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -25,13 +25,6 @@ struct boot_params pvh_bootparams 
__attribute__((section(".data")));
 struct hvm_start_info pvh_start_info;
 unsigned int pvh_start_info_sz = sizeof(pvh_start_info);
 
-static void xen_pvh_arch_setup(void)
-{
-   /* Make sure we don't fall back to (default) ACPI_IRQ_MODEL_PIC. */
-   if (nr_ioapics == 0)
-   acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM;
-}
-
 static void __init init_pvh_bootparams(void)
 {
struct xen_memory_map memmap;
@@ -102,6 +95,4 @@ void __init xen_prepare_pvh(void)
wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
 
init_pvh_bootparams();
-
-   x86_init.oem.arch_setup = xen_pvh_arch_setup;
 }
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 2/3] x86: add guest_late_init hook to hypervisor_x86 structure

2017-11-08 Thread Juergen Gross
Add a new guest_late_init hook to the hypervisor_x86 structure. It
will replace the current kvm_guest_init() call which is changed to
make use of the new hook.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/x86/include/asm/hypervisor.h | 11 +++
 arch/x86/include/asm/kvm_para.h   |  2 --
 arch/x86/kernel/kvm.c |  3 ++-
 arch/x86/kernel/setup.c   |  2 +-
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/hypervisor.h 
b/arch/x86/include/asm/hypervisor.h
index 0ead9dbb9130..37320687b8cb 100644
--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -38,6 +38,9 @@ struct hypervisor_x86 {
/* Platform setup (run once per boot) */
void(*init_platform)(void);
 
+   /* Guest late init */
+   void(*guest_late_init)(void);
+
/* X2APIC detection (run once per boot) */
bool(*x2apic_available)(void);
 
@@ -66,9 +69,17 @@ static inline void hypervisor_init_mem_mapping(void)
if (x86_hyper && x86_hyper->init_mem_mapping)
x86_hyper->init_mem_mapping();
 }
+
+static inline void hypervisor_guest_late_init(void)
+{
+   if (x86_hyper && x86_hyper->guest_late_init)
+   x86_hyper->guest_late_init();
+}
+
 #else
 static inline void init_hypervisor_platform(void) { }
 static inline bool hypervisor_x2apic_available(void) { return false; }
 static inline void hypervisor_init_mem_mapping(void) { }
+static inline void hypervisor_guest_late_init(void) { }
 #endif /* CONFIG_HYPERVISOR_GUEST */
 #endif /* _ASM_X86_HYPERVISOR_H */
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c373e44049b1..7b407dda2bd7 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -88,7 +88,6 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned 
long p1,
 #ifdef CONFIG_KVM_GUEST
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
-void __init kvm_guest_init(void);
 void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
@@ -103,7 +102,6 @@ static inline void kvm_spinlock_init(void)
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 
 #else /* CONFIG_KVM_GUEST */
-#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T, I) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594d0761..d331b5060aa9 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,7 +465,7 @@ static void __init kvm_apf_trap_init(void)
update_intr_gate(X86_TRAP_PF, async_page_fault);
 }
 
-void __init kvm_guest_init(void)
+static void __init kvm_guest_init(void)
 {
int i;
 
@@ -548,6 +548,7 @@ const struct hypervisor_x86 x86_hyper_kvm __refconst = {
.name   = "KVM",
.detect = kvm_detect,
.x2apic_available   = kvm_para_available,
+   .guest_late_init= kvm_guest_init,
 };
 EXPORT_SYMBOL_GPL(x86_hyper_kvm);
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 0957dd73d127..578569481d87 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1294,7 +1294,7 @@ void __init setup_arch(char **cmdline_p)
 
io_apic_init_mappings();
 
-   kvm_guest_init();
+   hypervisor_guest_late_init();
 
e820__reserve_resources();
e820__register_nosave_regions(max_low_pfn);
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 0/3] x86/xen: support booting PVH guest via standard boot path

2017-11-08 Thread Juergen Gross
Booting a Xen PVH guest requires a special boot entry as it is
mandatory to setup some Xen-specific interfaces rather early. When grub
or OVMF are used as boot loaders, however, those will fill the boot
parameters in zeropage and there is no longer a need to do something
PVH specific in the early boot path.

This patch series adds support for that scenario by identifying PVH
environment and doing the required init steps via Xen hooks instead of
using a dedicated boot entry.

The dedicated entry is still needed for support of Dom0 running in PVH
mode as in this case there is no grub or OVMF involved for filling in
the boot parameters.

Juergen Gross (3):
  x86/acpi: add test for ACPI_FADT_NO_VGA
  x86: add guest_late_init hook to hypervisor_x86 structure
  x86/xen: use guest_late_init to detect Xen PVH guest

 arch/x86/include/asm/hypervisor.h | 11 +++
 arch/x86/include/asm/kvm_para.h   |  2 --
 arch/x86/include/asm/x86_init.h   |  1 +
 arch/x86/kernel/acpi/boot.c   |  5 +
 arch/x86/kernel/kvm.c |  3 ++-
 arch/x86/kernel/setup.c   |  2 +-
 arch/x86/xen/enlighten_hvm.c  | 24 ++--
 arch/x86/xen/enlighten_pvh.c  |  9 -
 8 files changed, 42 insertions(+), 15 deletions(-)

-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-07 Thread Juergen Gross
On 06/11/17 12:36, Juergen Gross wrote:
> On 03/11/17 13:17, Roger Pau Monné wrote:
>> On Fri, Nov 03, 2017 at 01:00:46PM +0100, Juergen Gross wrote:
>>> On 29/09/17 17:51, Roger Pau Monné wrote:
>>>> On Fri, Sep 29, 2017 at 03:33:58PM +, Juergen Gross wrote:
>>>>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>>>>> On Fri, Sep 29, 2017 at 02:46:53PM +, Juergen Gross wrote:
>>>>>> Then, I also wonder whether it would make sense for this grub to load
>>>>>> the kernel using the PVH entry point or the native entry point. Would
>>>>>> it be possible to boot a Linux kernel up to the point where cpuid can
>>>>>> be used inside of a PVH container?
>>>>>
>>>>> I don't think today's Linux allows that. This has been discussed
>>>>> very thoroughly at the time Boris added PVH V2 support to the kernel.
>>>>
>>>> OK, I'm not going to insist on that, but my plans for FreeBSD is to
>>>> make the native entry point capable of booting inside of a PVH
>>>> container up to the point where cpuid (or whatever method) can be used
>>>> to detect the environment.
>>>
>>> Looking more thoroughly into the Linux boot code I think this could
>>> work for Linux, too. But only if we can tell PVH from HVM in the guest.
>>> How would you do that in FreeBSD? Via flags in the boot params? This
>>> would the have to be done in the boot loader (e.g. grub or OVMF).
>>
>> My plan was not to differentiate between HVM and PVH, but rather to
>> make use of the ACPI information in order to decide which devices are
>> available and which are not inside of a PVH guest.
>>
>> For example in the FADT "IA-PC Boot Architecture Flags" field for PVH
>> we already set "VGA Not Present" and "CMOS RTC Not Present". There
>> might be other flags/fields that must be set, but I would like to
>> avoid having a CPUID bit or similar saying "PVH", because then Xen
>> will be tied to always providing the same set of devices in PVH
>> containers.
> 
> I looked through the xen_pvh_domain() use cases in the Linux kernel
> again.
> 
> Maybe we can really manage to not need differentiating PVH from HVM
> until ACPI table scan. We'd need another hook for Xen, but this should
> be easy as KVM already has a hook where we'd need one. So this can be
> made more general and we are fine.
> 
> I even think we can drop some of the PVH tests, as the PVH-specific
> handling (e.g. for grant table initialization) should work for HVM, too.

So I did a little test now: with some small patches added I've managed
to boot a PVH Linux kernel without any special handling in the early
PVH paths of the kernel (only setting up initial page tables and faking
a memory map similar to the one grub would deliver). PVH detection is
done via ACPI table ("VGA Not Present" and "CMOS RTC Not Present" in a
HVM domain will result in PVH assumed).

If nobody objects to this handling I'll send patches soon.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-06 Thread Juergen Gross
On 06/11/17 17:42, Boris Ostrovsky wrote:
> On 11/06/2017 10:05 AM, Juergen Gross wrote:
>> On 06/11/17 15:51, Boris Ostrovsky wrote:
>>> On 11/06/2017 02:16 AM, Juergen Gross wrote:
>>>> On 03/11/17 20:00, Boris Ostrovsky wrote:
>>>>> On 11/03/2017 02:40 PM, Juergen Gross wrote:
>>>>>> On 03/11/17 19:35, Boris Ostrovsky wrote:
>>>>>>> On 11/03/2017 02:23 PM, Juergen Gross wrote:
>>>>>>>> On 03/11/17 19:19, Boris Ostrovsky wrote:
>>>>>>>>> On 11/03/2017 02:05 PM, Juergen Gross wrote:
>>>>>>>>>> So again the question: how to tell whether we are PVH or HVM in
>>>>>>>>>> init_hypervisor_platform()? ACPi tables are scanned way later...
>>>>>>>>> Can we make grub/OVMF append a boot option?
>>>>>>>>>
>>>>>>>>> Or set setup_header.hardware_subarch to something? We already have
>>>>>>>>> X86_SUBARCH_XEN but it is only used by PV.  Or we might be able to use
>>>>>>>>> hardware_subarch_data (will need to get a buy-in from x86 
>>>>>>>>> maintainers, I
>>>>>>>>> think).
>>>>>>>> But wouldn't this break the idea to reuse the native boot paths in
>>>>>>>> grub/OVMF without further modifications?
>>>>>>> WDYM? We will have to have some sort of a plugin in either one to build
>>>>>>> the zeropage anyway. So we'd set hardware_subarch there, in addition to
>>>>>>> other things like setting memory and such.
>>>>>> But isn't the zeropage already being built? I admit that setting subarch
>>>>>> isn't a big deal, but using another entry with a passed-through pvh
>>>>>> start struct isn't either...
>>>>> I don't follow, sorry. My understanding is that zeropage will be built
>>>>> by PVH-enlightened grub so part of this process would be setting the
>>>>> subarch bit.
>>>> My reasoning was based on Roger's remark:
>>>>
>>>> "OTOH if Linux is capable of booting from the native entry point inside
>>>> of a PVH container, we would only have to port OVMF and grub in order
>>>> to work inside of a PVH container, leaving the rest of the logic
>>>> untouched."
>>> Right, and in my mind porting OVMF/grub includes creating proper zeropage.
>> Aah, okay. I reasoned on the assumption to just enable OVMF/grub to run
>> in PVH environment without touching the parts setting up anything for
>> the new kernel.
> 
> Someone needs to do what xen_prepare_pvh() does.

As the loader is filling in the memory map information the only thing
remaining would be setting xen_pvh. And this could be delayed as my test
have shown, so we only need to detect the PVH case from inside the
kernel. One possibility would be the flags in the ACPI FADT table as
Roger mentioned, another idea would be a flag in zeropage set by the
loader.

> And, for 64-bit, we also may need to build early pagetables since
> startup_64() (unlike startup_32()) expects paging to be on. (I don't
> know whether this is already part of standard FW codepath)

This would be done the same way as for a native kernel.

>>> BTW, another option might be to "type_of_loader = (9 << 4) | 0", which
>>> is what init_pvh_bootparams() does. In fact, whatever is done in the
>>> firmware should probably match what that routine does.
>> So it wouldn't be possible any longer to tell whether the kernel has
>> been booted directly or via grub. I don't like this. The loader type
>> is accessible via sysfs after all.
> 
> I didn't know that. What is the path?

/proc/sys/kernel/bootloader_type


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: fix potential endless loop in pvcalls-front.c

2017-11-06 Thread Juergen Gross
On 06/11/17 23:17, Stefano Stabellini wrote:
> mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
> take in_mutex on the first try, but you can't take out_mutex. Next times
> you call mutex_trylock() in_mutex is going to fail. It's an endless
> loop.
> 
> Solve the problem by moving the two mutex_trylock calls to two separate
> loops.
> 
> Reported-by: Dan Carpenter 
> Signed-off-by: Stefano Stabellini 
> CC: boris.ostrov...@oracle.com
> CC: jgr...@suse.com
> ---
>  drivers/xen/pvcalls-front.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 0c1ec68..047dce7 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -1048,8 +1048,9 @@ int pvcalls_front_release(struct socket *sock)
>* is set to NULL -- we only need to wait for the existing
>* waiters to return.
>*/
> - while (!mutex_trylock(>active.in_mutex) ||
> -!mutex_trylock(>active.out_mutex))
> + while (!mutex_trylock(>active.in_mutex))
> + cpu_relax();
> + while (!mutex_trylock(>active.out_mutex))
>   cpu_relax();

Any reason you don't just use mutex_lock()?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-06 Thread Juergen Gross
On 06/11/17 15:51, Boris Ostrovsky wrote:
> On 11/06/2017 02:16 AM, Juergen Gross wrote:
>> On 03/11/17 20:00, Boris Ostrovsky wrote:
>>> On 11/03/2017 02:40 PM, Juergen Gross wrote:
>>>> On 03/11/17 19:35, Boris Ostrovsky wrote:
>>>>> On 11/03/2017 02:23 PM, Juergen Gross wrote:
>>>>>> On 03/11/17 19:19, Boris Ostrovsky wrote:
>>>>>>> On 11/03/2017 02:05 PM, Juergen Gross wrote:
>>>>>>>> So again the question: how to tell whether we are PVH or HVM in
>>>>>>>> init_hypervisor_platform()? ACPi tables are scanned way later...
>>>>>>> Can we make grub/OVMF append a boot option?
>>>>>>>
>>>>>>> Or set setup_header.hardware_subarch to something? We already have
>>>>>>> X86_SUBARCH_XEN but it is only used by PV.  Or we might be able to use
>>>>>>> hardware_subarch_data (will need to get a buy-in from x86 maintainers, I
>>>>>>> think).
>>>>>> But wouldn't this break the idea to reuse the native boot paths in
>>>>>> grub/OVMF without further modifications?
>>>>> WDYM? We will have to have some sort of a plugin in either one to build
>>>>> the zeropage anyway. So we'd set hardware_subarch there, in addition to
>>>>> other things like setting memory and such.
>>>> But isn't the zeropage already being built? I admit that setting subarch
>>>> isn't a big deal, but using another entry with a passed-through pvh
>>>> start struct isn't either...
>>> I don't follow, sorry. My understanding is that zeropage will be built
>>> by PVH-enlightened grub so part of this process would be setting the
>>> subarch bit.
>> My reasoning was based on Roger's remark:
>>
>> "OTOH if Linux is capable of booting from the native entry point inside
>> of a PVH container, we would only have to port OVMF and grub in order
>> to work inside of a PVH container, leaving the rest of the logic
>> untouched."
> 
> Right, and in my mind porting OVMF/grub includes creating proper zeropage.

Aah, okay. I reasoned on the assumption to just enable OVMF/grub to run
in PVH environment without touching the parts setting up anything for
the new kernel.

> BTW, another option might be to "type_of_loader = (9 << 4) | 0", which
> is what init_pvh_bootparams() does. In fact, whatever is done in the
> firmware should probably match what that routine does.

So it wouldn't be possible any longer to tell whether the kernel has
been booted directly or via grub. I don't like this. The loader type
is accessible via sysfs after all.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-06 Thread Juergen Gross
On 03/11/17 13:17, Roger Pau Monné wrote:
> On Fri, Nov 03, 2017 at 01:00:46PM +0100, Juergen Gross wrote:
>> On 29/09/17 17:51, Roger Pau Monné wrote:
>>> On Fri, Sep 29, 2017 at 03:33:58PM +0000, Juergen Gross wrote:
>>>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>>>> On Fri, Sep 29, 2017 at 02:46:53PM +, Juergen Gross wrote:
>>>>> Then, I also wonder whether it would make sense for this grub to load
>>>>> the kernel using the PVH entry point or the native entry point. Would
>>>>> it be possible to boot a Linux kernel up to the point where cpuid can
>>>>> be used inside of a PVH container?
>>>>
>>>> I don't think today's Linux allows that. This has been discussed
>>>> very thoroughly at the time Boris added PVH V2 support to the kernel.
>>>
>>> OK, I'm not going to insist on that, but my plans for FreeBSD is to
>>> make the native entry point capable of booting inside of a PVH
>>> container up to the point where cpuid (or whatever method) can be used
>>> to detect the environment.
>>
>> Looking more thoroughly into the Linux boot code I think this could
>> work for Linux, too. But only if we can tell PVH from HVM in the guest.
>> How would you do that in FreeBSD? Via flags in the boot params? This
>> would the have to be done in the boot loader (e.g. grub or OVMF).
> 
> My plan was not to differentiate between HVM and PVH, but rather to
> make use of the ACPI information in order to decide which devices are
> available and which are not inside of a PVH guest.
> 
> For example in the FADT "IA-PC Boot Architecture Flags" field for PVH
> we already set "VGA Not Present" and "CMOS RTC Not Present". There
> might be other flags/fields that must be set, but I would like to
> avoid having a CPUID bit or similar saying "PVH", because then Xen
> will be tied to always providing the same set of devices in PVH
> containers.

I looked through the xen_pvh_domain() use cases in the Linux kernel
again.

Maybe we can really manage to not need differentiating PVH from HVM
until ACPI table scan. We'd need another hook for Xen, but this should
be easy as KVM already has a hook where we'd need one. So this can be
made more general and we are fine.

I even think we can drop some of the PVH tests, as the PVH-specific
handling (e.g. for grant table initialization) should work for HVM, too.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Confused about mapped pages "struct page" updates

2017-11-06 Thread Juergen Gross
On 06/11/17 10:57, Waseem, Amna wrote:
> Hello All,
> 
> I am a little confused about mapping mechanism in Xen for page from DomU to 
> Dom0.
> 
> When Dom0 maps DomU page to its applied host_addr, Page table entries are 
> created by Xen hypervisor for mapping  applied host_addr vritual  address of 
> Dom0 to DomU physical page. The result is host_addr maps to DomU phsyical 
> page.
> 
> Now in network backend driver, virt_to_page macro is called on this mapped 
> host_addr. How does Dom0 gets struct page for the mapped DomU page in its 
> domain? Is Xen also updates mem_map array of Dom0 to create struct page for 
> the mapped page? Or Dom0 creates struct page for all the physical memory 
> including provided to DomU during its creation ?
> 
> Can anybody tell me how struct page for mapped pages from another domain gets 
> updated or created in DOm0?

Dom0 requests the mapping for a specific Dom0 physical address
(normally this is a page from the balloon driver, but in case no
ballooned page is available a kernel page is being allocated for
that purpose). So there always is a struct page available in Dom0.

host_addr above is part of Dom0 physical addresses. And the hypervisor
either modifies the Dom0 page table entry (in case of a PV Dom0 on
X86) or it just modifies the p2m mapping of the Dom0 physical address
(in case of ARM).


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-05 Thread Juergen Gross
On 03/11/17 20:00, Boris Ostrovsky wrote:
> On 11/03/2017 02:40 PM, Juergen Gross wrote:
>> On 03/11/17 19:35, Boris Ostrovsky wrote:
>>> On 11/03/2017 02:23 PM, Juergen Gross wrote:
>>>> On 03/11/17 19:19, Boris Ostrovsky wrote:
>>>>> On 11/03/2017 02:05 PM, Juergen Gross wrote:
>>>>>> So again the question: how to tell whether we are PVH or HVM in
>>>>>> init_hypervisor_platform()? ACPi tables are scanned way later...
>>>>> Can we make grub/OVMF append a boot option?
>>>>>
>>>>> Or set setup_header.hardware_subarch to something? We already have
>>>>> X86_SUBARCH_XEN but it is only used by PV.  Or we might be able to use
>>>>> hardware_subarch_data (will need to get a buy-in from x86 maintainers, I
>>>>> think).
>>>> But wouldn't this break the idea to reuse the native boot paths in
>>>> grub/OVMF without further modifications?
>>> WDYM? We will have to have some sort of a plugin in either one to build
>>> the zeropage anyway. So we'd set hardware_subarch there, in addition to
>>> other things like setting memory and such.
>> But isn't the zeropage already being built? I admit that setting subarch
>> isn't a big deal, but using another entry with a passed-through pvh
>> start struct isn't either...
> 
> I don't follow, sorry. My understanding is that zeropage will be built
> by PVH-enlightened grub so part of this process would be setting the
> subarch bit.

My reasoning was based on Roger's remark:

"OTOH if Linux is capable of booting from the native entry point inside
of a PVH container, we would only have to port OVMF and grub in order
to work inside of a PVH container, leaving the rest of the logic
untouched."


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 19:37, Roger Pau Monné wrote:
> On Fri, Nov 03, 2017 at 07:23:50PM +0100, Juergen Gross wrote:
>> On 03/11/17 19:19, Boris Ostrovsky wrote:
>>> On 11/03/2017 02:05 PM, Juergen Gross wrote:
>>>>
>>>> So again the question: how to tell whether we are PVH or HVM in
>>>> init_hypervisor_platform()? ACPi tables are scanned way later...
>>>
>>> Can we make grub/OVMF append a boot option?
>>>
>>> Or set setup_header.hardware_subarch to something? We already have
>>> X86_SUBARCH_XEN but it is only used by PV.  Or we might be able to use
>>> hardware_subarch_data (will need to get a buy-in from x86 maintainers, I
>>> think).
>>
>> But wouldn't this break the idea to reuse the native boot paths in
>> grub/OVMF without further modifications?
>>
>> I'd rather have a way to ask the hypervisor whether we are in PVH mode
>> (e.g. via CPUID or a hypercall to test for a devicemodel having itself
>> registered).
> 
> Keep in mind that from Xen's PoV PVH is just a HVM guest. Xen
> currently keeps a bitmap of the emulated devices that are available to
> a guest, but that's so far an internal field. We could consider
> exporting this on a cpuid leaf, but then we need to make it a fixed
> ABI.
> 
> Maybe we can make a list of platform devices that are not available on
> PVH and that Linux assumes to be present?

The main problem in Linux is that we have to know whether we are a HVM
or a PVH guest. Okay, we could scan for the Xen PCI-device to make the
distinction, but this is again rather late in the boot process.

Using implicit assumptions is normally a bad way to make such a
decision. I'd rather be either told by the boot loader I'm running as
PVH guest (and the boot loader does know it), or I want to ask the
hypervisor (which right now doesn't know), or I need another clear
distinction like the existence of the Xen PCI-device (which might be
a problem in the future for a PVH dom0 in a nested Xen environment).


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 19:35, Boris Ostrovsky wrote:
> On 11/03/2017 02:23 PM, Juergen Gross wrote:
>> On 03/11/17 19:19, Boris Ostrovsky wrote:
>>> On 11/03/2017 02:05 PM, Juergen Gross wrote:
>>>> So again the question: how to tell whether we are PVH or HVM in
>>>> init_hypervisor_platform()? ACPi tables are scanned way later...
>>> Can we make grub/OVMF append a boot option?
>>>
>>> Or set setup_header.hardware_subarch to something? We already have
>>> X86_SUBARCH_XEN but it is only used by PV.  Or we might be able to use
>>> hardware_subarch_data (will need to get a buy-in from x86 maintainers, I
>>> think).
>> But wouldn't this break the idea to reuse the native boot paths in
>> grub/OVMF without further modifications?
> 
> WDYM? We will have to have some sort of a plugin in either one to build
> the zeropage anyway. So we'd set hardware_subarch there, in addition to
> other things like setting memory and such.

But isn't the zeropage already being built? I admit that setting subarch
isn't a big deal, but using another entry with a passed-through pvh
start struct isn't either...


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Commit moratorium to staging

2017-11-03 Thread Juergen Gross
On 03/11/17 19:29, Roger Pau Monné wrote:
> On Fri, Nov 03, 2017 at 05:57:52PM +, George Dunlap wrote:
>> On 11/03/2017 02:52 PM, George Dunlap wrote:
>>> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
 On Thu, Nov 02, 2017 at 09:55:11AM +, Paul Durrant wrote:
> Hmm. I wonder whether the guest is actually healthy after the migrate. 
> One could imagine a situation where the storage device model (IDE in our 
> case I guess) gets stuck in some way but recovers after a timeout in the 
> guest storage stack. Thus, if you happen to try shut down while it is 
> still stuck Windows starts trying to shut down but can't. Try after the 
> timeout though and it can.
> In the past we did make attempts to support Windows without PV drivers in 
> XenServer but xenrt would never reliably pass VM lifecycle tests using 
> emulated devices. That was with qemu trad, but I wonder whether upstream 
> qemu is actually any better particularly if using older device models 
> such as IDE and RTL8139 (which are probably largely unmodified from trad).

 Since I've been looking into this for a couple of days, and found no
 solution I'm going to write what I've found so far:

  - The issue only affects Windows guests.
  - It only manifests itself when doing live migration, non-live
migration or save/resume work fine.
  - It affects all x86 hardware, the amount of migrations in order to
trigger it seems to depend on the hardware, but doing 20 migrations
reliably triggers it on all the hardware I've tested.
>>>
>>> Not good.
>>>
>>> You said that Windows reported that the login process failed somehow?
>>>
>>> Is it possible something bad is happening, like sending spurious page
>>> faults to the guest in logdirty mode?
>>>
>>> I wonder if we could reproduce something like it on Linux -- set a build
>>> going and start localhost migrating; a spurious page fault is likely to
>>> cause the build to fail.
>>
>> Well, with a looping xen-build going on in the guest, I've done 40 local
>> migrates with no problems yet.
>>
>> But Roger -- is this on emulated devices only, no PV drivers?
>>
>> That might be something worth looking at.
> 
> Yes, windows doesn't have PV devices. But save/restore and non-live
> migration seems fine, so it doesn't look to be related to devices, but
> rather to log-dirty or some other aspect of live-migration.

log-dirty for read-I/Os of emulated devices?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 19:19, Boris Ostrovsky wrote:
> On 11/03/2017 02:05 PM, Juergen Gross wrote:
>>
>> So again the question: how to tell whether we are PVH or HVM in
>> init_hypervisor_platform()? ACPi tables are scanned way later...
> 
> Can we make grub/OVMF append a boot option?
> 
> Or set setup_header.hardware_subarch to something? We already have
> X86_SUBARCH_XEN but it is only used by PV.  Or we might be able to use
> hardware_subarch_data (will need to get a buy-in from x86 maintainers, I
> think).

But wouldn't this break the idea to reuse the native boot paths in
grub/OVMF without further modifications?

I'd rather have a way to ask the hypervisor whether we are in PVH mode
(e.g. via CPUID or a hypercall to test for a devicemodel having itself
registered).

In case we need to set an indicator in zeropage I'd opt for reusing
X86_SUBARCH_XEN.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 16:27, Juergen Gross wrote:
> On 03/11/17 16:10, Boris Ostrovsky wrote:
>> On 11/03/2017 10:59 AM, Juergen Gross wrote:
>>> On 03/11/17 15:36, Boris Ostrovsky wrote:
>>>> On 11/03/2017 10:24 AM, Juergen Gross wrote:
>>>>> On 03/11/17 15:07, Roger Pau Monné wrote:
>>>>>> On Fri, Nov 03, 2017 at 01:50:11PM +0100, Juergen Gross wrote:
>>>>>>> On 03/11/17 13:17, Roger Pau Monné wrote:
>>>>>>>> On Fri, Nov 03, 2017 at 01:00:46PM +0100, Juergen Gross wrote:
>>>>>>>>> On 29/09/17 17:51, Roger Pau Monné wrote:
>>>>>>>>>> On Fri, Sep 29, 2017 at 03:33:58PM +, Juergen Gross wrote:
>>>>>>>>>>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>>>>>>>>>>> On Fri, Sep 29, 2017 at 02:46:53PM +, Juergen Gross wrote:
>>>>>>>>>>>> Then, I also wonder whether it would make sense for this grub to 
>>>>>>>>>>>> load
>>>>>>>>>>>> the kernel using the PVH entry point or the native entry point. 
>>>>>>>>>>>> Would
>>>>>>>>>>>> it be possible to boot a Linux kernel up to the point where cpuid 
>>>>>>>>>>>> can
>>>>>>>>>>>> be used inside of a PVH container?
>>>>>>>>>>> I don't think today's Linux allows that. This has been discussed
>>>>>>>>>>> very thoroughly at the time Boris added PVH V2 support to the 
>>>>>>>>>>> kernel.
>>>>>>>>>> OK, I'm not going to insist on that, but my plans for FreeBSD is to
>>>>>>>>>> make the native entry point capable of booting inside of a PVH
>>>>>>>>>> container up to the point where cpuid (or whatever method) can be 
>>>>>>>>>> used
>>>>>>>>>> to detect the environment.
>>>>>>>>> Looking more thoroughly into the Linux boot code I think this could
>>>>>>>>> work for Linux, too. But only if we can tell PVH from HVM in the 
>>>>>>>>> guest.
>>>>>>>>> How would you do that in FreeBSD? Via flags in the boot params? This
>>>>>>>>> would the have to be done in the boot loader (e.g. grub or OVMF).
>>>>>>>> My plan was not to differentiate between HVM and PVH, but rather to
>>>>>>>> make use of the ACPI information in order to decide which devices are
>>>>>>>> available and which are not inside of a PVH guest.
>>>>>>>>
>>>>>>>> For example in the FADT "IA-PC Boot Architecture Flags" field for PVH
>>>>>>>> we already set "VGA Not Present" and "CMOS RTC Not Present". There
>>>>>>>> might be other flags/fields that must be set, but I would like to
>>>>>>>> avoid having a CPUID bit or similar saying "PVH", because then Xen
>>>>>>>> will be tied to always providing the same set of devices in PVH
>>>>>>>> containers.
>>>>>>> Why? This would depend on the semantics tied to the flag. It could just
>>>>>>> mean "don't assume availability of legacy stuff" (e.g. BIOS calls).
>>>>>>>
>>>>>>> Linux would have a problem with the ACPI approach as it would try BIOS
>>>>>>> calls way before it is initializing its ACPI handling. So in Linux I'd
>>>>>>> need another way to tell I'm running in PVH mode, e.g. a "no legacy"
>>>>>>> bit in the Xen HVM cpuid leaf.
>>>>>> If you are booted from the PVH entry point, there's no BIOS or UEFI
>>>>>> (ie: no firmware), if you are booted from the BIOS entry point there's
>>>>>> a BIOS and the same applies to UEFI. How does Linux differentiate
>>>>>> whether it's booted from BIOS or UEFI?
>>>>> They use different entries.
>>>> In fact, we had a discussion with Matt Fleming (Linux EFI maintainer) to
>>>> see if we can use EFI entry point to also be able to boot PVH guest but
>>>> found some issues with that approach, which is why we ended up with a
>>>> dedicated PVH entry point.
>>>>
>>>> I am curious though, Juergen --- what do we need besides zeropage to
>>>> allow us to boot PVH from startup_64?
>>> Oh, you are right. I managed to get lost in the early boot paths.
>>>
>>> Only setting up the hyperpage seems to be missing, but this should be
>>> doable. And setting xen_pvh, of course.
>>
>> That last part was actually my question --- do we need to have xen_pvh
>> set before we get to xen-specific code for the first time (which I think
>> is init_hypervisor_platform()) from startup_64?
>>
>> Because if we do --- who will set it?
> 
> This should be easily testable: Just set an internal indicator in
> xen_prepare_pvh() and copy that to xen_pvh in init_hypervisor_platform()
> and see whether the kernel still comes up in PVH mode.
> 
> The same can be done with setting up the hypercall page.
> 
> Just building a kernel trying that...

Okay, setting xen_pvh late is working. I can't omit setting up the
hypercall_page right now, of course, as then I wouldn't get the memory
map. But this shouldn't be a problem for the grub/OVMF case as those
would setup the memory map for the guest in zeropage.

So again the question: how to tell whether we are PVH or HVM in
init_hypervisor_platform()? ACPi tables are scanned way later...


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 16:10, Boris Ostrovsky wrote:
> On 11/03/2017 10:59 AM, Juergen Gross wrote:
>> On 03/11/17 15:36, Boris Ostrovsky wrote:
>>> On 11/03/2017 10:24 AM, Juergen Gross wrote:
>>>> On 03/11/17 15:07, Roger Pau Monné wrote:
>>>>> On Fri, Nov 03, 2017 at 01:50:11PM +0100, Juergen Gross wrote:
>>>>>> On 03/11/17 13:17, Roger Pau Monné wrote:
>>>>>>> On Fri, Nov 03, 2017 at 01:00:46PM +0100, Juergen Gross wrote:
>>>>>>>> On 29/09/17 17:51, Roger Pau Monné wrote:
>>>>>>>>> On Fri, Sep 29, 2017 at 03:33:58PM +, Juergen Gross wrote:
>>>>>>>>>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>>>>>>>>>> On Fri, Sep 29, 2017 at 02:46:53PM +, Juergen Gross wrote:
>>>>>>>>>>> Then, I also wonder whether it would make sense for this grub to 
>>>>>>>>>>> load
>>>>>>>>>>> the kernel using the PVH entry point or the native entry point. 
>>>>>>>>>>> Would
>>>>>>>>>>> it be possible to boot a Linux kernel up to the point where cpuid 
>>>>>>>>>>> can
>>>>>>>>>>> be used inside of a PVH container?
>>>>>>>>>> I don't think today's Linux allows that. This has been discussed
>>>>>>>>>> very thoroughly at the time Boris added PVH V2 support to the kernel.
>>>>>>>>> OK, I'm not going to insist on that, but my plans for FreeBSD is to
>>>>>>>>> make the native entry point capable of booting inside of a PVH
>>>>>>>>> container up to the point where cpuid (or whatever method) can be used
>>>>>>>>> to detect the environment.
>>>>>>>> Looking more thoroughly into the Linux boot code I think this could
>>>>>>>> work for Linux, too. But only if we can tell PVH from HVM in the guest.
>>>>>>>> How would you do that in FreeBSD? Via flags in the boot params? This
>>>>>>>> would the have to be done in the boot loader (e.g. grub or OVMF).
>>>>>>> My plan was not to differentiate between HVM and PVH, but rather to
>>>>>>> make use of the ACPI information in order to decide which devices are
>>>>>>> available and which are not inside of a PVH guest.
>>>>>>>
>>>>>>> For example in the FADT "IA-PC Boot Architecture Flags" field for PVH
>>>>>>> we already set "VGA Not Present" and "CMOS RTC Not Present". There
>>>>>>> might be other flags/fields that must be set, but I would like to
>>>>>>> avoid having a CPUID bit or similar saying "PVH", because then Xen
>>>>>>> will be tied to always providing the same set of devices in PVH
>>>>>>> containers.
>>>>>> Why? This would depend on the semantics tied to the flag. It could just
>>>>>> mean "don't assume availability of legacy stuff" (e.g. BIOS calls).
>>>>>>
>>>>>> Linux would have a problem with the ACPI approach as it would try BIOS
>>>>>> calls way before it is initializing its ACPI handling. So in Linux I'd
>>>>>> need another way to tell I'm running in PVH mode, e.g. a "no legacy"
>>>>>> bit in the Xen HVM cpuid leaf.
>>>>> If you are booted from the PVH entry point, there's no BIOS or UEFI
>>>>> (ie: no firmware), if you are booted from the BIOS entry point there's
>>>>> a BIOS and the same applies to UEFI. How does Linux differentiate
>>>>> whether it's booted from BIOS or UEFI?
>>>> They use different entries.
>>> In fact, we had a discussion with Matt Fleming (Linux EFI maintainer) to
>>> see if we can use EFI entry point to also be able to boot PVH guest but
>>> found some issues with that approach, which is why we ended up with a
>>> dedicated PVH entry point.
>>>
>>> I am curious though, Juergen --- what do we need besides zeropage to
>>> allow us to boot PVH from startup_64?
>> Oh, you are right. I managed to get lost in the early boot paths.
>>
>> Only setting up the hyperpage seems to be missing, but this should be
>> doable. And setting xen_pvh, of course.
> 
> That last part was actually my question --- do we need to have xen_pvh
> set before we get to xen-specific code for the first time (which I think
> is init_hypervisor_platform()) from startup_64?
> 
> Because if we do --- who will set it?

This should be easily testable: Just set an internal indicator in
xen_prepare_pvh() and copy that to xen_pvh in init_hypervisor_platform()
and see whether the kernel still comes up in PVH mode.

The same can be done with setting up the hypercall page.

Just building a kernel trying that...


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 15:36, Boris Ostrovsky wrote:
> On 11/03/2017 10:24 AM, Juergen Gross wrote:
>> On 03/11/17 15:07, Roger Pau Monné wrote:
>>> On Fri, Nov 03, 2017 at 01:50:11PM +0100, Juergen Gross wrote:
>>>> On 03/11/17 13:17, Roger Pau Monné wrote:
>>>>> On Fri, Nov 03, 2017 at 01:00:46PM +0100, Juergen Gross wrote:
>>>>>> On 29/09/17 17:51, Roger Pau Monné wrote:
>>>>>>> On Fri, Sep 29, 2017 at 03:33:58PM +, Juergen Gross wrote:
>>>>>>>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>>>>>>>> On Fri, Sep 29, 2017 at 02:46:53PM +, Juergen Gross wrote:
>>>>>>>>> Then, I also wonder whether it would make sense for this grub to load
>>>>>>>>> the kernel using the PVH entry point or the native entry point. Would
>>>>>>>>> it be possible to boot a Linux kernel up to the point where cpuid can
>>>>>>>>> be used inside of a PVH container?
>>>>>>>> I don't think today's Linux allows that. This has been discussed
>>>>>>>> very thoroughly at the time Boris added PVH V2 support to the kernel.
>>>>>>> OK, I'm not going to insist on that, but my plans for FreeBSD is to
>>>>>>> make the native entry point capable of booting inside of a PVH
>>>>>>> container up to the point where cpuid (or whatever method) can be used
>>>>>>> to detect the environment.
>>>>>> Looking more thoroughly into the Linux boot code I think this could
>>>>>> work for Linux, too. But only if we can tell PVH from HVM in the guest.
>>>>>> How would you do that in FreeBSD? Via flags in the boot params? This
>>>>>> would the have to be done in the boot loader (e.g. grub or OVMF).
>>>>> My plan was not to differentiate between HVM and PVH, but rather to
>>>>> make use of the ACPI information in order to decide which devices are
>>>>> available and which are not inside of a PVH guest.
>>>>>
>>>>> For example in the FADT "IA-PC Boot Architecture Flags" field for PVH
>>>>> we already set "VGA Not Present" and "CMOS RTC Not Present". There
>>>>> might be other flags/fields that must be set, but I would like to
>>>>> avoid having a CPUID bit or similar saying "PVH", because then Xen
>>>>> will be tied to always providing the same set of devices in PVH
>>>>> containers.
>>>> Why? This would depend on the semantics tied to the flag. It could just
>>>> mean "don't assume availability of legacy stuff" (e.g. BIOS calls).
>>>>
>>>> Linux would have a problem with the ACPI approach as it would try BIOS
>>>> calls way before it is initializing its ACPI handling. So in Linux I'd
>>>> need another way to tell I'm running in PVH mode, e.g. a "no legacy"
>>>> bit in the Xen HVM cpuid leaf.
>>> If you are booted from the PVH entry point, there's no BIOS or UEFI
>>> (ie: no firmware), if you are booted from the BIOS entry point there's
>>> a BIOS and the same applies to UEFI. How does Linux differentiate
>>> whether it's booted from BIOS or UEFI?
>> They use different entries.
> 
> In fact, we had a discussion with Matt Fleming (Linux EFI maintainer) to
> see if we can use EFI entry point to also be able to boot PVH guest but
> found some issues with that approach, which is why we ended up with a
> dedicated PVH entry point.
> 
> I am curious though, Juergen --- what do we need besides zeropage to
> allow us to boot PVH from startup_64?

Oh, you are right. I managed to get lost in the early boot paths.

Only setting up the hyperpage seems to be missing, but this should be
doable. And setting xen_pvh, of course.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 15:07, Roger Pau Monné wrote:
> On Fri, Nov 03, 2017 at 01:50:11PM +0100, Juergen Gross wrote:
>> On 03/11/17 13:17, Roger Pau Monné wrote:
>>> On Fri, Nov 03, 2017 at 01:00:46PM +0100, Juergen Gross wrote:
>>>> On 29/09/17 17:51, Roger Pau Monné wrote:
>>>>> On Fri, Sep 29, 2017 at 03:33:58PM +, Juergen Gross wrote:
>>>>>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>>>>>> On Fri, Sep 29, 2017 at 02:46:53PM +, Juergen Gross wrote:
>>>>>>> Then, I also wonder whether it would make sense for this grub to load
>>>>>>> the kernel using the PVH entry point or the native entry point. Would
>>>>>>> it be possible to boot a Linux kernel up to the point where cpuid can
>>>>>>> be used inside of a PVH container?
>>>>>>
>>>>>> I don't think today's Linux allows that. This has been discussed
>>>>>> very thoroughly at the time Boris added PVH V2 support to the kernel.
>>>>>
>>>>> OK, I'm not going to insist on that, but my plans for FreeBSD is to
>>>>> make the native entry point capable of booting inside of a PVH
>>>>> container up to the point where cpuid (or whatever method) can be used
>>>>> to detect the environment.
>>>>
>>>> Looking more thoroughly into the Linux boot code I think this could
>>>> work for Linux, too. But only if we can tell PVH from HVM in the guest.
>>>> How would you do that in FreeBSD? Via flags in the boot params? This
>>>> would the have to be done in the boot loader (e.g. grub or OVMF).
>>>
>>> My plan was not to differentiate between HVM and PVH, but rather to
>>> make use of the ACPI information in order to decide which devices are
>>> available and which are not inside of a PVH guest.
>>>
>>> For example in the FADT "IA-PC Boot Architecture Flags" field for PVH
>>> we already set "VGA Not Present" and "CMOS RTC Not Present". There
>>> might be other flags/fields that must be set, but I would like to
>>> avoid having a CPUID bit or similar saying "PVH", because then Xen
>>> will be tied to always providing the same set of devices in PVH
>>> containers.
>>
>> Why? This would depend on the semantics tied to the flag. It could just
>> mean "don't assume availability of legacy stuff" (e.g. BIOS calls).
>>
>> Linux would have a problem with the ACPI approach as it would try BIOS
>> calls way before it is initializing its ACPI handling. So in Linux I'd
>> need another way to tell I'm running in PVH mode, e.g. a "no legacy"
>> bit in the Xen HVM cpuid leaf.
> 
> If you are booted from the PVH entry point, there's no BIOS or UEFI
> (ie: no firmware), if you are booted from the BIOS entry point there's
> a BIOS and the same applies to UEFI. How does Linux differentiate
> whether it's booted from BIOS or UEFI?

They use different entries.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 03/11/17 13:17, Roger Pau Monné wrote:
> On Fri, Nov 03, 2017 at 01:00:46PM +0100, Juergen Gross wrote:
>> On 29/09/17 17:51, Roger Pau Monné wrote:
>>> On Fri, Sep 29, 2017 at 03:33:58PM +0000, Juergen Gross wrote:
>>>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>>>> On Fri, Sep 29, 2017 at 02:46:53PM +, Juergen Gross wrote:
>>>>> Then, I also wonder whether it would make sense for this grub to load
>>>>> the kernel using the PVH entry point or the native entry point. Would
>>>>> it be possible to boot a Linux kernel up to the point where cpuid can
>>>>> be used inside of a PVH container?
>>>>
>>>> I don't think today's Linux allows that. This has been discussed
>>>> very thoroughly at the time Boris added PVH V2 support to the kernel.
>>>
>>> OK, I'm not going to insist on that, but my plans for FreeBSD is to
>>> make the native entry point capable of booting inside of a PVH
>>> container up to the point where cpuid (or whatever method) can be used
>>> to detect the environment.
>>
>> Looking more thoroughly into the Linux boot code I think this could
>> work for Linux, too. But only if we can tell PVH from HVM in the guest.
>> How would you do that in FreeBSD? Via flags in the boot params? This
>> would the have to be done in the boot loader (e.g. grub or OVMF).
> 
> My plan was not to differentiate between HVM and PVH, but rather to
> make use of the ACPI information in order to decide which devices are
> available and which are not inside of a PVH guest.
> 
> For example in the FADT "IA-PC Boot Architecture Flags" field for PVH
> we already set "VGA Not Present" and "CMOS RTC Not Present". There
> might be other flags/fields that must be set, but I would like to
> avoid having a CPUID bit or similar saying "PVH", because then Xen
> will be tied to always providing the same set of devices in PVH
> containers.

Why? This would depend on the semantics tied to the flag. It could just
mean "don't assume availability of legacy stuff" (e.g. BIOS calls).

Linux would have a problem with the ACPI approach as it would try BIOS
calls way before it is initializing its ACPI handling. So in Linux I'd
need another way to tell I'm running in PVH mode, e.g. a "no legacy"
bit in the Xen HVM cpuid leaf.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Xen PVH support in grub2

2017-11-03 Thread Juergen Gross
On 29/09/17 17:51, Roger Pau Monné wrote:
> On Fri, Sep 29, 2017 at 03:33:58PM +0000, Juergen Gross wrote:
>> On 29/09/17 17:24, Roger Pau Monné wrote:
>>> On Fri, Sep 29, 2017 at 02:46:53PM +0000, Juergen Gross wrote:
>>> Then, I also wonder whether it would make sense for this grub to load
>>> the kernel using the PVH entry point or the native entry point. Would
>>> it be possible to boot a Linux kernel up to the point where cpuid can
>>> be used inside of a PVH container?
>>
>> I don't think today's Linux allows that. This has been discussed
>> very thoroughly at the time Boris added PVH V2 support to the kernel.
> 
> OK, I'm not going to insist on that, but my plans for FreeBSD is to
> make the native entry point capable of booting inside of a PVH
> container up to the point where cpuid (or whatever method) can be used
> to detect the environment.

Looking more thoroughly into the Linux boot code I think this could
work for Linux, too. But only if we can tell PVH from HVM in the guest.
How would you do that in FreeBSD? Via flags in the boot params? This
would the have to be done in the boot loader (e.g. grub or OVMF).


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: remove redundant check for irq >= 0

2017-11-03 Thread Juergen Gross
On 03/11/17 10:20, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> This is a moot point, but irq is always less than zero at the label
> out_error, so the check for irq >= 0 is redundant and can be removed.
> 
> Detected by CoverityScan, CID#1460371 ("Logically dead code")
> 
> Fixes: cb1c7d9bbc87 ("xen/pvcalls: implement connect command")
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls: fix unsigned less than zero error check

2017-11-03 Thread Juergen Gross
On 03/11/17 09:42, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> The check on bedata->ref is never true because ref is an unsigned
> integer. Fix this by assigning signed int ret to the return of the
> call to gnttab_claim_grant_reference so the -ve return can be checked.
> 
> Detected by CoverityScan, CID#1460358 ("Unsigned compared against 0")
> 
> Fixes: 219681909913 ("xen/pvcalls: connect to the backend")
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/pvcalls-front: mark expected switch fall-through

2017-11-03 Thread Juergen Gross
On 02/11/17 19:51, Gustavo A. R. Silva wrote:
> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Notice that in this particular case I placed the "fall through" comment
> on its own line, which is what GCC is expecting to find.
> 
> Signed-off-by: Gustavo A. R. Silva <garsi...@embeddedor.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen: xenbus_probe_frontend: mark expected switch fall-throughs

2017-11-03 Thread Juergen Gross
On 02/11/17 19:41, Gustavo A. R. Silva wrote:
> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Addresses-Coverity-ID: 146562
> Addresses-Coverity-ID: 146563
> Signed-off-by: Gustavo A. R. Silva <garsi...@embeddedor.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen/time: Return -ENODEV from xen_get_wallclock()

2017-11-03 Thread Juergen Gross
On 02/11/17 23:18, Boris Ostrovsky wrote:
> For any other error sync_cmos_clock() will reschedule itself
> every second or so, for no good reason.
> 
> Suggested-by: Paolo Bonzini <pbonz...@redhat.com>
> Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] USB Passthrough: Incorrect device detected in Domain U

2017-11-02 Thread Juergen Gross
On 02/11/17 08:37, Rakesh Awanti wrote:
>>> Hello,
>>> 
>>> I'm trying USB passthrough on x86 and ARM platforms. I've taken the USB
>>> frontend and backend code from 
> 
>>So from where?
> 
> 
> https://lists.xen.org/archives/html/xen-devel/2015-02/msg03245.html

The backend driver was discarded as its funcionality has been put into
qemu. In case you want to continue using the kernel based variant you
are basically on your own.

The most recent frontend driver is available from

https://lists.xen.org/archives/html/xen-devel/2015-06/msg03436.html


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin

2017-11-02 Thread Juergen Gross
On 02/11/17 14:25, Waiman Long wrote:
> On 11/01/2017 06:01 PM, Boris Ostrovsky wrote:
>> On 11/01/2017 04:58 PM, Waiman Long wrote:
>>> +/* TODO: To be removed in a future kernel version */
>>>  static __init int xen_parse_nopvspin(char *arg)
>>>  {
>>> -   xen_pvspin = false;
>>> +   pr_warn("xen_nopvspin is deprecated, replace it with 
>>> \"pvlock_type=queued\"!\n");
>>> +   if (!pv_spinlock_type)
>>> +   pv_spinlock_type = locktype_queued;
>> Since we currently end up using unfair locks and because you are
>> deprecating xen_nopvspin I wonder whether it would be better to set this
>> to locktype_unfair so that current behavior doesn't change. (Sorry, I
>> haven't responded to your earlier message before you posted this). Juergen?
> 
> I think the latest patch from Juergen in tip is to use native qspinlock
> when xen_nopvspin is specified. Right? That is why I made the current
> choice. I can certainly change to unfair if it is what you guys want.

No, when we are keeping xen_nopvspin (even as deprecated) it should
behave as designed, so locktype_queued is correct.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] d0v0 Unhandled general protection fault with 4.9.x on brand new hardware

2017-11-02 Thread Juergen Gross
On 02/11/17 08:56, Francesco De Francesco wrote:
> Hi everyone,
> 
> I need to address an issue that prevents me from running Xen on new
> hardware. It's an HPE DL360 Gen10 with double Xeon Silver 4108 CPU and
> 256GB ECC DDR4 RDIMM. It happens with both CentOS6 and CentOS7.
> 
> When I try to boot with Xen kernel 4.9.58-29.el6, I get the following
> error at boot time (I can read it from the serial console):
> 
> (XEN) Brought up 32 CPUs
> (XEN) ACPI sleep modes: S3
> (XEN) VPMU: disabled
> (XEN) mcheck_poll: Machine check polling timer started.
> (XEN) Dom0 has maximum 1240 PIRQs
> (XEN) NX (Execute Disable) protection active
> (XEN) *** LOADING DOMAIN 0 ***
> (XEN)  Xen  kernel: 64-bit, lsb, compat32
> (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x26a
> (XEN) PHYSICAL MEMORY ARRANGEMENT:
> (XEN)  Dom0 alloc.:   003fdc00->003fe000 (1012299 pages
> to be allocated)
> (XEN)  Init. ramdisk: 00403b04b000->00403fdff200
> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> (XEN)  Loaded kernel: 8100->826a
> (XEN)  Init. ramdisk: ->
> (XEN)  Phys-Mach map: 0080->00800080
> (XEN)  Start info:    826a->826a04b4
> (XEN)  Page tables:   826a1000->826b8000
> (XEN)  Boot stack:    826b8000->826b9000
> (XEN)  TOTAL:         8000->8280
> (XEN)  ENTRY ADDRESS: 821a9180
> (XEN) Dom0 has maximum 32 VCPUs
> (XEN) Scrubbing Free RAM on 2 nodes using 16 CPUs
> (XEN)
> .done.
> (XEN) Initial low memory virq threshold set at 0x4000 pages.
> (XEN) Std. Loglevel: All
> (XEN) Guest Loglevel: All
> (XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch
> input to Xen)
> (XEN) Freed 300kB init memory.
> mapping kernel into physical memory
> about to get started...
> (XEN) d0v0 Unhandled general protection fault fault/trap [#13, ec=]
> (XEN) domain_crash_sync called from entry.S: fault at 82d08022f983
> create_bounce_frame+0x12b/0x13a
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) [ Xen-4.6.6-3.el6  x86_64  debug=n  Not tainted ]
> (XEN) CPU:    0
> (XEN) RIP:    e033:[]
> (XEN) RFLAGS: 0246   EM: 1   CONTEXT: pv guest (d0v0)
> (XEN) rax: 02ff   rbx: 8217a1a0   rcx: 
> (XEN) rdx:    rsi: 02ff   rdi: 00042660
> (XEN) rbp: 82003dc8   rsp: 82003d80   r8:  82003e0c
> (XEN) r9:  82003e08   r10:    r11: 
> (XEN) r12: 82003e04   r13: 82003e00   r14: 82003dfc
> (XEN) r15: 82003df8   cr0: 80050033   cr4: 003526e0
> (XEN) cr3: 003fde007000   cr2: 
> (XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
> (XEN) Guest stack trace from rsp=82003d80:
> (XEN)       8103cf38
> (XEN)    0001e030 00010046 82003dc8 e02b
> (XEN)    8103cf28 82003e48 821bbd3e 82199890
> (XEN)    8219a090 82199890 8219a090 82003e38
> (XEN)    00100800 0a88 02ff0240 8217a1a0
> (XEN)    8217a1a0 82673000 82003f20 
> (XEN)     82003e78 821bb684 0100
> (XEN)    037f82673000 82003f20 0100 82003e88
> (XEN)    821bc371 82003e98 821bc3a7 82003ef8
> (XEN)    821b73e7 0010 82003f08 82003ec8
> (XEN)    82003e88 8114b100  
> (XEN)       82003f28
> (XEN)    821aa0c6   b013b3f5b0133a3e
> (XEN)    821a97ac 82003f38 821a9386 82003ff8
> (XEN)    821b0dc6 03010032 0005 
> (XEN)       
> (XEN)       
> (XEN)       
> (XEN)       ffd83a831fc9cbf5
> (XEN)    0005065400100800 0001  
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

So your dom0 crashed very early before trap handlers have been set up.

Can you please try to find matching source lines for suspected kernel
addresses in above stack trace and for the guest's RIP: for all
addresses starting with "81" you should try the addr2line tool
to obtain that information. This 

[Xen-devel] [PATCH v2 4/5] xen: update arch/x86/include/asm/xen/cpuid.h

2017-11-02 Thread Juergen Gross
Update arch/x86/include/asm/xen/cpuid.h from the Xen tree to get newest
definitions.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/x86/include/asm/xen/cpuid.h | 42 ++--
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/xen/cpuid.h b/arch/x86/include/asm/xen/cpuid.h
index 3bdd10d71223..a9630104f1c4 100644
--- a/arch/x86/include/asm/xen/cpuid.h
+++ b/arch/x86/include/asm/xen/cpuid.h
@@ -74,21 +74,43 @@
 #define XEN_CPUID_FEAT1_MMU_PT_UPDATE_PRESERVE_AD  (1u<<0)
 
 /*
+ * Leaf 4 (0x4x03)
+ * Sub-leaf 0: EAX: bit 0: emulated tsc
+ *  bit 1: host tsc is known to be reliable
+ *  bit 2: RDTSCP instruction available
+ * EBX: tsc_mode: 0=default (emulate if necessary), 1=emulate,
+ *2=no emulation, 3=no emulation + TSC_AUX support
+ * ECX: guest tsc frequency in kHz
+ * EDX: guest tsc incarnation (migration count)
+ * Sub-leaf 1: EAX: tsc offset low part
+ * EBX: tsc offset high part
+ * ECX: multiplicator for tsc->ns conversion
+ * EDX: shift amount for tsc->ns conversion
+ * Sub-leaf 2: EAX: host tsc frequency in kHz
+ */
+
+/*
  * Leaf 5 (0x4x04)
  * HVM-specific features
- * EAX: Features
- * EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
+ * Sub-leaf 0: EAX: Features
+ * Sub-leaf 0: EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
  */
-
-/* Virtualized APIC registers */
-#define XEN_HVM_CPUID_APIC_ACCESS_VIRT (1u << 0)
-/* Virtualized x2APIC accesses */
-#define XEN_HVM_CPUID_X2APIC_VIRT  (1u << 1)
+#define XEN_HVM_CPUID_APIC_ACCESS_VIRT (1u << 0) /* Virtualized APIC registers 
*/
+#define XEN_HVM_CPUID_X2APIC_VIRT  (1u << 1) /* Virtualized x2APIC 
accesses */
 /* Memory mapped from other domains has valid IOMMU entries */
 #define XEN_HVM_CPUID_IOMMU_MAPPINGS   (1u << 2)
-/* vcpu id is present in EBX */
-#define XEN_HVM_CPUID_VCPU_ID_PRESENT  (1u << 3)
+#define XEN_HVM_CPUID_VCPU_ID_PRESENT  (1u << 3) /* vcpu id is present in EBX 
*/
+
+/*
+ * Leaf 6 (0x4x05)
+ * PV-specific parameters
+ * Sub-leaf 0: EAX: max available sub-leaf
+ * Sub-leaf 0: EBX: bits 0-7: max machine address width
+ */
+
+/* Max. address width in bits taking memory hotplug into account. */
+#define XEN_CPUID_MACHINE_ADDRESS_WIDTH_MASK (0xffu << 0)
 
-#define XEN_CPUID_MAX_NUM_LEAVES 4
+#define XEN_CPUID_MAX_NUM_LEAVES 5
 
 #endif /* __XEN_PUBLIC_ARCH_X86_CPUID_H__ */
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 2/5] xen: limit grant v2 interface to the v1 functionality

2017-11-02 Thread Juergen Gross
As there is currently no user for sub-page grants or transient grants
remove that functionality. This at once makes it possible to switch
from grant v2 to grant v1 without restrictions, as there is no loss of
functionality other than the limited frame number width related to
the switch.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V2:
- remove update_trans_entry() from gnttab_ops (Boris Ostrovsky)
---
 drivers/xen/grant-table.c | 150 --
 include/xen/grant_table.h |  25 
 2 files changed, 175 deletions(-)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 65c4bdb0b463..db6a1b9efd11 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -122,29 +122,6 @@ struct gnttab_ops {
 * by bit operations.
 */
int (*query_foreign_access)(grant_ref_t ref);
-   /*
-* Grant a domain to access a range of bytes within the page referred by
-* an available grant entry. Ref parameter is reference of a grant entry
-* which will be sub-page accessed, domid is id of grantee domain, frame
-* is frame address of subpage grant, flags is grant type and flag
-* information, page_off is offset of the range of bytes, and length is
-* length of bytes to be accessed.
-*/
-   void (*update_subpage_entry)(grant_ref_t ref, domid_t domid,
-unsigned long frame, int flags,
-unsigned page_off, unsigned length);
-   /*
-* Redirect an available grant entry on domain A to another grant
-* reference of domain B, then allow domain C to use grant reference
-* of domain B transitively. Ref parameter is an available grant entry
-* reference on domain A, domid is id of domain C which accesses grant
-* entry transitively, flags is grant type and flag information,
-* trans_domid is id of domain B whose grant entry is finally accessed
-* transitively, trans_gref is grant entry transitive reference of
-* domain B.
-*/
-   void (*update_trans_entry)(grant_ref_t ref, domid_t domid, int flags,
-  domid_t trans_domid, grant_ref_t trans_gref);
 };
 
 struct unmap_refs_callback_data {
@@ -292,122 +269,6 @@ int gnttab_grant_foreign_access(domid_t domid, unsigned 
long frame,
 }
 EXPORT_SYMBOL_GPL(gnttab_grant_foreign_access);
 
-static void gnttab_update_subpage_entry_v2(grant_ref_t ref, domid_t domid,
-  unsigned long frame, int flags,
-  unsigned page_off, unsigned length)
-{
-   gnttab_shared.v2[ref].sub_page.frame = frame;
-   gnttab_shared.v2[ref].sub_page.page_off = page_off;
-   gnttab_shared.v2[ref].sub_page.length = length;
-   gnttab_shared.v2[ref].hdr.domid = domid;
-   wmb();
-   gnttab_shared.v2[ref].hdr.flags =
-   GTF_permit_access | GTF_sub_page | flags;
-}
-
-int gnttab_grant_foreign_access_subpage_ref(grant_ref_t ref, domid_t domid,
-   unsigned long frame, int flags,
-   unsigned page_off,
-   unsigned length)
-{
-   if (flags & (GTF_accept_transfer | GTF_reading |
-GTF_writing | GTF_transitive))
-   return -EPERM;
-
-   if (gnttab_interface->update_subpage_entry == NULL)
-   return -ENOSYS;
-
-   gnttab_interface->update_subpage_entry(ref, domid, frame, flags,
-  page_off, length);
-
-   return 0;
-}
-EXPORT_SYMBOL_GPL(gnttab_grant_foreign_access_subpage_ref);
-
-int gnttab_grant_foreign_access_subpage(domid_t domid, unsigned long frame,
-   int flags, unsigned page_off,
-   unsigned length)
-{
-   int ref, rc;
-
-   ref = get_free_entries(1);
-   if (unlikely(ref < 0))
-   return -ENOSPC;
-
-   rc = gnttab_grant_foreign_access_subpage_ref(ref, domid, frame, flags,
-page_off, length);
-   if (rc < 0) {
-   put_free_entry(ref);
-   return rc;
-   }
-
-   return ref;
-}
-EXPORT_SYMBOL_GPL(gnttab_grant_foreign_access_subpage);
-
-bool gnttab_subpage_grants_available(void)
-{
-   return gnttab_interface->update_subpage_entry != NULL;
-}
-EXPORT_SYMBOL_GPL(gnttab_subpage_grants_available);
-
-static void gnttab_update_trans_entry_v2(grant_ref_t ref, domid_t domid,
-int flags, domid_t trans_domid,
-grant_ref_t trans_gref)
-{
-   gnttab_shared.v2[ref].transitive.trans_domid = trans_domid;
-   gnttab_shared.v2[ref].tra

[Xen-devel] [PATCH v2 5/5] xen: select grant interface version

2017-11-02 Thread Juergen Gross
Grant v2 will be needed in cases where a frame number in the grant
table can exceed 32 bits. For PV guests this is a host feature, while
for HVM guests this is a guest feature.

So select grant v2 in case frame numbers can be larger than 32 bits
and grant v1 else.

For testing purposes add a way to specify the grant interface version
via a boot parameter.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V2:
- use cpuid on pv and max_possible_pfn on hvm for version select
---
 drivers/xen/grant-table.c | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 573e7fe9b147..fa152832e0e6 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -33,6 +33,7 @@
 
 #define pr_fmt(fmt) "xen:" KBUILD_MODNAME ": " fmt
 
+#include 
 #include 
 #include 
 #include 
@@ -43,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -52,6 +54,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_X86
+#include 
+#endif
 #include 
 #include 
 
@@ -68,6 +73,8 @@ static int gnttab_free_count;
 static grant_ref_t gnttab_free_head;
 static DEFINE_SPINLOCK(gnttab_list_lock);
 struct grant_frames xen_auto_xlat_grant_frames;
+static unsigned int xen_gnttab_version;
+module_param_named(version, xen_gnttab_version, uint, 0);
 
 static union {
struct grant_entry_v1 *v1;
@@ -1177,12 +1184,36 @@ static const struct gnttab_ops gnttab_v2_ops = {
.query_foreign_access   = gnttab_query_foreign_access_v2,
 };
 
+static bool gnttab_need_v2(void)
+{
+#ifdef CONFIG_X86
+   uint32_t base, width;
+
+   if (xen_pv_domain()) {
+   base = xen_cpuid_base();
+   if (cpuid_eax(base) < 5)
+   return false;   /* Information not available, use V1. */
+   width = cpuid_ebx(base + 5) &
+   XEN_CPUID_MACHINE_ADDRESS_WIDTH_MASK;
+   return width > 32 + PAGE_SHIFT;
+   }
+#endif
+   return !!(max_possible_pfn >> 32);
+}
+
 static void gnttab_request_version(void)
 {
-   int rc;
+   long rc;
struct gnttab_set_version gsv;
 
-   gsv.version = 1;
+   if (gnttab_need_v2())
+   gsv.version = 2;
+   else
+   gsv.version = 1;
+
+   /* Boot parameter overrides automatic selection. */
+   if (xen_gnttab_version >= 1 && xen_gnttab_version <= 2)
+   gsv.version = xen_gnttab_version;
 
rc = HYPERVISOR_grant_table_op(GNTTABOP_set_version, , 1);
if (rc == 0 && gsv.version == 2)
-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 0/5] xen: grant table interface v2 support

2017-11-02 Thread Juergen Gross
In order to support Linux to run as a pv guest on machines with huge
memory (>16TB) or as a hvm guest with more than 16TB of memory the
kernel has to support grant table interface v2, as v1 is limited to
32 bit frame numbers.

This series re-adds that support (it has been removed in 2015) and
restricts usage of v2 to the features of v1 in order to support
migration to hosts which only support v1.

V2 is selected only in the following cases:
- the user has specified v2 via module parameter
- in a pv guest if the maximum possible memory address of the host is
  above 16TB (memory hotplug taken into account)
- in a hvm guest if the maximum guest memory address is above 16TB
  (again with memory hotplug taken into account)

Changes in V2:
- patch 2: remove update_trans_entry() from gnttab_ops (Boris Ostrovsky)
- added new patch 4
- patch 5: use cpuid on pv and max_possible_pfn on hvm for version select

Juergen Gross (5):
  xen: re-introduce support for grant v2 interface
  xen: limit grant v2 interface to the v1 functionality
  xen: add grant interface version dependent constants to gnttab_ops
  xen: update arch/x86/include/asm/xen/cpuid.h
  xen: select grant interface version

 arch/arm/xen/grant-table.c   |   9 +-
 arch/x86/include/asm/xen/cpuid.h |  42 +--
 arch/x86/xen/grant-table.c   |  60 +-
 drivers/xen/grant-table.c| 244 +++
 include/xen/grant_table.h|   5 +-
 5 files changed, 318 insertions(+), 42 deletions(-)

-- 
2.12.3


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2 1/5] xen: re-introduce support for grant v2 interface

2017-11-02 Thread Juergen Gross
The grant v2 support was removed from the kernel with
commit 438b33c7145ca8a5131a30c36d8f59bce119a19a ("xen/grant-table:
remove support for V2 tables") as the higher memory footprint of v2
grants resulted in less grants being possible for a kernel compared
to the v1 grant interface.

As machines with more than 16TB of memory are expected to be more
common in the near future support of grant v2 is mandatory in order
to be able to run a Xen pv domain at any memory location.

So re-add grant v2 support basically by reverting above commit.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 arch/arm/xen/grant-table.c |   9 +-
 arch/x86/xen/grant-table.c |  60 -
 drivers/xen/grant-table.c  | 312 -
 include/xen/grant_table.h  |  30 -
 4 files changed, 398 insertions(+), 13 deletions(-)

diff --git a/arch/arm/xen/grant-table.c b/arch/arm/xen/grant-table.c
index e43791829ace..91cf08ba1e95 100644
--- a/arch/arm/xen/grant-table.c
+++ b/arch/arm/xen/grant-table.c
@@ -45,7 +45,14 @@ void arch_gnttab_unmap(void *shared, unsigned long 
nr_gframes)
return;
 }
 
-int arch_gnttab_init(unsigned long nr_shared)
+int arch_gnttab_map_status(uint64_t *frames, unsigned long nr_gframes,
+  unsigned long max_nr_gframes,
+  grant_status_t **__shared)
+{
+   return -ENOSYS;
+}
+
+int arch_gnttab_init(unsigned long nr_shared, unsigned long nr_status)
 {
return 0;
 }
diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
index 809b6c812654..92ccc718152d 100644
--- a/arch/x86/xen/grant-table.c
+++ b/arch/x86/xen/grant-table.c
@@ -49,7 +49,7 @@
 static struct gnttab_vm_area {
struct vm_struct *area;
pte_t **ptes;
-} gnttab_shared_vm_area;
+} gnttab_shared_vm_area, gnttab_status_vm_area;
 
 int arch_gnttab_map_shared(unsigned long *frames, unsigned long nr_gframes,
   unsigned long max_nr_gframes,
@@ -73,16 +73,43 @@ int arch_gnttab_map_shared(unsigned long *frames, unsigned 
long nr_gframes,
return 0;
 }
 
+int arch_gnttab_map_status(uint64_t *frames, unsigned long nr_gframes,
+  unsigned long max_nr_gframes,
+  grant_status_t **__shared)
+{
+   grant_status_t *shared = *__shared;
+   unsigned long addr;
+   unsigned long i;
+
+   if (shared == NULL)
+   *__shared = shared = gnttab_status_vm_area.area->addr;
+
+   addr = (unsigned long)shared;
+
+   for (i = 0; i < nr_gframes; i++) {
+   set_pte_at(_mm, addr, gnttab_status_vm_area.ptes[i],
+  mfn_pte(frames[i], PAGE_KERNEL));
+   addr += PAGE_SIZE;
+   }
+
+   return 0;
+}
+
 void arch_gnttab_unmap(void *shared, unsigned long nr_gframes)
 {
+   pte_t **ptes;
unsigned long addr;
unsigned long i;
 
+   if (shared == gnttab_status_vm_area.area->addr)
+   ptes = gnttab_status_vm_area.ptes;
+   else
+   ptes = gnttab_shared_vm_area.ptes;
+
addr = (unsigned long)shared;
 
for (i = 0; i < nr_gframes; i++) {
-   set_pte_at(_mm, addr, gnttab_shared_vm_area.ptes[i],
-  __pte(0));
+   set_pte_at(_mm, addr, ptes[i], __pte(0));
addr += PAGE_SIZE;
}
 }
@@ -102,12 +129,35 @@ static int arch_gnttab_valloc(struct gnttab_vm_area 
*area, unsigned nr_frames)
return 0;
 }
 
-int arch_gnttab_init(unsigned long nr_shared)
+static void arch_gnttab_vfree(struct gnttab_vm_area *area)
 {
+   free_vm_area(area->area);
+   kfree(area->ptes);
+}
+
+int arch_gnttab_init(unsigned long nr_shared, unsigned long nr_status)
+{
+   int ret;
+
if (!xen_pv_domain())
return 0;
 
-   return arch_gnttab_valloc(_shared_vm_area, nr_shared);
+   ret = arch_gnttab_valloc(_shared_vm_area, nr_shared);
+   if (ret < 0)
+   return ret;
+
+   /*
+* Always allocate the space for the status frames in case
+* we're migrated to a host with V2 support.
+*/
+   ret = arch_gnttab_valloc(_status_vm_area, nr_status);
+   if (ret < 0)
+   goto err;
+
+   return 0;
+err:
+   arch_gnttab_vfree(_shared_vm_area);
+   return -ENOMEM;
 }
 
 #ifdef CONFIG_XEN_PVH
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 2c6a9114d332..65c4bdb0b463 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -71,6 +71,7 @@ struct grant_frames xen_auto_xlat_grant_frames;
 
 static union {
struct grant_entry_v1 *v1;
+   union grant_entry_v2 *v2;
void *addr;
 } gnttab_shared;
 
@@ -121,6 +122,29 @@ struct gnttab_ops {
 * by bit operations.
 */
int (*query_foreign_access)(grant_ref_t ref);
+   /*
+* Grant a domain to access a range of bytes within the 

[Xen-devel] [PATCH v2 3/5] xen: add grant interface version dependent constants to gnttab_ops

2017-11-02 Thread Juergen Gross
Instead of having multiple variables with constants like
grant_table_version or grefs_per_grant_frame add those to struct
gnttab_ops and access them just via the gnttab_interface pointer.

Signed-off-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 drivers/xen/grant-table.c | 73 ---
 1 file changed, 43 insertions(+), 30 deletions(-)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index db6a1b9efd11..573e7fe9b147 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -78,6 +78,14 @@ static union {
 /*This is a structure of function pointers for grant table*/
 struct gnttab_ops {
/*
+* Version of the grant interface.
+*/
+   unsigned int version;
+   /*
+* Grant refs per grant frame.
+*/
+   unsigned int grefs_per_grant_frame;
+   /*
 * Mapping a list of frames for storing grant entries. Frames parameter
 * is used to store grant table address when grant table being setup,
 * nr_gframes is the number of frames to map grant table. Returning
@@ -134,9 +142,6 @@ static const struct gnttab_ops *gnttab_interface;
 /* This reflects status of grant entries, so act as a global value. */
 static grant_status_t *grstatus;
 
-static int grant_table_version;
-static int grefs_per_grant_frame;
-
 static struct gnttab_free_callback *gnttab_free_callback_list;
 
 static int gnttab_expand(unsigned int req_entries);
@@ -636,19 +641,26 @@ void gnttab_cancel_free_callback(struct 
gnttab_free_callback *callback)
 }
 EXPORT_SYMBOL_GPL(gnttab_cancel_free_callback);
 
+static unsigned int gnttab_frames(unsigned int frames, unsigned int align)
+{
+   return (frames * gnttab_interface->grefs_per_grant_frame + align - 1) /
+  align;
+}
+
 static int grow_gnttab_list(unsigned int more_frames)
 {
unsigned int new_nr_grant_frames, extra_entries, i;
unsigned int nr_glist_frames, new_nr_glist_frames;
+   unsigned int grefs_per_frame;
 
-   BUG_ON(grefs_per_grant_frame == 0);
+   BUG_ON(gnttab_interface == NULL);
+   grefs_per_frame = gnttab_interface->grefs_per_grant_frame;
 
new_nr_grant_frames = nr_grant_frames + more_frames;
-   extra_entries   = more_frames * grefs_per_grant_frame;
+   extra_entries = more_frames * grefs_per_frame;
 
-   nr_glist_frames = (nr_grant_frames * grefs_per_grant_frame + RPP - 1) / 
RPP;
-   new_nr_glist_frames =
-   (new_nr_grant_frames * grefs_per_grant_frame + RPP - 1) / RPP;
+   nr_glist_frames = gnttab_frames(nr_grant_frames, RPP);
+   new_nr_glist_frames = gnttab_frames(new_nr_grant_frames, RPP);
for (i = nr_glist_frames; i < new_nr_glist_frames; i++) {
gnttab_list[i] = (grant_ref_t *)__get_free_page(GFP_ATOMIC);
if (!gnttab_list[i])
@@ -656,12 +668,12 @@ static int grow_gnttab_list(unsigned int more_frames)
}
 
 
-   for (i = grefs_per_grant_frame * nr_grant_frames;
-i < grefs_per_grant_frame * new_nr_grant_frames - 1; i++)
+   for (i = grefs_per_frame * nr_grant_frames;
+i < grefs_per_frame * new_nr_grant_frames - 1; i++)
gnttab_entry(i) = i + 1;
 
gnttab_entry(i) = gnttab_free_head;
-   gnttab_free_head = grefs_per_grant_frame * nr_grant_frames;
+   gnttab_free_head = grefs_per_frame * nr_grant_frames;
gnttab_free_count += extra_entries;
 
nr_grant_frames = new_nr_grant_frames;
@@ -1013,8 +1025,8 @@ EXPORT_SYMBOL_GPL(gnttab_unmap_refs_sync);
 
 static unsigned int nr_status_frames(unsigned int nr_grant_frames)
 {
-   BUG_ON(grefs_per_grant_frame == 0);
-   return (nr_grant_frames * grefs_per_grant_frame + SPP - 1) / SPP;
+   BUG_ON(gnttab_interface == NULL);
+   return gnttab_frames(nr_grant_frames, SPP);
 }
 
 static int gnttab_map_frames_v1(xen_pfn_t *frames, unsigned int nr_gframes)
@@ -1142,6 +1154,9 @@ static int gnttab_map(unsigned int start_idx, unsigned 
int end_idx)
 }
 
 static const struct gnttab_ops gnttab_v1_ops = {
+   .version= 1,
+   .grefs_per_grant_frame  = XEN_PAGE_SIZE /
+ sizeof(struct grant_entry_v1),
.map_frames = gnttab_map_frames_v1,
.unmap_frames   = gnttab_unmap_frames_v1,
.update_entry   = gnttab_update_entry_v1,
@@ -1151,6 +1166,9 @@ static const struct gnttab_ops gnttab_v1_ops = {
 };
 
 static const struct gnttab_ops gnttab_v2_ops = {
+   .version= 2,
+   .grefs_per_grant_frame  = XEN_PAGE_SIZE /
+ sizeof(union grant_entry_v2),
.map_frames = gnttab_map_frames_v2,
.unmap_frames   = gnttab_unmap_frame

Re: [Xen-devel] USB Passthrough: Incorrect device detected in Domain U

2017-11-02 Thread Juergen Gross
On 02/11/17 07:57, Rakesh Awanti wrote:
> Hello,
> 
> I'm trying USB passthrough on x86 and ARM platforms. I've taken the USB
> frontend and backend code from 

So from where?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Juergen Gross
On 01/11/17 16:32, Waiman Long wrote:
> Currently, there are 3 different lock types that can be chosen for
> the x86 architecture:
> 
>  - qspinlock
>  - pvqspinlock
>  - unfair lock
> 
> One of the above lock types will be chosen at boot time depending on
> a number of different factors.
> 
> Ideally, the hypervisors should be able to pick the best performing
> lock type for the current VM configuration. That is not currently
> the case as the performance of each lock type are affected by many
> different factors like the number of vCPUs in the VM, the amount vCPU
> overcommitment, the CPU type and so on.
> 
> Generally speaking, unfair lock performs well for VMs with a small
> number of vCPUs. Native qspinlock may perform better than pvqspinlock
> if there is vCPU pinning and there is no vCPU over-commitment.
> 
> This patch adds a new kernel parameter to allow administrator to
> choose the paravirt spinlock type to be used. VM administrators can
> experiment with the different lock types and choose one that can best
> suit their need, if they want to. Hypervisor developers can also use
> that to experiment with different lock types so that they can come
> up with a better algorithm to pick the best lock type.
> 
> The hypervisor paravirt spinlock code will override this new parameter
> in determining if pvqspinlock should be used. The parameter, however,
> will override Xen's xen_nopvspin in term of disabling unfair lock.

Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others
think?

> 
> Signed-off-by: Waiman Long 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  7 +
>  arch/x86/include/asm/paravirt.h |  9 ++
>  arch/x86/kernel/kvm.c   |  4 +++
>  arch/x86/kernel/paravirt.c  | 40 
> -
>  arch/x86/xen/spinlock.c |  6 ++--
>  5 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index f7df49d..c98d9c7 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3275,6 +3275,13 @@
>   [KNL] Number of legacy pty's. Overwrites compiled-in
>   default number.
>  
> + pvlock_type=[X86,PV_OPS]
> + Specify the paravirt spinlock type to be used.
> + Options are:
> + queued - native queued spinlock
> + pv - paravirt queued spinlock
> + unfair - simple TATAS unfair lock
> +
>   quiet   [KNL] Disable most log messages
>  
>   r128=   [HW,DRM]
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 12deec7..941a046 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long 
> cpu)
>  
>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>  
> +enum pv_spinlock_type {
> + locktype_auto,
> + locktype_queued,
> + locktype_paravirt,
> + locktype_unfair,
> +};
> +
> +extern enum pv_spinlock_type pv_spinlock_type;
> +
>  #ifdef CONFIG_X86_32
>  #define PV_SAVE_REGS "pushl %ecx; pushl %edx;"
>  #define PV_RESTORE_REGS "popl %edx; popl %ecx;"
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 8bb9594..3a5d3ec4 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -646,6 +646,10 @@ void __init kvm_spinlock_init(void)
>   if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
>   return;
>  
> + if ((pv_spinlock_type == locktype_queued) ||
> + (pv_spinlock_type == locktype_unfair))
> + return;
> +
>   __pv_init_lock_hash();
>   pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
>   pv_lock_ops.queued_spin_unlock = 
> PV_CALLEE_SAVE(__pv_queued_spin_unlock);
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 041096b..ca35cd3 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void 
> *target,
>   return 5;
>  }
>  
> +/*
> + * The kernel argument "pvlock_type=" can be used to explicitly specify
> + * which type of spinlocks to be used. Currently, there are 3 options:
> + * 1) queued - the native queued spinlock
> + * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS)
> + * 3) unfair - the simple TATAS unfair lock
> + *
> + * If this argument is not specified, the kernel will automatically choose
> + * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor
> + * specific settings.
> + */
> +enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto;
> +
> +static int __init pvlock_setup(char *s)
> 

Re: [Xen-devel] [PATCH v2] xen: support priv-mapping in an HVM tools domain

2017-11-01 Thread Juergen Gross
On 01/11/17 14:45, Paul Durrant wrote:
>> -Original Message-
>> From: Juergen Gross [mailto:jgr...@suse.com]
>> Sent: 01 November 2017 13:40
>> To: Paul Durrant <paul.durr...@citrix.com>; x...@kernel.org; xen-
>> de...@lists.xenproject.org; linux-ker...@vger.kernel.org
>> Cc: Boris Ostrovsky <boris.ostrov...@oracle.com>; Thomas Gleixner
>> <t...@linutronix.de>; Ingo Molnar <mi...@redhat.com>; H. Peter Anvin
>> <h...@zytor.com>
>> Subject: Re: [PATCH v2] xen: support priv-mapping in an HVM tools domain
>>
>> On 01/11/17 12:31, Paul Durrant wrote:
>>> If the domain has XENFEAT_auto_translated_physmap then use of the PV-
>>> specific HYPERVISOR_mmu_update hypercall is clearly incorrect.
>>>
>>> This patch adds checks in xen_remap_domain_gfn_array() and
>>> xen_unmap_domain_gfn_array() which call through to the approprate
>>> xlate_mmu function if the feature is present.
>>>
>>> This patch also moves xen_remap_domain_gfn_range() into the PV-only
>> MMU
>>> code and #ifdefs the (only) calling code in privcmd accordingly.
>>>
>>> Signed-off-by: Paul Durrant <paul.durr...@citrix.com>
>>> ---
>>> Cc: Boris Ostrovsky <boris.ostrov...@oracle.com>
>>> Cc: Juergen Gross <jgr...@suse.com>
>>> Cc: Thomas Gleixner <t...@linutronix.de>
>>> Cc: Ingo Molnar <mi...@redhat.com>
>>> Cc: "H. Peter Anvin" <h...@zytor.com>
>>> ---
>>>  arch/x86/xen/mmu.c| 36 +---
>>>  arch/x86/xen/mmu_pv.c | 11 +++
>>>  drivers/xen/privcmd.c | 17 +
>>>  include/xen/xen-ops.h |  7 +++
>>>  4 files changed, 48 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>>> index 3e15345abfe7..01837c36e293 100644
>>> --- a/arch/x86/xen/mmu.c
>>> +++ b/arch/x86/xen/mmu.c
>>> @@ -91,12 +91,12 @@ static int remap_area_mfn_pte_fn(pte_t *ptep,
>> pgtable_t token,
>>> return 0;
>>>  }
>>>
>>> -static int do_remap_gfn(struct vm_area_struct *vma,
>>> -   unsigned long addr,
>>> -   xen_pfn_t *gfn, int nr,
>>> -   int *err_ptr, pgprot_t prot,
>>> -   unsigned domid,
>>> -   struct page **pages)
>>> +int xen_remap_gfn(struct vm_area_struct *vma,
>>> + unsigned long addr,
>>> + xen_pfn_t *gfn, int nr,
>>> + int *err_ptr, pgprot_t prot,
>>> + unsigned int domid,
>>> + struct page **pages)
>>>  {
>>> int err = 0;
>>> struct remap_data rmd;
>>> @@ -166,36 +166,34 @@ static int do_remap_gfn(struct vm_area_struct
>> *vma,
>>> return err < 0 ? err : mapped;
>>>  }
>>>
>>> -int xen_remap_domain_gfn_range(struct vm_area_struct *vma,
>>> -  unsigned long addr,
>>> -  xen_pfn_t gfn, int nr,
>>> -  pgprot_t prot, unsigned domid,
>>> -  struct page **pages)
>>> -{
>>> -   return do_remap_gfn(vma, addr, , nr, NULL, prot, domid,
>> pages);
>>> -}
>>> -EXPORT_SYMBOL_GPL(xen_remap_domain_gfn_range);
>>> -
>>>  int xen_remap_domain_gfn_array(struct vm_area_struct *vma,
>>>unsigned long addr,
>>>xen_pfn_t *gfn, int nr,
>>>int *err_ptr, pgprot_t prot,
>>>unsigned domid, struct page **pages)
>>>  {
>>> +   if (xen_feature(XENFEAT_auto_translated_physmap))
>>> +   return xen_xlate_remap_gfn_array(vma, addr, gfn, nr,
>> err_ptr,
>>> +prot, domid, pages);
>>> +
>>> /* We BUG_ON because it's a programmer error to pass a NULL
>> err_ptr,
>>>  * and the consequences later is quite hard to detect what the actual
>>>  * cause of "wrong memory was mapped in".
>>>  */
>>> BUG_ON(err_ptr == NULL);
>>> -   return do_remap_gfn(vma, addr, gfn, nr, err_ptr, prot, domid,
>> pages);
>>> +   return xen_remap_gfn(vma, addr, gfn, nr, err_ptr, prot, domid,
>>> +pages);
>>>  }
>>>  EXPORT_SYMBOL_GPL(

Re: [Xen-devel] [PATCH v2] xen: support priv-mapping in an HVM tools domain

2017-11-01 Thread Juergen Gross
On 01/11/17 12:31, Paul Durrant wrote:
> If the domain has XENFEAT_auto_translated_physmap then use of the PV-
> specific HYPERVISOR_mmu_update hypercall is clearly incorrect.
> 
> This patch adds checks in xen_remap_domain_gfn_array() and
> xen_unmap_domain_gfn_array() which call through to the approprate
> xlate_mmu function if the feature is present.
> 
> This patch also moves xen_remap_domain_gfn_range() into the PV-only MMU
> code and #ifdefs the (only) calling code in privcmd accordingly.
> 
> Signed-off-by: Paul Durrant <paul.durr...@citrix.com>
> ---
> Cc: Boris Ostrovsky <boris.ostrov...@oracle.com>
> Cc: Juergen Gross <jgr...@suse.com>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: "H. Peter Anvin" <h...@zytor.com>
> ---
>  arch/x86/xen/mmu.c| 36 +---
>  arch/x86/xen/mmu_pv.c | 11 +++
>  drivers/xen/privcmd.c | 17 +
>  include/xen/xen-ops.h |  7 +++
>  4 files changed, 48 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 3e15345abfe7..01837c36e293 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -91,12 +91,12 @@ static int remap_area_mfn_pte_fn(pte_t *ptep, pgtable_t 
> token,
>   return 0;
>  }
>  
> -static int do_remap_gfn(struct vm_area_struct *vma,
> - unsigned long addr,
> - xen_pfn_t *gfn, int nr,
> - int *err_ptr, pgprot_t prot,
> - unsigned domid,
> - struct page **pages)
> +int xen_remap_gfn(struct vm_area_struct *vma,
> +   unsigned long addr,
> +   xen_pfn_t *gfn, int nr,
> +   int *err_ptr, pgprot_t prot,
> +   unsigned int domid,
> +   struct page **pages)
>  {
>   int err = 0;
>   struct remap_data rmd;
> @@ -166,36 +166,34 @@ static int do_remap_gfn(struct vm_area_struct *vma,
>   return err < 0 ? err : mapped;
>  }
>  
> -int xen_remap_domain_gfn_range(struct vm_area_struct *vma,
> -unsigned long addr,
> -xen_pfn_t gfn, int nr,
> -pgprot_t prot, unsigned domid,
> -struct page **pages)
> -{
> - return do_remap_gfn(vma, addr, , nr, NULL, prot, domid, pages);
> -}
> -EXPORT_SYMBOL_GPL(xen_remap_domain_gfn_range);
> -
>  int xen_remap_domain_gfn_array(struct vm_area_struct *vma,
>  unsigned long addr,
>  xen_pfn_t *gfn, int nr,
>  int *err_ptr, pgprot_t prot,
>  unsigned domid, struct page **pages)
>  {
> + if (xen_feature(XENFEAT_auto_translated_physmap))
> + return xen_xlate_remap_gfn_array(vma, addr, gfn, nr, err_ptr,
> +  prot, domid, pages);
> +
>   /* We BUG_ON because it's a programmer error to pass a NULL err_ptr,
>* and the consequences later is quite hard to detect what the actual
>* cause of "wrong memory was mapped in".
>*/
>   BUG_ON(err_ptr == NULL);
> - return do_remap_gfn(vma, addr, gfn, nr, err_ptr, prot, domid, pages);
> + return xen_remap_gfn(vma, addr, gfn, nr, err_ptr, prot, domid,
> +  pages);
>  }
>  EXPORT_SYMBOL_GPL(xen_remap_domain_gfn_array);
>  
>  /* Returns: 0 success */
>  int xen_unmap_domain_gfn_range(struct vm_area_struct *vma,
> -int numpgs, struct page **pages)
> +int nr, struct page **pages)
>  {
> - if (!pages || !xen_feature(XENFEAT_auto_translated_physmap))
> + if (xen_feature(XENFEAT_auto_translated_physmap))
> + return xen_xlate_unmap_gfn_range(vma, nr, pages);
> +
> + if (!pages)
>   return 0;
>  
>   return -EINVAL;
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 71495f1a86d7..4974d8a6c2b4 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2670,3 +2670,14 @@ phys_addr_t paddr_vmcoreinfo_note(void)
>   return __pa(vmcoreinfo_note);
>  }
>  #endif /* CONFIG_KEXEC_CORE */
> +
> +int xen_remap_domain_gfn_range(struct vm_area_struct *vma,
> +unsigned long addr,
> +xen_pfn_t gfn, int nr,
> +pgprot_t prot, unsigned int domid,
> +struct page **pages)
> +{
> + return xen_remap_gfn(vma, addr, , nr, NULL, prot, domid

Re: [Xen-devel] [BUG] xen_gntdev - gntdev_vma_find_special_page - unable to handle kernel paging request

2017-10-30 Thread Juergen Gross
On 30/10/17 09:05, Arthur Borsboom wrote:
> Hi Juergen,
> 
> I needed to wait until the VM guest crashed again, so that caused this
> delay.
> I have extracted the hypervisor dmesg and the dom0 dmesg which can be
> found in the PasteBins.
> It is always the same domain crashing, if that is of any help.
> 
> Xen hypervisor dmesg: https://pastebin.com/Y9fv0Ey0

Aah, interesting. Do you happen to configure your domains with maxmem=
larger than memory= ?

I'm asking because the error messages seem to be related to another
problem introduced in kernel 4.13 which is fixed in 4.14. This problem
should be avoidable by setting maxmem= to the same value as memory=
in the domain config of HVM domains.

BTW: your hypervisor parameters are containing "dom0pvh=1". You should
remove it as PVH support has been reworked and is not yet working for
dom0.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [locking/paravirt] static_key_disable_cpuslocked(): static key 'virt_spin_lock_key+0x0/0x20' used before call to jump_label_init()

2017-10-30 Thread Juergen Gross
On 30/10/17 08:35, Fengguang Wu wrote:
> On Sun, Oct 29, 2017 at 11:51:55PM +0100, Fengguang Wu wrote:
>> Hi Linus,
>>
>> Up to now we see the below boot error/warnings when testing v4.14-rc6.
>>
>> They hit the RC release mainly due to various imperfections in 0day's
>> auto bisection. So I manually list them here and CC the likely easy to
>> debug ones to the corresponding maintainers in the followup emails.
>>
>> boot_successes: 4700
>> boot_failures: 247
> 
> [...]
> 
>> WARNING:at_kernel/jump_label.c:#static_key_disable_cpuslocked: 7

This patch is in the tip tree only, it will be merged in 4.15. So I
don't understand why you are reporting this for 4.14-rc6.

There is a patch by Dou Liyang pending since 28th October addressing
that issue:

[PATCH tip v2] x86/paravirt: Make the virt_spin_lock_key setup after
jump_label_init()


Juergen

> 
> The call trace is
> 
> [    0.00] Booting paravirtualized kernel on bare hardware
> [    0.00] clocksource: refined-jiffies: mask: 0x
> max_cycles: 0x, max_idle_ns: 1910969940391419 ns
> [    0.00] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:72
> nr_cpu_ids:72 nr_node_ids:2
> [    0.00] percpu: Embedded 39 pages/cpu @88103f40 s120088
> r8192 d31464 u262144
> [    0.00] pcpu-alloc: s120088 r8192 d31464 u262144 alloc=1*2097152
> [    0.00] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 [0] 08 09 10 11
> 12 13 14 15 [    0.00] pcpu-alloc: [0] 16 17 36 37 38 39 40 41 [0]
> 42 43 44 45 46 47 48 49 [    0.00] pcpu-alloc: [0] 50 51 52 53 -- --
> -- -- [1] 18 19 20 21 22 23 24 25 [    0.00] pcpu-alloc: [1] 26 27
> 28 29 30 31 32 33 [1] 34 35 54 55 56 57 58 59 [    0.00] pcpu-alloc:
> [1] 60 61 62 63 64 65 66 67 [1] 68 69 70 71 -- -- -- -- [    0.00]
> static_key_disable_cpuslocked(): static key
> 'virt_spin_lock_key+0x0/0x20' used before call to jump_label_init()
> [    0.00] [ cut here ]
> [    0.00] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:161
> static_key_disable_cpuslocked+0x6c/0x80
> [    0.00] Modules linked in:
> [    0.00] CPU: 0 PID: 0 Comm: swapper Not tainted
> 4.14.0-rc6-00162-g3d6dabc2 #1
> [    0.00] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> SE5C610.86B.01.01.0008.021120151325 02/11/2015
> [    0.00] task: 81e10480 task.stack: 81e0
> [    0.00] RIP: 0010:static_key_disable_cpuslocked+0x6c/0x80
> [    0.00] RSP: :81e03e98 EFLAGS: 00010092 ORIG_RAX:
> 
> [    0.00] RAX: 006f RBX: 81e36fc0 RCX:
> 81e60cf8
> [    0.00] RDX: 0001 RSI: 0092 RDI:
> 0047
> [    0.00] RBP: 81e03ea8 R08: 6b5f636974617473 R09:
> 018c
> [    0.00] R10:  R11: 006f R12:
> 88207ffcf4c0
> [    0.00] R13: 82170920 R14: 8218b780 R15:
> 
> [    0.00] FS:  () GS:88103f40()
> knlGS:
> [    0.00] CS:  0010 DS:  ES:  CR0: 80050033
> [    0.00] CR2: 88207f4e CR3: 00207ee09000 CR4:
> 000606b0
> [    0.00] Call Trace:
> [    0.00]  static_key_disable+0x1a/0x30
> [    0.00]  native_pv_lock_init+0x1b/0x1e
> [    0.00]  native_smp_prepare_boot_cpu+0x32/0x35
> [    0.00]  start_kernel+0x14f/0x421
> [    0.00]  x86_64_start_reservations+0x2a/0x2c
> [    0.00]  x86_64_start_kernel+0x72/0x75
> [    0.00]  secondary_startup_64+0xa5/0xb0
> [    0.00] Code: 85 c0 75 2f 48 c7 c7 20 5e f0 81 e8 df 2a 7b 00 5b
> 41 5c 5d c3 48 89 fa 48 c7 c6 40 93 a3 81 48 c7 c7 00 8b cb 81 e8 95 b5
> f3 ff <0f> ff eb a8 0f ff eb b3 48 89 df e8 14 fc ff ff eb c7 66 90 e8
> [    0.00] ---[ end trace c12d07f00399ce78 ]---
> [    0.00] Built 2 zonelists, mobility grouping on.  Total pages:
> 33006159
> [    0.00] Policy zone: Normal
> 
> which is bisected to
> 
> commit 9043442b43b1fddf202591b84702863286700c1a
> Author: Juergen Gross <jgr...@suse.com>
> AuthorDate: Wed Sep 6 19:36:24 2017 +0200
> Commit: Ingo Molnar <mi...@kernel.org>
> CommitDate: Tue Oct 10 11:50:12 2017 +0200
> 
>    locking/paravirt: Use new static key for controlling call of
> virt_spin_lock()
> 
>    There are cases where a guest tries to switch spinlocks to bare metal
>    behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
>    has the downside of falling back to unfair test and set scheme for
>    qspinlocks due to virt_spin_lock() detecting the virtualized
>    environment.
> 
>    Add a static k

Re: [Xen-devel] [PATCH v4 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-30 Thread Juergen Gross
On 30/10/17 06:26, Dongli Zhang wrote:
> After guest live migration on xen, steal time in /proc/stat
> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> xen_steal_lock() might be less than this_rq()->prev_steal_time which is
> derived from previous return value of xen_steal_clock().
> 
> For instance, steal time of each vcpu is 335 before live migration.
> 
> cpu  198 0 368 200064 1962 0 0 1340 0 0
> cpu0 38 0 81 50063 492 0 0 335 0 0
> cpu1 65 0 97 49763 634 0 0 335 0 0
> cpu2 38 0 81 50098 462 0 0 335 0 0
> cpu3 56 0 107 50138 374 0 0 335 0 0
> 
> After live migration, steal time is reduced to 312.
> 
> cpu  200 0 370 200330 1971 0 0 1248 0 0
> cpu0 38 0 82 50123 500 0 0 312 0 0
> cpu1 65 0 97 49832 634 0 0 312 0 0
> cpu2 39 0 82 50167 462 0 0 312 0 0
> cpu3 56 0 107 50207 374 0 0 312 0 0
> 
> Since runstate times are cumulative and cleared during xen live migration
> by xen hypervisor, the idea of this patch is to accumulate runstate times
> to global percpu variables before live migration suspend. Once guest VM is
> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
> runstate times and previously accumulated times stored in global percpu
> variables.
> 
> Similar and more severe issue would impact prior linux 4.8-4.10 as
> discussed by Michael Las at
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
> which would overflow steal time and lead to 100% st usage in top command
> for linux 4.8-4.10. A backport of this patch would fix that issue.
> 
> References: 
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
> Signed-off-by: Dongli Zhang 
> 
> ---
> Changed since v1:
>   * relocate modification to xen_get_runstate_snapshot_cpu
> 
> Changed since v2:
>   * accumulate runstate times before live migration
> 
> Changed since v3:
>   * do not accumulate times in the case of guest checkpointing
> 
> ---
>  drivers/xen/manage.c |  2 ++
>  drivers/xen/time.c   | 83 
> ++--
>  include/xen/interface/vcpu.h |  2 ++
>  include/xen/xen-ops.h|  1 +
>  4 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
> index c425d03..3dc085d 100644
> --- a/drivers/xen/manage.c
> +++ b/drivers/xen/manage.c
> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>   }
>  
>   gnttab_suspend();
> + xen_accumulate_runstate_time(-1);
>   xen_arch_pre_suspend();
>  
>   /*
> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
> : 0);
>  
>   xen_arch_post_suspend(si->cancelled);
> + xen_accumulate_runstate_time(si->cancelled);
>   gnttab_resume();
>  
>   if (!si->cancelled) {
> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
> index ac5f23f..18e2b76 100644
> --- a/drivers/xen/time.c
> +++ b/drivers/xen/time.c
> @@ -19,6 +19,9 @@
>  /* runstate info updated by Xen */
>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>  
> +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
> +static u64 **runstate_time_delta;
> +
>  /* return an consistent snapshot of 64-bit time/counter value */
>  static u64 get64(const u64 *p)
>  {
> @@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
>   return ret;
>  }
>  
> -static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
> -   unsigned int cpu)
> +static void xen_get_runstate_snapshot_cpu_delta(
> + struct vcpu_runstate_info *res, unsigned int cpu)
>  {
>   u64 state_time;
>   struct vcpu_runstate_info *state;
> @@ -66,6 +69,82 @@ static void xen_get_runstate_snapshot_cpu(struct 
> vcpu_runstate_info *res,
>(state_time & XEN_RUNSTATE_UPDATE));
>  }
>  
> +static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
> +   unsigned int cpu)
> +{
> + int i;
> +
> + xen_get_runstate_snapshot_cpu_delta(res, cpu);
> +
> + for (i = 0; i < RUNSTATE_max; i++)
> + res->time[i] += per_cpu(old_runstate_time, cpu)[i];
> +}
> +
> +void xen_accumulate_runstate_time(int action)
> +{
> + struct vcpu_runstate_info state;
> + int cpu, i;
> +
> + switch (action) {
> + case -1: /* backup runstate time before suspend */
> + WARN_ON_ONCE(unlikely(runstate_time_delta));
> +
> + runstate_time_delta = kcalloc(num_possible_cpus(),
> +   sizeof(*runstate_time_delta),
> +   GFP_KERNEL);

You know the number of cpus, so you can just allocate an array of
struct vcpu_runstate_info:

struct vcpu_runstate_info *runstate_time_delta;

kcalloc(num_possible_cpus(), sizeof(*runstate_time_delta), GFP_KERNEL);

then ...

> + if 

  1   2   3   4   5   6   7   8   9   10   >