Re: [PATCH] x86/paravirt: use %rip-relative addressing in hook calls
On 05.10.2021 09:43, Juergen Gross wrote: > On 30.09.21 14:40, Jan Beulich via Virtualization wrote: >> While using a plain (constant) address works, its use needlessly invokes >> a SIB addressing mode, making every call site one byte larger than >> necessary. Instead of using an "i" constraint with address-of operator >> and a 'c' operand modifier, simply use an ordinary "m" constraint, which >> the 64-bit compiler will translate to %rip-relative addressing. This way >> we also tell the compiler the truth about operand usage - the memory >> location gets actually read, after all. >> >> 32-bit code generation is unaffected by the change. >> >> Signed-off-by: Jan Beulich > > Reviewed-by: Juergen Gross Thanks. I notice this wasn't part of your 5.16-rc1 pull request, nor did it make it into Linus'es tree via any other route. May I ask what the plans here are? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] x86/paravirt: use %rip-relative addressing in hook calls
While using a plain (constant) address works, its use needlessly invokes a SIB addressing mode, making every call site one byte larger than necessary. Instead of using an "i" constraint with address-of operator and a 'c' operand modifier, simply use an ordinary "m" constraint, which the 64-bit compiler will translate to %rip-relative addressing. This way we also tell the compiler the truth about operand usage - the memory location gets actually read, after all. 32-bit code generation is unaffected by the change. Signed-off-by: Jan Beulich --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -278,7 +278,7 @@ extern void (*paravirt_iret)(void); #define paravirt_type(op) \ [paravirt_typenum] "i" (PARAVIRT_PATCH(op)),\ - [paravirt_opptr] "i" (&(pv_ops.op)) + [paravirt_opptr] "m" (pv_ops.op) #define paravirt_clobber(clobber) \ [paravirt_clobber] "i" (clobber) @@ -315,7 +315,7 @@ int paravirt_disable_iospace(void); */ #define PARAVIRT_CALL \ ANNOTATE_RETPOLINE_SAFE \ - "call *%c[paravirt_opptr];" + "call *%[paravirt_opptr];" /* * These macros are intended to wrap calls through one of the paravirt ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap
On 19.02.2020 06:35, Jürgen Groß wrote: > On 18.02.20 22:03, Thomas Gleixner wrote: >> Juergen Gross writes: >>> Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control >>> ioperm() as well") reworked the iopl syscall to use I/O bitmaps. >>> >>> Unfortunately this broke Xen PV domains using that syscall as there >>> is currently no I/O bitmap support in PV domains. >>> >>> Add I/O bitmap support via a new paravirt function update_io_bitmap >>> which Xen PV domains can use to update their I/O bitmaps via a >>> hypercall. >>> >>> Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() >>> as well") >>> Reported-by: Jan Beulich >>> Cc: # 5.5 >>> Signed-off-by: Juergen Gross >>> Reviewed-by: Jan Beulich >>> Tested-by: Jan Beulich >> >> Duh, sorry about that and thanks for fixing it. >> >> BTW, why isn't stuff like this not catched during next or at least >> before the final release? Is nothing running CI on upstream with all >> that XEN muck active? > > This problem showed up by not being able to start the X server (probably > not the freshest one) in dom0 on a moderate aged AMD system. Not the freshest one, yes, but also on a system where KMS would not be available (my success rate with KMS is rather low overall, and with newer Linux I see rather more systems to stop working than ones to become working, but I simply don't have the time to investigate). Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 0/2] Remove 32-bit Xen PV guest support
On 15.07.2019 19:28, Andy Lutomirski wrote: > On Mon, Jul 15, 2019 at 9:34 AM Andi Kleen wrote: >> >> Juergen Gross writes: >> >>> The long term plan has been to replace Xen PV guests by PVH. The first >>> victim of that plan are now 32-bit PV guests, as those are used only >>> rather seldom these days. Xen on x86 requires 64-bit support and with >>> Grub2 now supporting PVH officially since version 2.04 there is no >>> need to keep 32-bit PV guest support alive in the Linux kernel. >>> Additionally Meltdown mitigation is not available in the kernel running >>> as 32-bit PV guest, so dropping this mode makes sense from security >>> point of view, too. >> >> Normally we have a deprecation period for feature removals like this. >> You would make the kernel print a warning for some releases, and when >> no user complains you can then remove. If a user complains you can't. >> > > As I understand it, the kernel rules do allow changes like this even > if there's a complaint: this is a patch that removes what is > effectively hardware support. If the maintenance cost exceeds the > value, then removal is fair game. (Obviously we weight the value to > preserving compatibility quite highly, but in this case, Xen dropped > 32-bit hardware support a long time ago. If the Xen hypervisor says > that 32-bit PV guest support is deprecated, it's deprecated.) Since it was implied but not explicit from Andrew's reply, just to make it explicit: So far 32-bit PV guest support has not been deprecated in Xen itself. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH RESEND] x86-64: use RIP-relative calls for paravirt indirect ones
This saves one insn byte per instance, summing up to a savings of over 4k in my (stripped down) configuration. No variant of to be patched in replacement code relies on the one byte larger size. Signed-off-by: Jan Beulich Reviewed-by: Juergen Gross --- Resend to include x86 maintainers, who aren't listed explicitly for the file changed. --- arch/x86/include/asm/paravirt_types.h |6 ++ 1 file changed, 6 insertions(+) --- 4.18/arch/x86/include/asm/paravirt_types.h +++ 4.18-x86_64-pvops-call-RIPrel/arch/x86/include/asm/paravirt_types.h @@ -393,9 +393,15 @@ int paravirt_disable_iospace(void); * offset into the paravirt_patch_template structure, and can therefore be * freely converted back into a structure offset. */ +#ifdef CONFIG_X86_32 #define PARAVIRT_CALL \ ANNOTATE_RETPOLINE_SAFE \ "call *%c[paravirt_opptr];" +#else +#define PARAVIRT_CALL \ + ANNOTATE_RETPOLINE_SAFE \ + "call *%c[paravirt_opptr](%%rip);" +#endif /* * These macros are intended to wrap calls through one of the paravirt ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 04/10] x86/paravirt: use a single ops structure
>>> On 10.08.18 at 13:52, wrote: > --- a/arch/x86/hyperv/mmu.c > +++ b/arch/x86/hyperv/mmu.c > @@ -228,9 +228,9 @@ void hyperv_setup_mmu_ops(void) > > if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED)) { > pr_info("Using hypercall for remote TLB flush\n"); > - pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others; > + pv_ops.pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others; Taking just this as example, why not pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others; ? Both pv_ and _ops are redundant on the field names. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] x86-64: use RIP-relative calls for paravirt indirect ones
This saves one insn byte per instance, summing up to a savings of over 4k in my (stripped down) configuration. No variant of to be patched in replacement code relies on the one byte larger size. Signed-off-by: Jan Beulich --- arch/x86/include/asm/paravirt_types.h |6 ++ 1 file changed, 6 insertions(+) --- 4.18-rc2/arch/x86/include/asm/paravirt_types.h +++ 4.18-rc2-x86_64-pvops-call-RIPrel/arch/x86/include/asm/paravirt_types.h @@ -393,9 +393,15 @@ int paravirt_disable_iospace(void); * offset into the paravirt_patch_template structure, and can therefore be * freely converted back into a structure offset. */ +#ifdef CONFIG_X86_32 #define PARAVIRT_CALL \ ANNOTATE_RETPOLINE_SAFE \ "call *%c[paravirt_opptr];" +#else +#define PARAVIRT_CALL \ + ANNOTATE_RETPOLINE_SAFE \ + "call *%c[paravirt_opptr](%%rip);" +#endif /* * These macros are intended to wrap calls through one of the paravirt ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v2 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()
>>> On 21.03.16 at 13:24,wrote: > @@ -758,9 +759,14 @@ struct smp_sync_call_struct { > static void smp_call_sync_callback(struct work_struct *work) > { > struct smp_sync_call_struct *sscs; > + unsigned int cpu = smp_processor_id(); So this obtains the vCPU number, yet ... > sscs = container_of(work, struct smp_sync_call_struct, work); > + preempt_disable(); > + hypervisor_pin_vcpu(cpu); ... here you're supposed to pass a pCPU number. Also don't you need to call smp_processor_id() after preempt_disable()? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] kasan_map_early_shadow() on Xen
On 03.03.15 at 10:40, mcg...@suse.com wrote: Let's go down the rabbit hole for a bit. HAVE_ARCH_KASAN will be selected on x86 when: if X86_64 SPARSEMEM_VMEMMAP Now Xen should not have SPARSEMEM_VMEMMAP Why would that be? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH] xen: remove unused Kconfig parameter
On 09.07.13 at 02:26, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: Could you explain to me please why the check in the scripts is superfluous? This is not really the question - _any_ such check can only be wrong. The boot loader has absolutely no business caring about kernel internals, and the kernel shouldn't be limited by it in how it renames/adds/deletes Kconfig options and anything else. Especially as the grand plan is to get rid of CONFIG_XEN_DOM0 and more or less have a backend and fronted config option (since that makes more sense nowadays). And that would make the XEN_DOM0 be obsolete and the XEN_PRIV would be the one that turns a lot of the options needed to compile a kernel that can provide backend driver support. (I am hand waving here). That's pretty odd a plan, considering that Dom0 is more than just an environment to provide backends. In fact, Dom0 may not be providing any backends at all, and instead just serve the controlling hardware and/or control domain purpose that it really is meant to be. But of course, if none of _that_ functionality depends on this config option, then it indeed could go away. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH] xen: remove unused Kconfig parameter
On 09.07.13 at 16:48, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: Dom0 has been both - but there is nothing wrong with seperating these two labels. And therein I was thinking that the 'hardware backend domain' should be the introduced. I am not good with names so the best option seems CONFIG_XEN_PRIVILIGED, but that sounds to be like 'controlling domain'. Any good ideas for names? CONFIG_XEN_HARDWARE_DOMAIN ? CONFIG_XEN_PRIVILIGED_DOMAIN? Following the naming in the hypervisor, CONFIG_XEN_HARDWARE_DOMAIN and CONFIG_XEN_CONTROL_DOMAIN would seem reasonable to me. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] xen: remove bm_rld_set of xen_processor_flags
On 04.06.13 at 10:05, liguang lig.f...@cn.fujitsu.com wrote: bm_rld_set seems obsolete now Signed-off-by: liguang lig.f...@cn.fujitsu.com --- include/xen/interface/platform.h |1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/include/xen/interface/platform.h b/include/xen/interface/platform.h index c57d5f6..733 100644 --- a/include/xen/interface/platform.h +++ b/include/xen/interface/platform.h @@ -240,7 +240,6 @@ struct xen_processor_flags { uint32_t bm_check:1; uint32_t has_cst:1; uint32_t power_setup_done:1; - uint32_t bm_rld_set:1; }; struct xen_processor_power { Any such patch would need to be submitted against the master copy of the header (in the Xen repo), and by recognizing that you'd also notice that this is part of a public ABI, and hence can't be removed, but at best can be documented as obsolete. Of course you'd first need to check whether the hypervisor makes any use of that bit when passed down from Dom0. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] xen-pciback: fix error return code in pcistub_irq_handler_switch()
On 31.05.13 at 13:59, Wei Yongjun weiyj...@gmail.com wrote: From: Wei Yongjun yongjun_...@trendmicro.com.cn Fix to return -ENOENT in the pcistub_device_find() and pci_get_drvdata() error handling case instead of 0(overwrite to 0 by str_to_slot()), as done elsewhere in this function. Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn Acked-by: Jan Beulich jbeul...@suse.com ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Readonly GDT
On 10.04.13 at 02:43, H. Peter Anvin h...@zytor.com wrote: OK, thinking about the GDT here. The GDT is quite small -- 256 bytes on i386, 128 bytes on x86-64. As such, we probably don't want to allocate a full page to it for only that. This means that in order to create a readonly mapping we have to pack GDTs from different CPUs together in the same pages, *or* we tolerate that other things on the same page gets reflected in the same mapping. I think a read-only GDT is incompatible with exceptions delivered through task gates (i.e. double fault on 32-bit), so I would assume this needs to remain a 64-bit only thing. However, the packing solution has the advantage of reducing address space consumption which matters on 32 bits: even on i386 we can easily burn a megabyte of address space for 4096 processors, but burning 16 megabytes starts to hurt. Packing would have the additional benefit of Xen not needing to become a special case in yet another area (because pages containing live descriptor table entries need to be read-only for PV guests, and need to consist of only descriptor table entries). Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.
On 14.02.13 at 22:52, Greg KH gre...@linuxfoundation.org wrote: Jan, any reason why this patch isn't in Linus's tree already, I see Konrad answered that already, and with the embargo on the CVE having expired only on Tuesday, I think it's not unreasonable to not see this there yet. and why it wasn't marked for inclusion in a -stable kernel release? Forgot to put the tag on when putting the patch together, and then none of the reviewers noticed either. I'm sorry for that. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On 04.01.13 at 18:25, Daniel Kiper daniel.ki...@oracle.com wrote: Right, so where is virtual mapping of control page established? I could not find relevant code in SLES kernel which does that. In the hypervisor (xen/arch/x86/machine_kexec.c:machine_kexec_load()). xen/arch/x86/machine_kexec.c:machine_kexec() then simply uses image-page_list[1]. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On 07.01.13 at 13:52, Daniel Kiper daniel.ki...@oracle.com wrote: On Mon, Jan 07, 2013 at 09:48:20AM +, Jan Beulich wrote: On 04.01.13 at 18:25, Daniel Kiper daniel.ki...@oracle.com wrote: Right, so where is virtual mapping of control page established? I could not find relevant code in SLES kernel which does that. In the hypervisor (xen/arch/x86/machine_kexec.c:machine_kexec_load()). xen/arch/x86/machine_kexec.c:machine_kexec() then simply uses image-page_list[1]. This (xen/arch/x86/machine_kexec.c:machine_kexec_load()) maps relevant page (allocated earlier by dom0) in hypervisor fixmap area. However, it does not make relevant mapping in transition page table which leads to crash when %cr3 is switched from Xen page table to transition page table. That indeed could explain _random_ failures - the fixmap entries get created with _PAGE_GLOBAL set, i.e. don't get flushed with the CR3 write unless CR4.PGE is clear. And I don't see how your allocation of intermediate page tables would help: You wouldn't know where the mapping of the control page lives until you're actually in the early relocate_kernel code. Or was it that what distinguishes your cloned code from the native original? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] drivers/xen: avoid out-of-range write in xen_add_device
On 07.01.13 at 16:08, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Sat, Jan 05, 2013 at 02:18:46PM -0500, Nickolai Zeldovich wrote: xen_add_device() in drivers/xen/pci.c allocates a struct physdev_pci_device_add on the stack and then writes to optarr[0]. The previous declaration of struct physdev_pci_device_add contained a zero-length optarr[] array, presumably assuming it will be allocated with kmalloc with a suitable number of trailing elements, but the code in xen_add_device() as a result wrote past the end of the (stack-allocated) data structure. Since xen_add_device() is the only use of struct physdev_pci_device_add in the kernel, turn optarr[] into a single-element array instead. Lets include Jan and Xen-devel on this email - as this is also changing the official header that is used in Xen. Correct - for that reason it is not the header that needs changing here, but the consumer of the header. I have to admit that I find it odd that the compiler allows automatic variables of variable size types without warning - otherwise this wouldn't have gone unnoticed. Jan Signed-off-by: Nickolai Zeldovich nicko...@csail.mit.edu --- include/xen/interface/physdev.h |6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h index 1844d31..24fd218 100644 --- a/include/xen/interface/physdev.h +++ b/include/xen/interface/physdev.h @@ -242,11 +242,7 @@ struct physdev_pci_device_add { uint8_t bus; uint8_t devfn; } physfn; -#if defined(__STDC_VERSION__) __STDC_VERSION__ = 199901L -uint32_t optarr[]; -#elif defined(__GNUC__) -uint32_t optarr[0]; -#endif +uint32_t optarr[1]; }; #define PHYSDEVOP_pci_device_remove 26 -- 1.7.10.4 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On 04.01.13 at 15:22, Daniel Kiper daniel.ki...@oracle.com wrote: On Wed, Jan 02, 2013 at 11:26:43AM +, Andrew Cooper wrote: /sbin/kexec can load the Xen crash kernel itself by issuing hypercalls using /dev/xen/privcmd. This would remove the need for the dom0 kernel to distinguish between loading a crash kernel for itself and loading a kernel for Xen. Or is this just a silly idea complicating the matter? This is impossible with current Xen kexec/kdump interface. Why? It should be changed to do that. However, I suppose that Xen community would not be interested in such changes. And again - why? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On 04.01.13 at 16:15, Daniel Kiper daniel.ki...@oracle.com wrote: On Thu, Jan 03, 2013 at 09:34:55AM +, Jan Beulich wrote: On 27.12.12 at 03:18, Daniel Kiper daniel.ki...@oracle.com wrote: Some implementations (e.g. Xen PVOPS) could not use part of identity page table to construct transition page table. It means that they require separate PUDs, PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that requirement add extra pointer to PGD, PUD, PMD and PTE and align existing code. So you keep posting this despite it having got pointed out on each earlier submission that this is unnecessary, proven by the fact that the non-pvops Xen kernels can get away without it. Why? Sorry but I forgot to reply for your email last time. I am still not convinced. I have tested SUSE kernel itself and it does not work. Maybe I missed something but... Please check arch/x86/kernel/machine_kexec_64.c:init_transition_pgtable() I can see: vaddr = (unsigned long)relocate_kernel; and later: pgd += pgd_index(vaddr); ... I think that mapping is simply irrelevant, as the code at relocate_kernel gets copied to the control page and invoked there (other than in the native case, where relocate_kernel() gets invoked directly). Jan It is wrong. relocate_kernel() virtual address in Xen is different than its virtual address in Linux Kernel. That is why transition page table could not be established in Linux Kernel and so on... How does this work in SUSE? I do not have an idea. I am happy to fix that but whatever fix for it is I would like to be sure that it works. Daniel ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On 02.01.13 at 12:26, Andrew Cooper andrew.coop...@citrix.com wrote: On 27/12/12 18:02, Eric W. Biederman wrote: It probably make sense to split them apart a little even. Thinking about this split, there might be a way to simply it even more. /sbin/kexec can load the Xen crash kernel itself by issuing hypercalls using /dev/xen/privcmd. This would remove the need for the dom0 kernel to distinguish between loading a crash kernel for itself and loading a kernel for Xen. Or is this just a silly idea complicating the matter? I don't think so (and suggested that before as a response to an earlier submission of this patch set), and it would make most of the discussion here mute. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On 27.12.12 at 03:18, Daniel Kiper daniel.ki...@oracle.com wrote: Some implementations (e.g. Xen PVOPS) could not use part of identity page table to construct transition page table. It means that they require separate PUDs, PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that requirement add extra pointer to PGD, PUD, PMD and PTE and align existing code. So you keep posting this despite it having got pointed out on each earlier submission that this is unnecessary, proven by the fact that the non-pvops Xen kernels can get away without it. Why? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
On 22.11.12 at 18:37, H. Peter Anvin h...@zytor.com wrote: I actually talked to Ian Jackson at LCE, and mentioned among other things the bogosity of requiring a PUD page for three-level paging in Linux -- a bogosity which has spread from Xen into native. It's a page wasted for no good reason, since it only contains 32 bytes worth of data, *inherently*. Furthermore, contrary to popular belief, it is *not* pa page table per se. Ian told me: I didn't know we did that, and we shouldn't have to. Here we have suffered this overhead for at least six years, ... Even the Xen kernel only needs the full page when running on a 64-bit hypervisor (now that we don't have a 32-bit hypervisor anymore, that of course basically means always). But yes, I too never liked this enforced over-allocation for native kernels (and was surprised that it was allowed in at all). Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
On 23.11.12 at 11:37, Daniel Kiper daniel.ki...@oracle.com wrote: On Fri, Nov 23, 2012 at 09:53:37AM +, Jan Beulich wrote: On 23.11.12 at 02:56, Andrew Cooper andrew.coop...@citrix.com wrote: On 23/11/2012 01:38, H. Peter Anvin wrote: I still don't really get why it can't be isolated from dom0, which would make more sense to me, even for a Xen crash. The crash region (as specified by crashkernel= on the Xen command line) is isolated from dom0. dom0 (using the kexec utility etc) has the task of locating the Xen crash notes (using the kexec hypercall interface), constructing a binary blob containing kernel, initram and gubbins, and asking Xen to put this blob in the crash region (again, using the kexec hypercall interface). I do not see how this is very much different from the native case currently (although please correct me if I am misinformed). Linux has extra work to do by populating /proc/iomem with the Xen crash regions boot (so the kexec utility can reference their physical addresses when constructing the blob), and should just act as a conduit between the kexec system call and the kexec hypercall to load the blob. But all of this _could_ be done completely independent of the Dom0 kernel's kexec infrastructure (i.e. fully from user space, invoking the necessary hypercalls through the privcmd driver). No, this is impossible. kexec/kdump image lives in dom0 kernel memory until execution. That is why privcmd driver itself is not a solution in this case. Even if so, there's no fundamental reason why that kernel image can't be put into Xen controlled space instead. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On 20.11.12 at 16:04, Daniel Kiper daniel.ki...@oracle.com wrote: Some implementations (e.g. Xen PVOPS) could not use part of identity page table to construct transition page table. It means that they require separate PUDs, PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that requirement add extra pointer to PGD, PUD, PMD and PTE and align existing code. As said for v1 already - this is not really a requirement of the interface, or else none of our Xen kernels since 2.6.30 would have worked. I don't think it is desirable to introduce overhead for everyone if it's not even needed for Xen. Jan Signed-off-by: Daniel Kiper daniel.ki...@oracle.com --- arch/x86/include/asm/kexec.h | 10 +++--- arch/x86/kernel/machine_kexec_64.c | 12 ++-- 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 317ff17..3cf5600 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -157,9 +157,13 @@ struct kimage_arch { }; #else struct kimage_arch { - pud_t *pud; - pmd_t *pmd; - pte_t *pte; + pgd_t *pgd; + pud_t *pud0; + pud_t *pud1; + pmd_t *pmd0; + pmd_t *pmd1; + pte_t *pte0; + pte_t *pte1; }; #endif diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index b3ea9db..976e54b 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -137,9 +137,9 @@ out: static void free_transition_pgtable(struct kimage *image) { - free_page((unsigned long)image-arch.pud); - free_page((unsigned long)image-arch.pmd); - free_page((unsigned long)image-arch.pte); + free_page((unsigned long)image-arch.pud0); + free_page((unsigned long)image-arch.pmd0); + free_page((unsigned long)image-arch.pte0); } static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) pud = (pud_t *)get_zeroed_page(GFP_KERNEL); if (!pud) goto err; - image-arch.pud = pud; + image-arch.pud0 = pud; set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE)); } pud = pud_offset(pgd, vaddr); @@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL); if (!pmd) goto err; - image-arch.pmd = pmd; + image-arch.pmd0 = pmd; set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE)); } pmd = pmd_offset(pud, vaddr); @@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) pte = (pte_t *)get_zeroed_page(GFP_KERNEL); if (!pte) goto err; - image-arch.pte = pte; + image-arch.pte0 = pte; set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE)); } pte = pte_offset_kernel(pmd, vaddr); -- 1.5.6.5 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] linux-next: Tree for Oct 24 (xen)
On 25.10.12 at 15:46, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Thu, Oct 25, 2012 at 11:48:30AM +0100, Stefano Stabellini wrote: On Thu, 25 Oct 2012, Jan Beulich wrote: On 24.10.12 at 23:33, Randy Dunlap rdun...@xenotime.net wrote: On 10/23/2012 09:19 PM, Stephen Rothwell wrote: Hi all, Changes since 201201023: on x86_64: drivers/built-in.o: In function `dbgp_reset_prep': (.text+0xb96b5): undefined reference to `xen_dbgp_reset_prep' drivers/built-in.o: In function `dbgp_external_startup': (.text+0xb9d95): undefined reference to `xen_dbgp_external_startup' Full randconfig file is attached. So this is because with !USB_SUPPORT but EARLY_PRINTK_DBGP dbgp_reset_prep() and dbgp_external_startup() get pointlessly defined and exported. This got broken by the merge recommendation for the ARM side changes (originally compilation of drivers/xen/dbgp.c depended on just CONFIG_XEN_DOM0). From my pov, fixing the USB side would be the clean solution (i.e. putting those function definitions inside a CONFIG_USB_SUPPORT conditional). The alternative of a smaller change would be to extend the conditional around the respective xen_dbgp_...() declarations in include/linux/usb/ehci_def.h to become #if defined(CONFIG_XEN_DOM0) defined(CONFIG_USB_SUPPORT) Please advise towards your preference. I think that your first suggestion is the right one. Can you guys spin up a patch pls and make sure it does not break compilation. Thx. I'd really like to hear Greg's opinion on which route to take before pointlessly trying the other one. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] potential integer overflow in xenbus_file_write()
On 15.10.12 at 12:25, Ian Campbell ian.campb...@citrix.com wrote: On Thu, 2012-09-13 at 19:00 +0300, Dan Carpenter wrote: Hi, Thanks Dan. I'm not sure anyone from Xen-land really monitors virtualization@. Adding xen-devel and Konrad. I was reading some code and had a question in xenbus_file_write() drivers/xen/xenbus/xenbus_dev_frontend.c 461 if ((len + u-len) sizeof(u-u.buffer)) { Can this addition overflow? len is a size_t and u-len is an unsigned int, so I expect so. Should the test be something like: if (len sizeof(u-u.buffer) || len + u-len sizeof(u-u.buffer)) { I think that would do it. Actually, it can remain a single range check: if (len sizeof(u-u.buffer) - u-len) { because the rest of the code guarantees u-len to be less or equal to sizeof(u-u.buffer) at all times. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()
On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote: My static analyzer complains about potential memory corruption in HYPERVISOR_physdev_op() arch/x86/include/asm/xen/hypercall.h 389 static inline int 390 HYPERVISOR_physdev_op(int cmd, void *arg) 391 { 392 int rc = _hypercall2(int, physdev_op, cmd, arg); 393 if (unlikely(rc == -ENOSYS)) { 394 struct physdev_op op; 395 op.cmd = cmd; 396 memcpy(op.u, arg, sizeof(op.u)); 397 rc = _hypercall1(int, physdev_op_compat, op); 398 memcpy(arg, op.u, sizeof(op.u)); ^ Some of the arg buffers are not as large as sizeof(op.u) which is either 12 or 16 depending on the size of longs in struct physdev_apic. Nasty! Doesn't the same problem also exist for HYPERVISOR_event_channel_op() then, at least theoretically (EVTCHNOP_reset is apparently the only addition here so far, but is being used by the tools only afaics)? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()
On 16.10.12 at 11:07, Jan Beulich jbeul...@suse.com wrote: On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote: My static analyzer complains about potential memory corruption in HYPERVISOR_physdev_op() arch/x86/include/asm/xen/hypercall.h 389 static inline int 390 HYPERVISOR_physdev_op(int cmd, void *arg) 391 { 392 int rc = _hypercall2(int, physdev_op, cmd, arg); 393 if (unlikely(rc == -ENOSYS)) { 394 struct physdev_op op; 395 op.cmd = cmd; 396 memcpy(op.u, arg, sizeof(op.u)); 397 rc = _hypercall1(int, physdev_op_compat, op); 398 memcpy(arg, op.u, sizeof(op.u)); ^ Some of the arg buffers are not as large as sizeof(op.u) which is either 12 or 16 depending on the size of longs in struct physdev_apic. Nasty! Doesn't the same problem also exist for HYPERVISOR_event_channel_op() then, at least theoretically (EVTCHNOP_reset is apparently the only addition here so far, but is being used by the tools only afaics)? Actually, the problem isn't tied to new additions of sub-hypercalls (I was wrongly implying this from the example originally provided), and should be visible e.g. on any use of EVTCHNOP_unmask. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()
On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote: On Fri, 2012-09-14 at 14:24 +0300, Dan Carpenter wrote: My static analyzer complains about potential memory corruption in HYPERVISOR_physdev_op() arch/x86/include/asm/xen/hypercall.h 389 static inline int 390 HYPERVISOR_physdev_op(int cmd, void *arg) 391 { 392 int rc = _hypercall2(int, physdev_op, cmd, arg); 393 if (unlikely(rc == -ENOSYS)) { 394 struct physdev_op op; 395 op.cmd = cmd; 396 memcpy(op.u, arg, sizeof(op.u)); 397 rc = _hypercall1(int, physdev_op_compat, op); 398 memcpy(arg, op.u, sizeof(op.u)); ^ Some of the arg buffers are not as large as sizeof(op.u) which is either 12 or 16 depending on the size of longs in struct physdev_apic. Nasty! Wasn't it that pv-ops expects Xen 4.0.1 or newer anyway? If so, what does this code exist for in the first place (it's framed by #if CONFIG_XEN_COMPAT = 0x030002 in the Xenified kernel)? 399 } 400 return rc; 401 } One example of this is in xen_initdom_restore_msi_irqs(). arch/x86/pci/xen.c 337 struct physdev_pci_device restore_ext; 338 339 restore_ext.seg = pci_domain_nr(dev-bus); 340 restore_ext.bus = dev-bus-number; 341 restore_ext.devfn = dev-devfn; 342 ret = HYPERVISOR_physdev_op(PHYSDEVOP_restore_msi_ext, 343 restore_ext); There are only 4 bytes here. 344 if (ret == -ENOSYS) ^^ If we hit this condition, we have corrupted some memory. I can see the memory corruption but how does it relate to ret == -ENOSYS? The (supposedly) corrupting code site inside an if (unlikely(rc == -ENOSYS)) { Supposedly because as long as the argument passed to the function is in memory accessed by the local CPU only and doesn't overlap with storage used for rc (e.g. living in a register), there's no corruption possible afaict - the second memcpy() would just copy back what the first one obtained from there. Fixing this other than by removing the broken code would be pretty hard I'm afraid (and I tend to leave the code untouched altogether in the Xenified tree). Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()
On 15.10.12 at 12:58, Ian Campbell ian.campb...@citrix.com wrote: On Mon, 2012-10-15 at 11:48 +0100, Jan Beulich wrote: On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote: On Fri, 2012-09-14 at 14:24 +0300, Dan Carpenter wrote: My static analyzer complains about potential memory corruption in HYPERVISOR_physdev_op() arch/x86/include/asm/xen/hypercall.h 389 static inline int 390 HYPERVISOR_physdev_op(int cmd, void *arg) 391 { 392 int rc = _hypercall2(int, physdev_op, cmd, arg); 393 if (unlikely(rc == -ENOSYS)) { 394 struct physdev_op op; 395 op.cmd = cmd; 396 memcpy(op.u, arg, sizeof(op.u)); 397 rc = _hypercall1(int, physdev_op_compat, op); 398 memcpy(arg, op.u, sizeof(op.u)); ^ Some of the arg buffers are not as large as sizeof(op.u) which is either 12 or 16 depending on the size of longs in struct physdev_apic. Nasty! Wasn't it that pv-ops expects Xen 4.0.1 or newer anyway? If so, what does this code exist for in the first place (it's framed by #if CONFIG_XEN_COMPAT = 0x030002 in the Xenified kernel)? I think the 4.0.1 or newer requirement is for dom0 only. I guess physdev op is only used in dom0 though? Or does passthrough need it? No, it's only platform_op that is Dom0-only. I can see the memory corruption but how does it relate to ret == -ENOSYS? The (supposedly) corrupting code site inside an if (unlikely(rc == -ENOSYS)) { Ah, for some reason I assumed this was in the eventual caller, even though it was staring me right in the face in the full quote. I think Dan's reference was to an eventual caller - it would see the -ENOSYS, as the compat call wouldn't return anything else than the modern one, and the modern one (to enter the code in question) must have returned -ENOSYS. Fixing this other than by removing the broken code would be pretty hard I'm afraid (and I tend to leave the code untouched altogether in the Xenified tree). Given that it is compat code the list of subops which needs to supported in this case is small and finite so a simple lookup table or even switch stmt for the size might be an option. Ugly, particularly for an inline function. But possible of course. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [patch 3/3] xen/privcmd: remove const modifier from declaration
On 08.09.12 at 11:58, Dan Carpenter dan.carpen...@oracle.com wrote: When we use this pointer, we cast away the const modifier and modify the data. I think it was an accident to declare it as const. NAK - the const is very valid here, as the v2 interface (as opposed to the v1 one) does _not_ modify this array (or if it does, it's a bug). This is a guarantee made to user mode, so it should also be expressed that way in the interface. But of course the cast used before this patch isn't right either, as it indeed inappropriately discards the qualifier. Afaiu this was done to simplify the internal workings of the code, but I don't think it's desirable to sacrifice type safety for implementation simplicity. Jan Signed-off-by: Dan Carpenter dan.carpen...@oracle.com diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h index a853168..58ed953 100644 --- a/include/xen/privcmd.h +++ b/include/xen/privcmd.h @@ -69,7 +69,7 @@ struct privcmd_mmapbatch_v2 { unsigned int num; /* number of pages to populate */ domid_t dom; /* target domain */ __u64 addr; /* virtual address */ - const xen_pfn_t __user *arr; /* array of mfns */ + xen_pfn_t __user *arr; /* array of mfns */ int __user *err; /* array of error codes */ }; diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c index 0ce006a..fceb83e 100644 --- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version) if (state.global_error (version == 1)) { /* Write back errors in second pass. */ - state.user_mfn = (xen_pfn_t *)m.arr; + state.user_mfn = m.arr; state.err = err_array; ret = traverse_pages(m.num, sizeof(xen_pfn_t), pagelist, mmap_return_errors_v1, state); ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 05/27] xen, cpu hotplug: Don't call cpu_bringup() in xen_play_dead()
On 01.06.12 at 11:11, Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com wrote: xen_play_dead calls cpu_bringup() which looks weird, because xen_play_dead() is invoked in the cpu down path, whereas cpu_bringup() (as the name suggests) is useful in the cpu bringup path. This might not be correct - the code as it is without this change is safe even when the vCPU gets onlined back later by an external entity (e.g. the Xen tool stack), and it would in that case resume at the return point of the VCPUOP_down hypercall. That might be a heritage from the original XenoLinux tree though, and be meaningless in pv-ops context - Jeremy, Konrad? Possibly it was bogus/unused even in that original tree - Keir? Jan Getting rid of xen_play_dead()'s dependency on cpu_bringup() helps in hooking on to the generic SMP booting framework. Also remove the extra call to preempt_enable() added by commit 41bd956 (xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while atomic) because it becomes unnecessary after this change. Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Jeremy Fitzhardinge jer...@goop.org Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: x...@kernel.org Cc: xen-de...@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com --- arch/x86/xen/smp.c |8 1 files changed, 0 insertions(+), 8 deletions(-) diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 09a7199..602d6b7 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -417,14 +417,6 @@ static void __cpuinit xen_play_dead(void) /* used only with HOTPLUG_CPU */ { play_dead_common(); HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL); - cpu_bringup(); - /* - * Balance out the preempt calls - as we are running in cpu_idle - * loop which has been called at bootup from cpu_bringup_and_idle. - * The cpucpu_bringup_and_idle called cpu_bringup which made a - * preempt_disable() So this preempt_enable will balance it out. - */ - preempt_enable(); } #else /* !CONFIG_HOTPLUG_CPU */ ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 05/27] xen, cpu hotplug: Don't call cpu_bringup() in xen_play_dead()
On 01.06.12 at 17:13, Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com wrote: On 06/01/2012 06:29 PM, Jan Beulich wrote: On 01.06.12 at 11:11, Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com wrote: xen_play_dead calls cpu_bringup() which looks weird, because xen_play_dead() is invoked in the cpu down path, whereas cpu_bringup() (as the name suggests) is useful in the cpu bringup path. This might not be correct - the code as it is without this change is safe even when the vCPU gets onlined back later by an external entity (e.g. the Xen tool stack), and it would in that case resume at the return point of the VCPUOP_down hypercall. That might be a heritage from the original XenoLinux tree though, and be meaningless in pv-ops context - Jeremy, Konrad? Possibly it was bogus/unused even in that original tree - Keir? Thanks for your comments Jan! In case this change is wrong, the other method I had in mind was to call cpu_bringup_and_idle() in xen_play_dead(). (Even ARM does something similar, in the sense that it runs the cpu bringup code including cpu_idle(), in the cpu offline path, namely the cpu_die() function). Would that approach work for xen as well? If yes, then we wouldn't have any issues to convert xen to generic code. No, that wouldn't work either afaict - the function is expected to return. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 07.05.12 at 19:25, Ingo Molnar mi...@kernel.org wrote: (apologies for the late reply, the mail just now made it to my inbox via xen-devel) I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. It does - for fully virtualized guests. For para-virtualized ones, it can't (as the hardware feature is an extension to VMX/SVM). If both Xen and KVM makes good use of it then that's a different matter. I saw in a later reply that you're now tending towards trying it out at least - thanks. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On 14.03.12 at 18:17, Justin T. Gibbs gi...@scsiguy.com wrote: On Mar 6, 2012, at 1:34 AM, Jan Beulich wrote: On 05.03.12 at 22:49, Santosh Jodh santosh.j...@citrix.com wrote: … + } + /* Create shared ring, alloc event channel. */ err = setup_blkring(dev, info); if (err) @@ -889,12 +916,35 @@ again: goto destroy_blkring; } - err = xenbus_printf(xbt, dev-nodename, - ring-ref, %u, info-ring_ref); - if (err) { - message = writing ring-ref; - goto abort_transaction; + if (legacy_backend) { Why not use the simpler interface always when info-ring_order == 0? Because, as I just found out today via a FreeBSD bug report, that's not how XenServer works. If the front-end publishes ring-page-order, the backend assumes the ring-refNN XenStore nodes are in effect, even if the order is 0. I was certainly implying to not write the ring-page-order and num-ring-pages nodes in that case. I'm working on a documentation update for blkif.h now. sigh -- Justin ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On 14.03.12 at 07:32, Justin Gibbs just...@spectralogic.com wrote: There's another problem here that I brought up during the Xen Hack-a-thon. The ring macros require that the ring element count be a power of two. This doesn't mean that the ring will be a power of 2 pages in size. To illustrate this point, I modified the FreeBSD blkback driver to provide negotiated ring stats via sysctl. Here's a connection to a Windows VM running the Citrix PV drivers: dev.xbbd.2.max_requests: 128 dev.xbbd.2.max_request_segments: 11 dev.xbbd.2.max_request_size: 45056 dev.xbbd.2.ring_elem_size: 108 = 32bit ABI dev.xbbd.2.ring_pages: 4 dev.xbbd.2.ring_elements: 128 dev.xbbd.2.ring_waste: 2496 Over half a page is wasted when ring-page-order is 2. I'm sure you can see where this is going. :-) Here are the limits published by our backend to the XenStore: max-ring-pages = 113 max-ring-page-order = 7 max-requests = 256 max-request-segments = 129 max-request-size = 524288 Because we allow so many concurrent, large requests in our product, the ring wastage really adds up if the front end doesn't support the ring-pages variant of the extension. However, you only need a ring-page-order of 3 with this protocol to start seeing pages of wasted ring space. You don't really want to negotiate ring-pages either. The backends often need to support multiple ABIs. I can easily construct a set of limits for the FreeBSD blkback driver which will cause the ring limits to vary by a page between the 32bit and 64bit ABIs. With all this in mind, the backend must do a dance of rounding up, taking the max of the ring sizes for the different ABIs, and then validating the front-end published limits taking its ABI into account. The front-end does some of this too. Its way too messy and error prone because we don't communicate the ring element limit directly. max-ring-element-order anyone? :-) Interesting observation - yes, I think deprecating both pre-existing methods in favor of something along those lines would be desirable. (But I'd favor not using the term order here as it is - at least in Linux - usually implied to be used on pages. max-ringent-log2 perhaps?) What you say also implies that all currently floating around Linux backend patches are flawed in their way of calculating the number of ring entries, as this number really depends on the protocol the frontend advertises. Further, if you're concerned about wasting ring space (and particularly in the context of your request number/size/segments extension), shouldn't we bother to define pairs (or larger groups) of struct blkif_request_segment (as currently a quarter of the space is mere padding)? Or split grefs from {first,last}_sect altogether? Finally, while looking at all this again, I stumbled across the use of blkif_vdev_t in the ring structures: At least Linux'es blkback completely ignores this field - {xen_,}vbd_translate() simply overwrites what dispatch_rw_block_io() put there (and with this, struct phys_req's dev and bdev members seem rather pointless too). Does anyone recall what the original intention with this request field was? Allowing I/O on multiple devices over a single ring? Bottom line - shouldn't we define a blkif2 interface to cleanly accommodate all the various extensions (and do away with the protocol variations)? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On 14.03.12 at 07:32, Justin Gibbs just...@spectralogic.com wrote: There's another problem here that I brought up during the Xen Hack-a-thon. The ring macros require that the ring element count be a power of two. This doesn't mean that the ring will be a power of 2 pages in size. To illustrate this point, I modified the FreeBSD blkback driver to provide negotiated ring stats via sysctl. Here's a connection to a Windows VM running the Citrix PV drivers: dev.xbbd.2.max_requests: 128 dev.xbbd.2.max_request_segments: 11 dev.xbbd.2.max_request_size: 45056 dev.xbbd.2.ring_elem_size: 108 = 32bit ABI dev.xbbd.2.ring_pages: 4 dev.xbbd.2.ring_elements: 128 dev.xbbd.2.ring_waste: 2496 Over half a page is wasted when ring-page-order is 2. I'm sure you can see where this is going. :-) Having looked a little closer on how the wasted space is progressing, I find myself in the odd position that I can't explain the original (and still active) definition of BLKIF_MAX_SEGMENTS_PER_REQUEST (11): With ring-order zero, there's 0x240/0x1c0 bytes (32/64-bit respectively) are unused. With 32 requests fitting in the ring, and with each segment occupying 6 bytes (padded to 8), in the 64-bit variant there's enough space for a 12th segment (32-bit would even have space for a 13th). Am I missing anything here? Plus all this assumes a page size of 4k, yet ia64 had always been using pages of 16k iirc. Here are the limits published by our backend to the XenStore: max-ring-pages = 113 max-ring-page-order = 7 max-requests = 256 max-request-segments = 129 max-request-size = 524288 Oh, so this protocol doesn't require ring-pages (and max-ring-pages) to be a power of two? In which case I think it is a mistake to also advertise max-ring-page-order, as at least the (Linux) frontend code I know of interprets this as being able to set up a ring of (using the numbers above) 128 pages (unless, of course, your backend can deal with this regardless of the max-ring-pages value it announces). Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices
On 06.03.12 at 18:20, Konrad Rzeszutek Wilk kon...@darnok.org wrote: - the usage of XenbusStateInitWait? Why do we introduce that? Looks like a fix to something. No, this is required to get the negotiation working (the frontend must not try to read the new nodes until it can be certain that the backend populated them). However, as already pointed out in an earlier reply to Santosh, the way this is done here doesn't appear to allow for the backend to already be in InitWait state when the frontend gets invoked. - XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal default size for SSD usage? 16? What do SSDs have to do with a XenBus definition? Imo it's wrong (and unnecessary) to introduce a limit at the XenBus level at all - each driver can do this for itself. As to the limit for SSDs in the block interface - I don't think the number of possibly simultaneous requests has anything to do with this. Instead, I'd expect the request number/size/segments extension that NetBSD apparently implements to possibly have an effect. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0001/001] xen: multi page ring support for block devices
On 05.03.12 at 22:49, Santosh Jodh santosh.j...@citrix.com wrote: Could this be split up into 3 patches, for easier reviewing: - one adjusting the xenbus interface to allow for multiple ring pages (and maybe even that one should be split into the backend and frontend related parts), syncing with the similar netback effort? - one for the blkback changes - one for the blkfront changes? --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -122,8 +122,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid) return blkif; } -static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page, -unsigned int evtchn) +static int xen_blkif_map(struct xen_blkif *blkif, int ring_ref[], As you need to touch this anyway, can you please switch this to the proper type (grant_ref_t) rather than using plain int (not just here)? +unsigned int ring_order, unsigned int evtchn) { int err; --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -135,7 +139,7 @@ static DEFINE_SPINLOCK(minor_lock); static int get_id_from_freelist(struct blkfront_info *info) { unsigned long free = info-shadow_free; - BUG_ON(free = BLK_RING_SIZE); + BUG_ON(free = BLK_MAX_RING_SIZE); Wouldn't you better check against the actual limit here? info-shadow_free = info-shadow[free].req.u.rw.id; info-shadow[free].req.u.rw.id = 0x0fee; /* debug */ return free; @@ -698,16 +704,19 @@ static void blkif_free(struct blkfront_info *info, int suspend) flush_work_sync(info-work); /* Free resources associated with old device channel. */ - if (info-ring_ref != GRANT_INVALID_REF) { - gnttab_end_foreign_access(info-ring_ref, 0, - (unsigned long)info-ring.sring); - info-ring_ref = GRANT_INVALID_REF; - info-ring.sring = NULL; + for (i = 0; i (1 info-ring_order); i++) { + if (info-ring_ref[i] != GRANT_INVALID_REF) { + gnttab_end_foreign_access(info-ring_ref[i], 0, 0); + info-ring_ref[i] = GRANT_INVALID_REF; + } } + + free_pages((unsigned long)info-ring.sring, info-ring_order); No. The freeing must continue happen in gnttab_end_foreign_access() (with the sole exception when a page was allocated but the grant didn't get established), since it must be suppressed/delayed when the grant is still in use (otherwise the kernel will die on the first re-use of the page). I just happened to fix that problem at the end of last week in the variant of the patch that we pulled into our tree. Further, rather than doing a non-zero order allocation here, I'd suggest allocating individual pages and vmap()-ing them. + info-ring.sring = NULL; + if (info-irq) unbind_from_irqhandler(info-irq, info); info-evtchn = info-irq = 0; - } static void blkif_completion(struct blk_shadow *s) @@ -875,8 +883,27 @@ static int talk_to_blkback(struct xenbus_device *dev, { const char *message = NULL; struct xenbus_transaction xbt; + unsigned int ring_order; + int legacy_backend; + int i; int err; + for (i = 0; i (1 info-ring_order); i++) + info-ring_ref[i] = GRANT_INVALID_REF; + + err = xenbus_scanf(XBT_NIL, dev-otherend, max-ring-page-order, %u, + ring_order); At least the frontend should imo also support the alternative interface (using max-ring-pages etc). + + legacy_backend = !(err == 1); + + if (legacy_backend) { + info-ring_order = 0; + } else { + info-ring_order = (ring_order = xen_blkif_ring_order) ? + ring_order : + xen_blkif_ring_order; min()? + } + /* Create shared ring, alloc event channel. */ err = setup_blkring(dev, info); if (err) @@ -889,12 +916,35 @@ again: goto destroy_blkring; } - err = xenbus_printf(xbt, dev-nodename, - ring-ref, %u, info-ring_ref); - if (err) { - message = writing ring-ref; - goto abort_transaction; + if (legacy_backend) { Why not use the simpler interface always when info-ring_order == 0? + err = xenbus_printf(xbt, dev-nodename, + ring-ref, %d, info-ring_ref[0]); + if (err) { + message = writing ring-ref; + goto abort_transaction; + } + } else { + for (i = 0; i (1 info-ring_order); i++) { + char key[sizeof(ring-ref) + 2]; + +
Re: [Xen-devel] [PATCH] blkfront: don't change to closing if we're busy
On 20.02.12 at 11:35, Andrew Jones drjo...@redhat.com wrote: - Original Message - On Fri, Feb 17, 2012 at 05:52:54PM +0100, Andrew Jones wrote: On Thu, Feb 16, 2012 at 12:33:32PM -0500, Andrew Jones wrote: Hmm, I should maybe self-nack this. The bug that led me to writing it is likely only with older tooling, such as RHEL5's. The problem was if you attempted to detach a mounted disk twice, then the second time it would succeed. The guest had flipped to Closing on the first try, and thus didn't issue an error to xenbus on the second. I see Actually, it's even worse than I thought. Just a single attempt to detach the device will cause the guest grief (with RHEL5's tools anyway). Once xenbus shows the device is in the Closing state, then tasks accessing the device may hang. The reason I only say maybe self-nack though, is because this state change seemed to be thrown in with another fix[1]. I'm not sure if the new behavior on legacy hosts was considered or not. If not, then we can consider it now. Do we want to have deferred asynch detaches over protecting the guest from multiple detach calls on legacy hosts? I guess we can keep the feature and protect the guest with a patch like I'll send in a moment. It doesn't really work for me with a RHEL5 host though. The deferred close works from the pov of the guest, but the state of the block device gets left in Closed after the guest unmounts it, and then RHEL5's tools can't detach/reattach it. If the deferred close ever worked on other Xen hosts though, then I don't believe this patch would break it, and it will also keep the guest safe when run on hosts that don't treat state=Closing the way it currently expects. There was another fix that sounds similar to this in the backend. 6f5986bce558e64fe867bff600a2127a3cb0c006 Thanks for the pointer. It doesn't look like the upstream 2.6.18 tree has that, but it probably would be a good idea there too. While I had seen the change and considered pulling it in, I wasn't really convinced this is the right behavior here: After all, if the host admin requested a resource to be removed from a guest, it shouldn't depend on the guest whether and when to honor that request, yet by deferring the disconnect you basically allow the guest to continue using the disk indefinitely. Jan However, even with that ability to patch backends, we probably want the frontends to be more robust on legacy backends for a while longer. So, it probably would be best to avoid changing the state to Closing while we're still busy, unless it's absolutely necessary. Drew ___ Xen-devel mailing list xen-de...@lists.xensource.com http://lists.xensource.com/xen-devel ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH] blkfront: don't change to closing if we're busy
On 21.02.12 at 10:23, Andrew Jones drjo...@redhat.com wrote: On 20.02.12 at 11:35, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 17, 2012 at 05:52:54PM +0100, Andrew Jones wrote: There was another fix that sounds similar to this in the backend. 6f5986bce558e64fe867bff600a2127a3cb0c006 Thanks for the pointer. It doesn't look like the upstream 2.6.18 tree has that, but it probably would be a good idea there too. While I had seen the change and considered pulling it in, I wasn't really convinced this is the right behavior here: After all, if the host admin requested a resource to be removed from a guest, it shouldn't depend on the guest whether and when to honor that request, yet by deferring the disconnect you basically allow the guest to continue using the disk indefinitely. I agree. Yesterday I wrote[1] asking if deferred detach is really something we want. At the moment, Igor and I are poking through xen-blkfront.c, and currently we'd rather see the feature dropped in favor of a simplified driver. One that has less release paths, and/or release paths with more consistent locking behavior. I must have missed this, or it's one more instance of delayed mail delivery via xen-devel. Konrad - care to revert that original change as having barked up the wrong tree? Jan [1] http://lists.xen.org/archives/html/xen-devel/2012-02/msg01672.html ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[Xen-devel] Re: [PATCH] xen: drop anti-dependency on X86_VISWS
On 08.04.11 at 17:25, Jeremy Fitzhardinge jer...@goop.org wrote: On 04/07/2011 11:38 PM, Ian Campbell wrote: Is there any downside to this patch (is X86_CMPXCHG in the same sort of boat?) Only if we don't use cmpxchg in shared memory with other domains or the hypervisor. (I don't think it will dynamically switch between real and emulated cmpxchg depending on availability.) Actually it does - see the #ifndef CONFIG_X86_CMPXCHG section in asm/cmpxchg_32.h. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 19.01.11 at 19:55, Jeremy Fitzhardinge jer...@goop.org wrote: On 01/19/2011 10:39 AM, Srivatsa Vaddagiri wrote: I have tested quite extensively with booting a 16-vcpu guest (on a 16-pcpu host) and running kernel compine (with 32-threads). Without this patch, I had difficulty booting/shutting-down successfully (it would hang mid-way). Sounds good. But I like to test with make -j 100-200 to really give things a workout. Hmm, in my experience, heavily over-committing CPUs (e.g. a domain with double or more the vCPU-s that the system has pCPU-s, or the pCPU-s the domain is allowed to run on) is a much better test for eventual problems in the spin lock code paths. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks
On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote: +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) { - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - struct xen_spinlock *prev; int irq = __get_cpu_var(lock_kicker_irq); - int ret; + struct xen_lock_waiting *w = __get_cpu_var(lock_waiting); + int cpu = smp_processor_id(); u64 start; /* If kicker interrupts not initialized yet, just spin */ if (irq == -1) - return 0; + return; start = spin_time_start(); - /* announce we're spinning */ - prev = spinning_lock(xl); + w-want = want; + w-lock = lock; + + /* This uses set_bit, which atomic and therefore a barrier */ + cpumask_set_cpu(cpu, waiting_cpus); Since you don't allow nesting, don't you need to disable interrupts before you touch per-CPU state? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote: +static inline void __ticket_enter_slowpath(struct arch_spinlock *lock) +{ + if (sizeof(lock-tickets.tail) == sizeof(u8)) + asm (LOCK_PREFIX orb %1, %0 + : +m (lock-tickets.tail) + : i (TICKET_SLOWPATH_FLAG) : memory); + else + asm (LOCK_PREFIX orw %1, %0 + : +m (lock-tickets.tail) + : i (TICKET_SLOWPATH_FLAG) : memory); +} Came only now to mind: Here and elsewhere, did you try using %z0 to have gcc produce the opcode suffix character, rather than having these somewhat ugly if()-s? Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 17.11.10 at 10:08, Jeremy Fitzhardinge jer...@goop.org wrote: On 11/17/2010 12:56 AM, Jeremy Fitzhardinge wrote: On 11/17/2010 12:52 AM, Jeremy Fitzhardinge wrote: But, yes, %z0 sounds interesting. Is it documented anywhere? I think I've tried to use it in the past and run into gcc bugs. This one: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39590 Should be OK in this case because there's no 64-bit values to be seen... Hm, it fails when __ticket_t is 16 bits: /home/jeremy/git/linux/arch/x86/include/asm/spinlock.h: Assembler messages: /home/jeremy/git/linux/arch/x86/include/asm/spinlock.h:73: Error: suffix or operands invalid for `or' lock; ors $1, 2(%rbx) #, So I don't think that's going to work out... Indeed, it's only with 4.5 that non-float operands are properly supported here. Sad. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: [PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks
On 17.11.10 at 10:57, Jeremy Fitzhardinge jer...@goop.org wrote: On 11/17/2010 12:52 AM, Jeremy Fitzhardinge wrote: On 11/17/2010 12:11 AM, Jan Beulich wrote: On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote: +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) { - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - struct xen_spinlock *prev; int irq = __get_cpu_var(lock_kicker_irq); - int ret; + struct xen_lock_waiting *w = __get_cpu_var(lock_waiting); + int cpu = smp_processor_id(); u64 start; /* If kicker interrupts not initialized yet, just spin */ if (irq == -1) - return 0; + return; start = spin_time_start(); - /* announce we're spinning */ - prev = spinning_lock(xl); + w-want = want; + w-lock = lock; + + /* This uses set_bit, which atomic and therefore a barrier */ + cpumask_set_cpu(cpu, waiting_cpus); Since you don't allow nesting, don't you need to disable interrupts before you touch per-CPU state? Yes, I think you're right - interrupts need to be disabled for the bulk of this function. Actually, on second thoughts, maybe it doesn't matter so much. The main issue is making sure that the interrupt will make the VCPU drop out of xen_poll_irq() - if it happens before xen_poll_irq(), it should leave the event pending, which will cause the poll to return immediately. I hope. Certainly disabling interrupts for some of the function will make it easier to analyze with respect to interrupt nesting. That's not my main concern. Instead, what if you get interrupted anywhere here, the interrupt handler tries to acquire another spinlock and also has to go into the slow path? It'll overwrite part or all of the outer context's state. Another issue may be making sure the writes and reads of w-want and w-lock are ordered properly to make sure that xen_unlock_kick() never sees an inconsistent view of the (lock,want) tuple. The risk being that xen_unlock_kick() sees a random, spurious (lock,want) pairing and sends the kick event to the wrong VCPU, leaving the deserving one hung. Yes, proper operation sequence (and barriers) is certainly required here. If you allowed nesting, this may even become simpler (as you'd have a single write making visible the new head pointer, after having written all relevant fields of the new head structure). Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: Next steps with pv_ops for Xen
It breaks with: Intel machine check architecture supported. (XEN) traps.c:1734:d0 Domain attempted WRMSR 0404 from :0001 to :. Intel machine check reporting enabled on CPU#0. general protection fault: [#1] SMP Modules linked in: Hm. Looks like Xen is getting upset about dom0 trying to disable caching. No, wait: 0x:? That's strange; I wonder if its just misreporting the value, because the code doesn't look like its trying to write that. Either way, the fix is to implement xen_write_cr0, and mask off any bits that Xen won't want us to set/clear (or if it doesn't allow dom0 to change cr0, just ignore all updates). Why do you think that's a CR0 write? The messages clearly indicate an MSR write, and these writes are clearly visible in intel_p{4,6}_mcheck_init() and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't check CPUID bits before trying to touch any registers... (And similarly amd_mcheck_init() is checking only the MCE bit, not the MCA one.) But then I just noticed that Xen itself doesn't clear the MCE/MCA bits either in emulate_forced_invalid_op(), apparently under the assumption that PV guests wouldn't try to make use of this feature. A simple workaround would be to force mce_disabled to 1 in early Xen initialization. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: Next steps with pv_ops for Xen
The oops and backtrace doesn't suggest it's an MSR write. Does a crX Oh, right, the MSR write is being ignored, not failed. write take the same path through the emulator as an MSR write? No, the two operations take different paths. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [patch 21/33] xen: Xen SMP guest support
--- a/arch/i386/xen/time.c +++ b/arch/i386/xen/time.c @@ -105,17 +105,15 @@ static void get_runstate_snapshot(struct preempt_enable(); } -static void setup_runstate_info(void) +static void setup_runstate_info(int cpu) { struct vcpu_register_runstate_memory_area area; - area.addr.v = __get_cpu_var(runstate); + area.addr.v = per_cpu(runstate, cpu); if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, smp_processor_id(), area)) Shouldn't this be 'cpu' rather than 'smp_processor_id()'? BUG(); - - get_runstate_snapshot(__get_cpu_var(runstate_snapshot)); } static void do_stolen_accounting(void) Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
Andi Kleen [EMAIL PROTECTED] 06.06.07 14:18 Yes, this could be an issue. Is there any way to get an interrupt or MCE when thermal throttling occurs? Yes you can get an thermal interrupt from the local APIC. See the Linux kernel source. Of course there would be still a race window. On the other hand some timing issues on throttling are probably the smallest of the users' problems when it really happens. Not if this results in your box hanging - I think throttling is exactly intended to keep the box alive as long as possible (and I've seen throttling in action, with the box happily recovering from the situation - after having seen it a few times I checked and found the fan covered with dust). Standard Linux just ignores it. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO
+static __cpuinit void reloc_dyn(Elf32_Ehdr *ehdr, unsigned offset) +{ + Elf32_Dyn *dyn = (void *)ehdr + offset; + + for(; dyn-d_tag != DT_NULL; dyn++) + switch(dyn-d_tag) { + case DT_PLTGOT: + case DT_HASH: + case DT_STRTAB: + case DT_SYMTAB: + case DT_RELA: + case DT_INIT: + case DT_FINI: + case DT_REL: + case DT_JMPREL: + case DT_VERSYM: + case DT_VERDEF: + case DT_VERNEED: + dyn-d_un.d_val += VDSO_HIGH_BASE; + } +} While there's a certain level of control on what DT_* may appear in the vDSO, not even considering other than the above types seems fragile to me. Since future additions to the set are supposedly following a fixed scheme (distinguishing pointers and values via the low bit when below OLD_DT_LOOS, and using sub-ranges when between DT_HIOS and OLD_DT_HIOS), at least also handling those would seem like a good idea, as would warning about unrecognized types. Also, even though it shouldn't matter for the final result, if doing things spec-conforming here you should use d_un.d_ptr. In addition to Roland's remarks about missing symbol table relocation, I would also assume section headers, if present, should be relocated. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO
Jeremy Fitzhardinge [EMAIL PROTECTED] 05.04.07 09:31 Jan Beulich wrote: While there's a certain level of control on what DT_* may appear in the vDSO, not even considering other than the above types seems fragile to me. Since future additions to the set are supposedly following a fixed scheme (distinguishing pointers and values via the low bit when below OLD_DT_LOOS, and using sub-ranges when between DT_HIOS and OLD_DT_HIOS), at least also handling those would seem like a good idea, as would warning about unrecognized types. I wasn't aware of this scheme. Where is it documented? Regarding the low bit part, quoting from working gABI chapter 5: To make it simpler for tools to interpret the contents of dynamic section entries, the value of each tag, except for those in two special compatibility ranges, will determine the interpretation of the d_un union. A tag whose value is an even number indicates a dynamic section entry that uses d_ptr. A tag whose value is an odd number indicates a dynamic section entry that uses d_val or that uses neither d_ptr nor d_val. Tags whose values are less than the special value DT_ENCODING and tags whose values fall between DT_HIOS and DT_LOPROC do not follow these rules. Regarding the OS range, all I can point you to is binutils' include/elf/common.h. Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO
+ for(; dyn-d_tag != DT_NULL; dyn++) + switch(dyn-d_tag) { + case DT_PLTGOT: + case DT_HASH: + case DT_STRTAB: + case DT_SYMTAB: + case DT_RELA: + case DT_INIT: + case DT_FINI: + case DT_REL: + case DT_DEBUG: + case DT_JMPREL: + case DT_VERSYM: + case DT_VERDEF: + case DT_VERNEED: + case DT_ENCODING ... DT_HIOS: + /* tags above DT_ENCODING are even if they're + a pointer, so skip odd ones */ + if (dyn-d_tag = DT_ENCODING + (dyn-d_tag 1) == 1) + break; + + dyn-d_un.d_ptr += VDSO_HIGH_BASE; + } I'm pretty certain the range OLD_DT_LOOS ... DT_LOOS must be excluded here (the document version I'm looking at is inconsistent in itself here, saying in one place to stop at DT_LOOS, in a second to stop at DT_HIOS, and in a third to include DT_LOOS ... ST_HIOS - the inconsistency goes away if assuming that the stop at DT_LOOS really means stop at OLD_DT_LOOS, and the stop at DT_HIOS misses to special-case the OLD_DT_LOOS ... DT_LOOS range. Additionally I'm a little worried about excluding DT_ADDRRNGLO ... DT_ADDRRNGHI in case future binutils ever start defaulting to generate any of these (namely DT_GNU_HASH). +#define DT_ENCODING 32 Hmm, I was about to say this ought to be 31 when I realized the discrepancy between document and binutils. I'll have to ask about this... Jan ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization