Re: [PATCH] x86/paravirt: use %rip-relative addressing in hook calls

2021-11-23 Thread Jan Beulich via Virtualization
On 05.10.2021 09:43, Juergen Gross wrote:
> On 30.09.21 14:40, Jan Beulich via Virtualization wrote:
>> While using a plain (constant) address works, its use needlessly invokes
>> a SIB addressing mode, making every call site one byte larger than
>> necessary. Instead of using an "i" constraint with address-of operator
>> and a 'c' operand modifier, simply use an ordinary "m" constraint, which
>> the 64-bit compiler will translate to %rip-relative addressing. This way
>> we also tell the compiler the truth about operand usage - the memory
>> location gets actually read, after all.
>>
>> 32-bit code generation is unaffected by the change.
>>
>> Signed-off-by: Jan Beulich 
> 
> Reviewed-by: Juergen Gross 

Thanks. I notice this wasn't part of your 5.16-rc1 pull request, nor
did it make it into Linus'es tree via any other route. May I ask what
the plans here are?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] x86/paravirt: use %rip-relative addressing in hook calls

2021-09-30 Thread Jan Beulich via Virtualization
While using a plain (constant) address works, its use needlessly invokes
a SIB addressing mode, making every call site one byte larger than
necessary. Instead of using an "i" constraint with address-of operator
and a 'c' operand modifier, simply use an ordinary "m" constraint, which
the 64-bit compiler will translate to %rip-relative addressing. This way
we also tell the compiler the truth about operand usage - the memory
location gets actually read, after all.

32-bit code generation is unaffected by the change.

Signed-off-by: Jan Beulich 

--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -278,7 +278,7 @@ extern void (*paravirt_iret)(void);
 
 #define paravirt_type(op)  \
[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),\
-   [paravirt_opptr] "i" (&(pv_ops.op))
+   [paravirt_opptr] "m" (pv_ops.op)
 #define paravirt_clobber(clobber)  \
[paravirt_clobber] "i" (clobber)
 
@@ -315,7 +315,7 @@ int paravirt_disable_iospace(void);
  */
 #define PARAVIRT_CALL  \
ANNOTATE_RETPOLINE_SAFE \
-   "call *%c[paravirt_opptr];"
+   "call *%[paravirt_opptr];"
 
 /*
  * These macros are intended to wrap calls through one of the paravirt

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/ioperm: add new paravirt function update_io_bitmap

2020-02-19 Thread Jan Beulich
On 19.02.2020 06:35, Jürgen Groß wrote:
> On 18.02.20 22:03, Thomas Gleixner wrote:
>> Juergen Gross  writes:
>>> Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control
>>> ioperm() as well") reworked the iopl syscall to use I/O bitmaps.
>>>
>>> Unfortunately this broke Xen PV domains using that syscall as there
>>> is currently no I/O bitmap support in PV domains.
>>>
>>> Add I/O bitmap support via a new paravirt function update_io_bitmap
>>> which Xen PV domains can use to update their I/O bitmaps via a
>>> hypercall.
>>>
>>> Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() 
>>> as well")
>>> Reported-by: Jan Beulich 
>>> Cc:  # 5.5
>>> Signed-off-by: Juergen Gross 
>>> Reviewed-by: Jan Beulich 
>>> Tested-by: Jan Beulich 
>>
>> Duh, sorry about that and thanks for fixing it.
>>
>> BTW, why isn't stuff like this not catched during next or at least
>> before the final release? Is nothing running CI on upstream with all
>> that XEN muck active?
> 
> This problem showed up by not being able to start the X server (probably
> not the freshest one) in dom0 on a moderate aged AMD system.

Not the freshest one, yes, but also on a system where KMS would not
be available (my success rate with KMS is rather low overall, and
with newer Linux I see rather more systems to stop working than ones
to become working, but I simply don't have the time to investigate).

Jan
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [PATCH 0/2] Remove 32-bit Xen PV guest support

2019-07-16 Thread Jan Beulich
On 15.07.2019 19:28, Andy Lutomirski wrote:
> On Mon, Jul 15, 2019 at 9:34 AM Andi Kleen  wrote:
>>
>> Juergen Gross  writes:
>>
>>> The long term plan has been to replace Xen PV guests by PVH. The first
>>> victim of that plan are now 32-bit PV guests, as those are used only
>>> rather seldom these days. Xen on x86 requires 64-bit support and with
>>> Grub2 now supporting PVH officially since version 2.04 there is no
>>> need to keep 32-bit PV guest support alive in the Linux kernel.
>>> Additionally Meltdown mitigation is not available in the kernel running
>>> as 32-bit PV guest, so dropping this mode makes sense from security
>>> point of view, too.
>>
>> Normally we have a deprecation period for feature removals like this.
>> You would make the kernel print a warning for some releases, and when
>> no user complains you can then remove. If a user complains you can't.
>>
> 
> As I understand it, the kernel rules do allow changes like this even
> if there's a complaint: this is a patch that removes what is
> effectively hardware support.  If the maintenance cost exceeds the
> value, then removal is fair game.  (Obviously we weight the value to
> preserving compatibility quite highly, but in this case, Xen dropped
> 32-bit hardware support a long time ago.  If the Xen hypervisor says
> that 32-bit PV guest support is deprecated, it's deprecated.)

Since it was implied but not explicit from Andrew's reply, just to
make it explicit: So far 32-bit PV guest support has not been
deprecated in Xen itself.

Jan
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH RESEND] x86-64: use RIP-relative calls for paravirt indirect ones

2018-08-15 Thread Jan Beulich
This saves one insn byte per instance, summing up to a savings of over
4k in my (stripped down) configuration. No variant of to be patched in
replacement code relies on the one byte larger size.

Signed-off-by: Jan Beulich 
Reviewed-by: Juergen Gross 
---
Resend to include x86 maintainers, who aren't listed explicitly for the
file changed.
---
 arch/x86/include/asm/paravirt_types.h |6 ++
 1 file changed, 6 insertions(+)

--- 4.18/arch/x86/include/asm/paravirt_types.h
+++ 4.18-x86_64-pvops-call-RIPrel/arch/x86/include/asm/paravirt_types.h
@@ -393,9 +393,15 @@ int paravirt_disable_iospace(void);
  * offset into the paravirt_patch_template structure, and can therefore be
  * freely converted back into a structure offset.
  */
+#ifdef CONFIG_X86_32
 #define PARAVIRT_CALL  \
ANNOTATE_RETPOLINE_SAFE \
"call *%c[paravirt_opptr];"
+#else
+#define PARAVIRT_CALL  \
+   ANNOTATE_RETPOLINE_SAFE \
+   "call *%c[paravirt_opptr](%%rip);"
+#endif
 
 /*
  * These macros are intended to wrap calls through one of the paravirt


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH 04/10] x86/paravirt: use a single ops structure

2018-08-10 Thread Jan Beulich
>>> On 10.08.18 at 13:52,  wrote:
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -228,9 +228,9 @@ void hyperv_setup_mmu_ops(void)
>  
>   if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED)) {
>   pr_info("Using hypercall for remote TLB flush\n");
> - pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
> + pv_ops.pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;

Taking just this as example, why not

pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;

? Both pv_ and _ops are redundant on the field names.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] x86-64: use RIP-relative calls for paravirt indirect ones

2018-06-25 Thread Jan Beulich
This saves one insn byte per instance, summing up to a savings of over
4k in my (stripped down) configuration. No variant of to be patched in
replacement code relies on the one byte larger size.

Signed-off-by: Jan Beulich 
---
 arch/x86/include/asm/paravirt_types.h |6 ++
 1 file changed, 6 insertions(+)

--- 4.18-rc2/arch/x86/include/asm/paravirt_types.h
+++ 4.18-rc2-x86_64-pvops-call-RIPrel/arch/x86/include/asm/paravirt_types.h
@@ -393,9 +393,15 @@ int paravirt_disable_iospace(void);
  * offset into the paravirt_patch_template structure, and can therefore be
  * freely converted back into a structure offset.
  */
+#ifdef CONFIG_X86_32
 #define PARAVIRT_CALL  \
ANNOTATE_RETPOLINE_SAFE \
"call *%c[paravirt_opptr];"
+#else
+#define PARAVIRT_CALL  \
+   ANNOTATE_RETPOLINE_SAFE \
+   "call *%c[paravirt_opptr](%%rip);"
+#endif
 
 /*
  * These macros are intended to wrap calls through one of the paravirt


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v2 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()

2016-03-21 Thread Jan Beulich
>>> On 21.03.16 at 13:24,  wrote:
> @@ -758,9 +759,14 @@ struct smp_sync_call_struct {
>  static void smp_call_sync_callback(struct work_struct *work)
>  {
>   struct smp_sync_call_struct *sscs;
> + unsigned int cpu = smp_processor_id();

So this obtains the vCPU number, yet ...

>   sscs = container_of(work, struct smp_sync_call_struct, work);
> + preempt_disable();
> + hypervisor_pin_vcpu(cpu);

... here you're supposed to pass a pCPU number.

Also don't you need to call smp_processor_id() after preempt_disable()?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] kasan_map_early_shadow() on Xen

2015-03-03 Thread Jan Beulich
 On 03.03.15 at 10:40, mcg...@suse.com wrote:
 Let's go down the rabbit hole for a bit. HAVE_ARCH_KASAN will be
 selected on x86 when:
 
 if X86_64  SPARSEMEM_VMEMMAP
 
 Now Xen should not have SPARSEMEM_VMEMMAP

Why would that be?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH] xen: remove unused Kconfig parameter

2013-07-09 Thread Jan Beulich
 On 09.07.13 at 02:26, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:
 Could you explain to me please why the check in the scripts is superfluous?

This is not really the question - _any_ such check can only be
wrong. The boot loader has absolutely no business caring about
kernel internals, and the kernel shouldn't be limited by it in how it
renames/adds/deletes Kconfig options and anything else.

 Especially as the grand plan is to get rid of CONFIG_XEN_DOM0 and more or 
 less have a backend and fronted config option (since that makes more sense 
 nowadays). And that would make the XEN_DOM0 be obsolete and the XEN_PRIV 
 would be the one that turns a lot of the options needed to compile a kernel 
 that can provide backend driver support. (I am hand waving here). 

That's pretty odd a plan, considering that Dom0 is more than just
an environment to provide backends. In fact, Dom0 may not be
providing any backends at all, and instead just serve the
controlling hardware and/or control domain purpose that it
really is meant to be. But of course, if none of _that_ functionality
depends on this config option, then it indeed could go away.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH] xen: remove unused Kconfig parameter

2013-07-09 Thread Jan Beulich
 On 09.07.13 at 16:48, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:
 Dom0 has been both - but there is nothing wrong with seperating these
 two labels. And therein I was thinking that the 'hardware backend domain'
 should be the introduced. I am not good with names so the best option
 seems CONFIG_XEN_PRIVILIGED, but that sounds to be like 'controlling 
 domain'.
 
 Any good ideas for names? CONFIG_XEN_HARDWARE_DOMAIN ? 
 CONFIG_XEN_PRIVILIGED_DOMAIN?

Following the naming in the hypervisor, CONFIG_XEN_HARDWARE_DOMAIN
and CONFIG_XEN_CONTROL_DOMAIN would seem reasonable to me.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] xen: remove bm_rld_set of xen_processor_flags

2013-06-04 Thread Jan Beulich
 On 04.06.13 at 10:05, liguang lig.f...@cn.fujitsu.com wrote:
 bm_rld_set seems obsolete now
 
 Signed-off-by: liguang lig.f...@cn.fujitsu.com
 ---
  include/xen/interface/platform.h |1 -
  1 files changed, 0 insertions(+), 1 deletions(-)
 
 diff --git a/include/xen/interface/platform.h 
 b/include/xen/interface/platform.h
 index c57d5f6..733 100644
 --- a/include/xen/interface/platform.h
 +++ b/include/xen/interface/platform.h
 @@ -240,7 +240,6 @@ struct xen_processor_flags {
   uint32_t bm_check:1;
   uint32_t has_cst:1;
   uint32_t power_setup_done:1;
 - uint32_t bm_rld_set:1;
  };
  
  struct xen_processor_power {

Any such patch would need to be submitted against the master copy
of the header (in the Xen repo), and by recognizing that you'd also
notice that this is part of a public ABI, and hence can't be removed,
but at best can be documented as obsolete. Of course you'd first
need to check whether the hypervisor makes any use of that bit
when passed down from Dom0.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] xen-pciback: fix error return code in pcistub_irq_handler_switch()

2013-05-31 Thread Jan Beulich
 On 31.05.13 at 13:59, Wei Yongjun weiyj...@gmail.com wrote:
 From: Wei Yongjun yongjun_...@trendmicro.com.cn
 
 Fix to return -ENOENT in the pcistub_device_find() and pci_get_drvdata()
 error handling case instead of 0(overwrite to 0 by str_to_slot()), as done
 elsewhere in this function.
 
 Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn

Acked-by: Jan Beulich jbeul...@suse.com

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Readonly GDT

2013-04-10 Thread Jan Beulich
 On 10.04.13 at 02:43, H. Peter Anvin h...@zytor.com wrote:
 OK, thinking about the GDT here.
 
 The GDT is quite small -- 256 bytes on i386, 128 bytes on x86-64.  As
 such, we probably don't want to allocate a full page to it for only
 that.  This means that in order to create a readonly mapping we have to
 pack GDTs from different CPUs together in the same pages, *or* we
 tolerate that other things on the same page gets reflected in the same
 mapping.

I think a read-only GDT is incompatible with exceptions delivered
through task gates (i.e. double fault on 32-bit), so I would assume
this needs to remain a 64-bit only thing.

 However, the packing solution has the advantage of reducing address
 space consumption which matters on 32 bits: even on i386 we can easily
 burn a megabyte of address space for 4096 processors, but burning 16
 megabytes starts to hurt.

Packing would have the additional benefit of Xen not needing to
become a special case in yet another area (because pages
containing live descriptor table entries need to be read-only for
PV guests, and need to consist of only descriptor table entries).

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.

2013-02-14 Thread Jan Beulich
 On 14.02.13 at 22:52, Greg KH gre...@linuxfoundation.org wrote:
 Jan, any reason why this patch isn't in Linus's tree already,

I see Konrad answered that already, and with the embargo on the
CVE having expired only on Tuesday, I think it's not unreasonable
to not see this there yet.

 and why it
 wasn't marked for inclusion in a -stable kernel release?

Forgot to put the tag on when putting the patch together, and
then none of the reviewers noticed either. I'm sorry for that.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2013-01-07 Thread Jan Beulich
 On 04.01.13 at 18:25, Daniel Kiper daniel.ki...@oracle.com wrote:
 Right, so where is virtual mapping of control page established?
 I could not find relevant code in SLES kernel which does that.

In the hypervisor (xen/arch/x86/machine_kexec.c:machine_kexec_load()).
xen/arch/x86/machine_kexec.c:machine_kexec() then simply uses
image-page_list[1].

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2013-01-07 Thread Jan Beulich
 On 07.01.13 at 13:52, Daniel Kiper daniel.ki...@oracle.com wrote:
 On Mon, Jan 07, 2013 at 09:48:20AM +, Jan Beulich wrote:
  On 04.01.13 at 18:25, Daniel Kiper daniel.ki...@oracle.com wrote:
  Right, so where is virtual mapping of control page established?
  I could not find relevant code in SLES kernel which does that.

 In the hypervisor (xen/arch/x86/machine_kexec.c:machine_kexec_load()).
 xen/arch/x86/machine_kexec.c:machine_kexec() then simply uses
 image-page_list[1].
 
 This (xen/arch/x86/machine_kexec.c:machine_kexec_load()) maps relevant
 page (allocated earlier by dom0) in hypervisor fixmap area. However,
 it does not make relevant mapping in transition page table which
 leads to crash when %cr3 is switched from Xen page table to
 transition page table.

That indeed could explain _random_ failures - the fixmap entries
get created with _PAGE_GLOBAL set, i.e. don't get flushed with
the CR3 write unless CR4.PGE is clear.

And I don't see how your allocation of intermediate page tables
would help: You wouldn't know where the mapping of the control
page lives until you're actually in the early relocate_kernel code.
Or was it that what distinguishes your cloned code from the
native original?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] drivers/xen: avoid out-of-range write in xen_add_device

2013-01-07 Thread Jan Beulich
 On 07.01.13 at 16:08, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:
 On Sat, Jan 05, 2013 at 02:18:46PM -0500, Nickolai Zeldovich wrote:
 xen_add_device() in drivers/xen/pci.c allocates a struct
 physdev_pci_device_add on the stack and then writes to optarr[0].
 The previous declaration of struct physdev_pci_device_add contained
 a zero-length optarr[] array, presumably assuming it will be allocated
 with kmalloc with a suitable number of trailing elements, but the code in
 xen_add_device() as a result wrote past the end of the (stack-allocated)
 data structure.
 
 Since xen_add_device() is the only use of struct physdev_pci_device_add
 in the kernel, turn optarr[] into a single-element array instead.
 
 
 Lets include Jan and Xen-devel on this email - as this is also changing
 the official header that is used in Xen.

Correct - for that reason it is not the header that needs changing
here, but the consumer of the header. I have to admit that I find
it odd that the compiler allows automatic variables of variable size
types without warning - otherwise this wouldn't have gone
unnoticed.

Jan

 Signed-off-by: Nickolai Zeldovich nicko...@csail.mit.edu
 ---
  include/xen/interface/physdev.h |6 +-
  1 file changed, 1 insertion(+), 5 deletions(-)
 
 diff --git a/include/xen/interface/physdev.h 
 b/include/xen/interface/physdev.h
 index 1844d31..24fd218 100644
 --- a/include/xen/interface/physdev.h
 +++ b/include/xen/interface/physdev.h
 @@ -242,11 +242,7 @@ struct physdev_pci_device_add {
  uint8_t bus;
  uint8_t devfn;
  } physfn;
 -#if defined(__STDC_VERSION__)  __STDC_VERSION__ = 199901L
 -uint32_t optarr[];
 -#elif defined(__GNUC__)
 -uint32_t optarr[0];
 -#endif
 +uint32_t optarr[1];
  };
  
  #define PHYSDEVOP_pci_device_remove 26
 -- 
 1.7.10.4
 



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-04 Thread Jan Beulich
 On 04.01.13 at 15:22, Daniel Kiper daniel.ki...@oracle.com wrote:
 On Wed, Jan 02, 2013 at 11:26:43AM +, Andrew Cooper wrote:
 /sbin/kexec can load the Xen crash kernel itself by issuing
 hypercalls using /dev/xen/privcmd.  This would remove the need for
 the dom0 kernel to distinguish between loading a crash kernel for
 itself and loading a kernel for Xen.

 Or is this just a silly idea complicating the matter?
 
 This is impossible with current Xen kexec/kdump interface.

Why?

 It should be changed to do that. However, I suppose that
 Xen community would not be interested in such changes.

And again - why?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2013-01-04 Thread Jan Beulich
 On 04.01.13 at 16:15, Daniel Kiper daniel.ki...@oracle.com wrote:
 On Thu, Jan 03, 2013 at 09:34:55AM +, Jan Beulich wrote:
  On 27.12.12 at 03:18, Daniel Kiper daniel.ki...@oracle.com wrote:
  Some implementations (e.g. Xen PVOPS) could not use part of identity page 
 table
  to construct transition page table. It means that they require separate 
 PUDs,
  PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
  requirement add extra pointer to PGD, PUD, PMD and PTE and align existing
  code.

 So you keep posting this despite it having got pointed out on each
 earlier submission that this is unnecessary, proven by the fact that
 the non-pvops Xen kernels can get away without it. Why?
 
 Sorry but I forgot to reply for your email last time.
 
 I am still not convinced. I have tested SUSE kernel itself and it does not 
 work.
 Maybe I missed something but... Please check 
 arch/x86/kernel/machine_kexec_64.c:init_transition_pgtable()
 
 I can see:
 
 vaddr = (unsigned long)relocate_kernel;
 
 and later:
 
 pgd += pgd_index(vaddr);
 ...

I think that mapping is simply irrelevant, as the code at
relocate_kernel gets copied to the control page and
invoked there (other than in the native case, where
relocate_kernel() gets invoked directly).

Jan

 It is wrong. relocate_kernel() virtual address in Xen is different
 than its virtual address in Linux Kernel. That is why transition
 page table could not be established in Linux Kernel and so on...
 How does this work in SUSE? I do not have an idea.
 
 I am happy to fix that but whatever fix for it is
 I would like to be sure that it works.
 
 Daniel



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-03 Thread Jan Beulich
 On 02.01.13 at 12:26, Andrew Cooper andrew.coop...@citrix.com wrote:
 On 27/12/12 18:02, Eric W. Biederman wrote:
 It probably make sense to split them apart a little even.
 
 Thinking about this split, there might be a way to simply it even more.
 
 /sbin/kexec can load the Xen crash kernel itself by issuing hypercalls 
 using /dev/xen/privcmd.  This would remove the need for the dom0 kernel 
 to distinguish between loading a crash kernel for itself and loading a 
 kernel for Xen.
 
 Or is this just a silly idea complicating the matter?

I don't think so (and suggested that before as a response to an
earlier submission of this patch set), and it would make most of
the discussion here mute.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2013-01-03 Thread Jan Beulich
 On 27.12.12 at 03:18, Daniel Kiper daniel.ki...@oracle.com wrote:
 Some implementations (e.g. Xen PVOPS) could not use part of identity page 
 table
 to construct transition page table. It means that they require separate 
 PUDs,
 PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
 requirement add extra pointer to PGD, PUD, PMD and PTE and align existing 
 code.

So you keep posting this despite it having got pointed out on each
earlier submission that this is unnecessary, proven by the fact that
the non-pvops Xen kernels can get away without it. Why?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-23 Thread Jan Beulich
 On 22.11.12 at 18:37, H. Peter Anvin h...@zytor.com wrote:
 I actually talked to Ian Jackson at LCE, and mentioned among other 
 things the bogosity of requiring a PUD page for three-level paging in 
 Linux -- a bogosity which has spread from Xen into native.  It's a page 
 wasted for no good reason, since it only contains 32 bytes worth of 
 data, *inherently*.  Furthermore, contrary to popular belief, it is 
 *not* pa page table per se.
 
 Ian told me: I didn't know we did that, and we shouldn't have to. 
 Here we have suffered this overhead for at least six years, ...

Even the Xen kernel only needs the full page when running on a
64-bit hypervisor (now that we don't have a 32-bit hypervisor
anymore, that of course basically means always). But yes, I too
never liked this enforced over-allocation for native kernels (and
was surprised that it was allowed in at all).

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-23 Thread Jan Beulich
 On 23.11.12 at 11:37, Daniel Kiper daniel.ki...@oracle.com wrote:
 On Fri, Nov 23, 2012 at 09:53:37AM +, Jan Beulich wrote:
  On 23.11.12 at 02:56, Andrew Cooper andrew.coop...@citrix.com wrote:
  On 23/11/2012 01:38, H. Peter Anvin wrote:
  I still don't really get why it can't be isolated from dom0, which would
  make more sense to me, even for a Xen crash.
 
 
  The crash region (as specified by crashkernel= on the Xen command line)
  is isolated from dom0.
 
  dom0 (using the kexec utility etc) has the task of locating the Xen
  crash notes (using the kexec hypercall interface), constructing a binary
  blob containing kernel, initram and gubbins, and asking Xen to put this
  blob in the crash region (again, using the kexec hypercall interface).
 
  I do not see how this is very much different from the native case
  currently (although please correct me if I am misinformed).  Linux has
  extra work to do by populating /proc/iomem with the Xen crash regions
  boot (so the kexec utility can reference their physical addresses when
  constructing the blob), and should just act as a conduit between the
  kexec system call and the kexec hypercall to load the blob.

 But all of this _could_ be done completely independent of the
 Dom0 kernel's kexec infrastructure (i.e. fully from user space,
 invoking the necessary hypercalls through the privcmd driver).
 
 No, this is impossible. kexec/kdump image lives in dom0 kernel memory
 until execution. That is why privcmd driver itself is not a solution
 in this case.

Even if so, there's no fundamental reason why that kernel image
can't be put into Xen controlled space instead.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2012-11-20 Thread Jan Beulich
 On 20.11.12 at 16:04, Daniel Kiper daniel.ki...@oracle.com wrote:
 Some implementations (e.g. Xen PVOPS) could not use part of identity page 
 table
 to construct transition page table. It means that they require separate 
 PUDs,
 PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
 requirement add extra pointer to PGD, PUD, PMD and PTE and align existing 
 code.

As said for v1 already - this is not really a requirement of the
interface, or else none of our Xen kernels since 2.6.30 would
have worked. I don't think it is desirable to introduce overhead
for everyone if it's not even needed for Xen.

Jan

 Signed-off-by: Daniel Kiper daniel.ki...@oracle.com
 ---
  arch/x86/include/asm/kexec.h   |   10 +++---
  arch/x86/kernel/machine_kexec_64.c |   12 ++--
  2 files changed, 13 insertions(+), 9 deletions(-)
 
 diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
 index 317ff17..3cf5600 100644
 --- a/arch/x86/include/asm/kexec.h
 +++ b/arch/x86/include/asm/kexec.h
 @@ -157,9 +157,13 @@ struct kimage_arch {
  };
  #else
  struct kimage_arch {
 - pud_t *pud;
 - pmd_t *pmd;
 - pte_t *pte;
 + pgd_t *pgd;
 + pud_t *pud0;
 + pud_t *pud1;
 + pmd_t *pmd0;
 + pmd_t *pmd1;
 + pte_t *pte0;
 + pte_t *pte1;
  };
  #endif
  
 diff --git a/arch/x86/kernel/machine_kexec_64.c 
 b/arch/x86/kernel/machine_kexec_64.c
 index b3ea9db..976e54b 100644
 --- a/arch/x86/kernel/machine_kexec_64.c
 +++ b/arch/x86/kernel/machine_kexec_64.c
 @@ -137,9 +137,9 @@ out:
  
  static void free_transition_pgtable(struct kimage *image)
  {
 - free_page((unsigned long)image-arch.pud);
 - free_page((unsigned long)image-arch.pmd);
 - free_page((unsigned long)image-arch.pte);
 + free_page((unsigned long)image-arch.pud0);
 + free_page((unsigned long)image-arch.pmd0);
 + free_page((unsigned long)image-arch.pte0);
  }
  
  static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, 
 pgd_t *pgd)
   pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
   if (!pud)
   goto err;
 - image-arch.pud = pud;
 + image-arch.pud0 = pud;
   set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
   }
   pud = pud_offset(pgd, vaddr);
 @@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, 
 pgd_t *pgd)
   pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
   if (!pmd)
   goto err;
 - image-arch.pmd = pmd;
 + image-arch.pmd0 = pmd;
   set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
   }
   pmd = pmd_offset(pud, vaddr);
 @@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, 
 pgd_t *pgd)
   pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
   if (!pte)
   goto err;
 - image-arch.pte = pte;
 + image-arch.pte0 = pte;
   set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
   }
   pte = pte_offset_kernel(pmd, vaddr);
 -- 
 1.5.6.5



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] linux-next: Tree for Oct 24 (xen)

2012-10-25 Thread Jan Beulich
 On 25.10.12 at 15:46, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:
 On Thu, Oct 25, 2012 at 11:48:30AM +0100, Stefano Stabellini wrote:
 On Thu, 25 Oct 2012, Jan Beulich wrote:
   On 24.10.12 at 23:33, Randy Dunlap rdun...@xenotime.net wrote:
   On 10/23/2012 09:19 PM, Stephen Rothwell wrote:
   
   Hi all,
   
   Changes since 201201023:
   
   
   on x86_64:
   
   drivers/built-in.o: In function `dbgp_reset_prep':
   (.text+0xb96b5): undefined reference to `xen_dbgp_reset_prep'
   drivers/built-in.o: In function `dbgp_external_startup':
   (.text+0xb9d95): undefined reference to `xen_dbgp_external_startup'
   
   
   Full randconfig file is attached.
  
  So this is because with !USB_SUPPORT but EARLY_PRINTK_DBGP
  dbgp_reset_prep() and dbgp_external_startup() get pointlessly
  defined and exported. This got broken by the merge
  recommendation for the ARM side changes (originally compilation
  of drivers/xen/dbgp.c depended on just CONFIG_XEN_DOM0).
  
  From my pov, fixing the USB side would be the clean solution (i.e.
  putting those function definitions inside a CONFIG_USB_SUPPORT
  conditional).
 
  The alternative of a smaller change would be to extend the
  conditional around the respective xen_dbgp_...() declarations
  in include/linux/usb/ehci_def.h to become
  
  #if defined(CONFIG_XEN_DOM0)  defined(CONFIG_USB_SUPPORT)
  
  Please advise towards your preference.
 
 I think that your first suggestion is the right one.
 
 Can you guys spin up a patch pls and make sure it does not break
 compilation. Thx.

I'd really like to hear Greg's opinion on which route to take before
pointlessly trying the other one.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] potential integer overflow in xenbus_file_write()

2012-10-16 Thread Jan Beulich
 On 15.10.12 at 12:25, Ian Campbell ian.campb...@citrix.com wrote:
 On Thu, 2012-09-13 at 19:00 +0300, Dan Carpenter wrote:
 Hi,
 
 Thanks Dan. I'm not sure anyone from Xen-land really monitors
 virtualization@. Adding xen-devel and Konrad.
 
 
 I was reading some code and had a question in xenbus_file_write()
 
 drivers/xen/xenbus/xenbus_dev_frontend.c
461  if ((len + u-len)  sizeof(u-u.buffer)) {
  
 Can this addition overflow?
 
 len is a size_t and u-len is an unsigned int, so I expect so.
 
   Should the test be something like:
 
  if (len  sizeof(u-u.buffer) || len + u-len  sizeof(u-u.buffer)) {
 
 I think that would do it.

Actually, it can remain a single range check:

if (len  sizeof(u-u.buffer) - u-len) {

because the rest of the code guarantees u-len to be less or
equal to sizeof(u-u.buffer) at all times.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()

2012-10-16 Thread Jan Beulich
 On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote:
 My static analyzer complains about potential memory corruption in
 HYPERVISOR_physdev_op()
 
 arch/x86/include/asm/xen/hypercall.h
389  static inline int
390  HYPERVISOR_physdev_op(int cmd, void *arg)
391  {
392  int rc = _hypercall2(int, physdev_op, cmd, arg);
393  if (unlikely(rc == -ENOSYS)) {
394  struct physdev_op op;
395  op.cmd = cmd;
396  memcpy(op.u, arg, sizeof(op.u));
397  rc = _hypercall1(int, physdev_op_compat, op);
398  memcpy(arg, op.u, sizeof(op.u));
 ^
 Some of the arg buffers are not as large as sizeof(op.u) which is either
 12 or 16 depending on the size of longs in struct physdev_apic.
 
 Nasty!

Doesn't the same problem also exist for
HYPERVISOR_event_channel_op() then, at least theoretically
(EVTCHNOP_reset is apparently the only addition here so far,
but is being used by the tools only afaics)?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()

2012-10-16 Thread Jan Beulich
 On 16.10.12 at 11:07, Jan Beulich jbeul...@suse.com wrote:
 On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote:
 My static analyzer complains about potential memory corruption in
 HYPERVISOR_physdev_op()
 
 arch/x86/include/asm/xen/hypercall.h
389  static inline int
390  HYPERVISOR_physdev_op(int cmd, void *arg)
391  {
392  int rc = _hypercall2(int, physdev_op, cmd, arg);
393  if (unlikely(rc == -ENOSYS)) {
394  struct physdev_op op;
395  op.cmd = cmd;
396  memcpy(op.u, arg, sizeof(op.u));
397  rc = _hypercall1(int, physdev_op_compat, op);
398  memcpy(arg, op.u, sizeof(op.u));
 ^
 Some of the arg buffers are not as large as sizeof(op.u) which is either
 12 or 16 depending on the size of longs in struct physdev_apic.
 
 Nasty!
 
 Doesn't the same problem also exist for
 HYPERVISOR_event_channel_op() then, at least theoretically
 (EVTCHNOP_reset is apparently the only addition here so far,
 but is being used by the tools only afaics)?

Actually, the problem isn't tied to new additions of sub-hypercalls
(I was wrongly implying this from the example originally provided),
and should be visible e.g. on any use of EVTCHNOP_unmask.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()

2012-10-15 Thread Jan Beulich
 On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote:
 On Fri, 2012-09-14 at 14:24 +0300, Dan Carpenter wrote:
 My static analyzer complains about potential memory corruption in
 HYPERVISOR_physdev_op()
 
 arch/x86/include/asm/xen/hypercall.h
389  static inline int
390  HYPERVISOR_physdev_op(int cmd, void *arg)
391  {
392  int rc = _hypercall2(int, physdev_op, cmd, arg);
393  if (unlikely(rc == -ENOSYS)) {
394  struct physdev_op op;
395  op.cmd = cmd;
396  memcpy(op.u, arg, sizeof(op.u));
397  rc = _hypercall1(int, physdev_op_compat, op);
398  memcpy(arg, op.u, sizeof(op.u));
 ^
 Some of the arg buffers are not as large as sizeof(op.u) which is either
 12 or 16 depending on the size of longs in struct physdev_apic.
 
 Nasty!

Wasn't it that pv-ops expects Xen 4.0.1 or newer anyway? If so,
what does this code exist for in the first place (it's framed by
#if CONFIG_XEN_COMPAT = 0x030002 in the Xenified kernel)?

399  }
400  return rc;
401  }
 
 One example of this is in xen_initdom_restore_msi_irqs().
 
 arch/x86/pci/xen.c
337  struct physdev_pci_device restore_ext;
338  
339  restore_ext.seg = pci_domain_nr(dev-bus);
340  restore_ext.bus = dev-bus-number;
341  restore_ext.devfn = dev-devfn;
342  ret = 
 HYPERVISOR_physdev_op(PHYSDEVOP_restore_msi_ext,
343  restore_ext);
 
 There are only 4 bytes here.
 
344  if (ret == -ENOSYS)
 ^^
 If we hit this condition, we have corrupted some memory.
 
 I can see the memory corruption but how does it relate to ret ==
 -ENOSYS?

The (supposedly) corrupting code site inside an

if (unlikely(rc == -ENOSYS)) {

Supposedly because as long as the argument passed to the
function is in memory accessed by the local CPU only and
doesn't overlap with storage used for rc (e.g. living in a
register), there's no corruption possible afaict - the second
memcpy() would just copy back what the first one obtained
from there.

Fixing this other than by removing the broken code would be
pretty hard I'm afraid (and I tend to leave the code untouched
altogether in the Xenified tree).

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] memory corruption in HYPERVISOR_physdev_op()

2012-10-15 Thread Jan Beulich
 On 15.10.12 at 12:58, Ian Campbell ian.campb...@citrix.com wrote:
 On Mon, 2012-10-15 at 11:48 +0100, Jan Beulich wrote:
  On 15.10.12 at 12:27, Ian Campbell ian.campb...@citrix.com wrote:
  On Fri, 2012-09-14 at 14:24 +0300, Dan Carpenter wrote:
  My static analyzer complains about potential memory corruption in
  HYPERVISOR_physdev_op()
  
  arch/x86/include/asm/xen/hypercall.h
 389  static inline int
 390  HYPERVISOR_physdev_op(int cmd, void *arg)
 391  {
 392  int rc = _hypercall2(int, physdev_op, cmd, arg);
 393  if (unlikely(rc == -ENOSYS)) {
 394  struct physdev_op op;
 395  op.cmd = cmd;
 396  memcpy(op.u, arg, sizeof(op.u));
 397  rc = _hypercall1(int, physdev_op_compat, op);
 398  memcpy(arg, op.u, sizeof(op.u));
  ^
  Some of the arg buffers are not as large as sizeof(op.u) which is either
  12 or 16 depending on the size of longs in struct physdev_apic.
  
  Nasty!
 
 Wasn't it that pv-ops expects Xen 4.0.1 or newer anyway? If so,
 what does this code exist for in the first place (it's framed by
 #if CONFIG_XEN_COMPAT = 0x030002 in the Xenified kernel)?
 
 I think the 4.0.1 or newer requirement is for dom0 only. I guess physdev
 op is only used in dom0 though? Or does passthrough need it?

No, it's only platform_op that is Dom0-only.

  I can see the memory corruption but how does it relate to ret ==
  -ENOSYS?
 
 The (supposedly) corrupting code site inside an
 
  if (unlikely(rc == -ENOSYS)) {
 
 Ah, for some reason I assumed this was in the eventual caller, even
 though it was staring me right in the face in the full quote.

I think Dan's reference was to an eventual caller - it would see
the -ENOSYS, as the compat call wouldn't return anything else
than the modern one, and the modern one (to enter the code
in question) must have returned -ENOSYS.

 Fixing this other than by removing the broken code would be
 pretty hard I'm afraid (and I tend to leave the code untouched
 altogether in the Xenified tree).
 
 Given that it is compat code the list of subops which needs to supported
 in this case is small and finite so a simple lookup table or even switch
 stmt for the size might be an option.

Ugly, particularly for an inline function. But possible of course.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 3/3] xen/privcmd: remove const modifier from declaration

2012-09-10 Thread Jan Beulich
 On 08.09.12 at 11:58, Dan Carpenter dan.carpen...@oracle.com wrote:
 When we use this pointer, we cast away the const modifier and modify the
 data.  I think it was an accident to declare it as const.

NAK - the const is very valid here, as the v2 interface (as opposed
to the v1 one) does _not_ modify this array (or if it does, it's a
bug). This is a guarantee made to user mode, so it should also be
expressed that way in the interface.

But of course the cast used before this patch isn't right either, as
it indeed inappropriately discards the qualifier. Afaiu this was done
to simplify the internal workings of the code, but I don't think it's
desirable to sacrifice type safety for implementation simplicity.

Jan

 Signed-off-by: Dan Carpenter dan.carpen...@oracle.com
 
 diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h
 index a853168..58ed953 100644
 --- a/include/xen/privcmd.h
 +++ b/include/xen/privcmd.h
 @@ -69,7 +69,7 @@ struct privcmd_mmapbatch_v2 {
   unsigned int num; /* number of pages to populate */
   domid_t dom;  /* target domain */
   __u64 addr;   /* virtual address */
 - const xen_pfn_t __user *arr; /* array of mfns */
 + xen_pfn_t __user *arr; /* array of mfns */
   int __user *err;  /* array of error codes */
  };
  
 diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
 index 0ce006a..fceb83e 100644
 --- a/drivers/xen/privcmd.c
 +++ b/drivers/xen/privcmd.c
 @@ -389,7 +389,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, 
 int version)
  
   if (state.global_error  (version == 1)) {
   /* Write back errors in second pass. */
 - state.user_mfn = (xen_pfn_t *)m.arr;
 + state.user_mfn = m.arr;
   state.err  = err_array;
   ret = traverse_pages(m.num, sizeof(xen_pfn_t),
pagelist, mmap_return_errors_v1, state);
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org 
 http://lists.xen.org/xen-devel 



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH 05/27] xen, cpu hotplug: Don't call cpu_bringup() in xen_play_dead()

2012-06-01 Thread Jan Beulich
 On 01.06.12 at 11:11, Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
wrote:
 xen_play_dead calls cpu_bringup() which looks weird, because xen_play_dead()
 is invoked in the cpu down path, whereas cpu_bringup() (as the name 
 suggests) is useful in the cpu bringup path.

This might not be correct - the code as it is without this change is
safe even when the vCPU gets onlined back later by an external
entity (e.g. the Xen tool stack), and it would in that case resume
at the return point of the VCPUOP_down hypercall. That might
be a heritage from the original XenoLinux tree though, and be
meaningless in pv-ops context - Jeremy, Konrad?

Possibly it was bogus/unused even in that original tree - Keir?

Jan

 Getting rid of xen_play_dead()'s dependency on cpu_bringup() helps in 
 hooking on to the generic SMP booting framework.
 
 Also remove the extra call to preempt_enable() added by commit 41bd956
 (xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while
 atomic) because it becomes unnecessary after this change.
 
 Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 Cc: Jeremy Fitzhardinge jer...@goop.org
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Ingo Molnar mi...@redhat.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: x...@kernel.org 
 Cc: xen-de...@lists.xensource.com 
 Cc: virtualization@lists.linux-foundation.org 
 Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
 ---
 
  arch/x86/xen/smp.c |8 
  1 files changed, 0 insertions(+), 8 deletions(-)
 
 diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
 index 09a7199..602d6b7 100644
 --- a/arch/x86/xen/smp.c
 +++ b/arch/x86/xen/smp.c
 @@ -417,14 +417,6 @@ static void __cpuinit xen_play_dead(void) /* used only 
 with HOTPLUG_CPU */
  {
   play_dead_common();
   HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL);
 - cpu_bringup();
 - /*
 -  * Balance out the preempt calls - as we are running in cpu_idle
 -  * loop which has been called at bootup from cpu_bringup_and_idle.
 -  * The cpucpu_bringup_and_idle called cpu_bringup which made a
 -  * preempt_disable() So this preempt_enable will balance it out.
 -  */
 - preempt_enable();
  }
  
  #else /* !CONFIG_HOTPLUG_CPU */
 
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org 
 http://lists.xen.org/xen-devel 



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH 05/27] xen, cpu hotplug: Don't call cpu_bringup() in xen_play_dead()

2012-06-01 Thread Jan Beulich
 On 01.06.12 at 17:13, Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
wrote:
 On 06/01/2012 06:29 PM, Jan Beulich wrote:
 
 On 01.06.12 at 11:11, Srivatsa S. Bhat 
 srivatsa.b...@linux.vnet.ibm.com
 wrote:
 xen_play_dead calls cpu_bringup() which looks weird, because xen_play_dead()
 is invoked in the cpu down path, whereas cpu_bringup() (as the name 
 suggests) is useful in the cpu bringup path.
 
 This might not be correct - the code as it is without this change is
 safe even when the vCPU gets onlined back later by an external
 entity (e.g. the Xen tool stack), and it would in that case resume
 at the return point of the VCPUOP_down hypercall. That might
 be a heritage from the original XenoLinux tree though, and be
 meaningless in pv-ops context - Jeremy, Konrad?
 
 Possibly it was bogus/unused even in that original tree - Keir?

 
 
 Thanks for your comments Jan!
 
 In case this change is wrong, the other method I had in mind was to call
 cpu_bringup_and_idle() in xen_play_dead(). (Even ARM does something similar,
 in the sense that it runs the cpu bringup code including cpu_idle(), in the
 cpu offline path, namely the cpu_die() function). Would that approach work
 for xen as well? If yes, then we wouldn't have any issues to convert xen to
 generic code.

No, that wouldn't work either afaict - the function is expected
to return.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-15 Thread Jan Beulich
 On 07.05.12 at 19:25, Ingo Molnar mi...@kernel.org wrote:

(apologies for the late reply, the mail just now made it to my inbox
via xen-devel)

 I'll hold off on the whole thing - frankly, we don't want this 
 kind of Xen-only complexity. If KVM can make use of PLE then Xen 
 ought to be able to do it as well.

It does - for fully virtualized guests. For para-virtualized ones,
it can't (as the hardware feature is an extension to VMX/SVM).

 If both Xen and KVM makes good use of it then that's a different 
 matter.

I saw in a later reply that you're now tending towards trying it
out at least - thanks.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-15 Thread Jan Beulich
 On 14.03.12 at 18:17, Justin T. Gibbs gi...@scsiguy.com wrote:
 On Mar 6, 2012, at 1:34 AM, Jan Beulich wrote:
 
 On 05.03.12 at 22:49, Santosh Jodh santosh.j...@citrix.com wrote:
 
 …
 
 +   }
 +
/* Create shared ring, alloc event channel. */
err = setup_blkring(dev, info);
if (err)
 @@ -889,12 +916,35 @@ again:
goto destroy_blkring;
}
 
 -   err = xenbus_printf(xbt, dev-nodename,
 -   ring-ref, %u, info-ring_ref);
 -   if (err) {
 -   message = writing ring-ref;
 -   goto abort_transaction;
 +   if (legacy_backend) {
 
 Why not use the simpler interface always when info-ring_order == 0?
 
 Because, as I just found out today via a FreeBSD bug report, that's
 not how XenServer works.  If the front-end publishes ring-page-order,
 the backend assumes the ring-refNN XenStore nodes are in effect,
 even if the order is 0.

I was certainly implying to not write the ring-page-order and
num-ring-pages nodes in that case.

 I'm working on a documentation update for blkif.h now.
 
 sigh
 
 --
 Justin


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Jan Beulich
 On 14.03.12 at 07:32, Justin Gibbs just...@spectralogic.com wrote:
 There's another problem here that I brought up during the Xen
 Hack-a-thon.  The ring macros require that the ring element count
 be a power of two.  This doesn't mean that the ring will be a power
 of 2 pages in size.  To illustrate this point, I modified the FreeBSD
 blkback driver to provide negotiated ring stats via sysctl.
 
 Here's a connection to a Windows VM running the Citrix PV drivers:
 
 dev.xbbd.2.max_requests: 128
 dev.xbbd.2.max_request_segments: 11
 dev.xbbd.2.max_request_size: 45056
 dev.xbbd.2.ring_elem_size: 108  = 32bit ABI
 dev.xbbd.2.ring_pages: 4
 dev.xbbd.2.ring_elements: 128
 dev.xbbd.2.ring_waste: 2496
 
 Over half a page is wasted when ring-page-order is 2.  I'm sure you
 can see where this is going.  :-)
 
 Here are the limits published by our backend to the XenStore:
 
 max-ring-pages = 113
 max-ring-page-order = 7
 max-requests = 256
 max-request-segments = 129
 max-request-size = 524288
 
 Because we allow so many concurrent, large requests in our product,
 the ring wastage really adds up if the front end doesn't support
 the ring-pages variant of the extension.  However, you only need
 a ring-page-order of 3 with this protocol to start seeing pages of
 wasted ring space.
 
 You don't really want to negotiate ring-pages either.  The backends
 often need to support multiple ABIs.  I can easily construct a set
 of limits for the FreeBSD blkback driver which will cause the ring
 limits to vary by a page between the 32bit and 64bit ABIs.
 
 With all this in mind, the backend must do a dance of rounding up,
 taking the max of the ring sizes for the different ABIs, and then
 validating the front-end published limits taking its ABI into
 account.  The front-end does some of this too.  Its way too messy
 and error prone because we don't communicate the ring element limit
 directly.
 
 max-ring-element-order anyone? :-)

Interesting observation - yes, I think deprecating both pre-existing
methods in favor of something along those lines would be desirable.
(But I'd favor not using the term order here as it is - at least in
Linux - usually implied to be used on pages. max-ringent-log2
perhaps?)

What you say also implies that all currently floating around Linux
backend patches are flawed in their way of calculating the number
of ring entries, as this number really depends on the protocol the
frontend advertises.

Further, if you're concerned about wasting ring space (and
particularly in the context of your request number/size/segments
extension), shouldn't we bother to define pairs (or larger groups)
of struct blkif_request_segment (as currently a quarter of the space
is mere padding)? Or split grefs from {first,last}_sect altogether?

Finally, while looking at all this again, I stumbled across the use
of blkif_vdev_t in the ring structures: At least Linux'es blkback
completely ignores this field - {xen_,}vbd_translate() simply
overwrites what dispatch_rw_block_io() put there (and with this,
struct phys_req's dev and bdev members seem rather pointless too).
Does anyone recall what the original intention with this request field
was? Allowing I/O on multiple devices over a single ring?

Bottom line - shouldn't we define a blkif2 interface to cleanly
accommodate all the various extensions (and do away with the
protocol variations)?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-14 Thread Jan Beulich
 On 14.03.12 at 07:32, Justin Gibbs just...@spectralogic.com wrote:
 There's another problem here that I brought up during the Xen
 Hack-a-thon.  The ring macros require that the ring element count
 be a power of two.  This doesn't mean that the ring will be a power
 of 2 pages in size.  To illustrate this point, I modified the FreeBSD
 blkback driver to provide negotiated ring stats via sysctl.
 
 Here's a connection to a Windows VM running the Citrix PV drivers:
 
 dev.xbbd.2.max_requests: 128
 dev.xbbd.2.max_request_segments: 11
 dev.xbbd.2.max_request_size: 45056
 dev.xbbd.2.ring_elem_size: 108  = 32bit ABI
 dev.xbbd.2.ring_pages: 4
 dev.xbbd.2.ring_elements: 128
 dev.xbbd.2.ring_waste: 2496
 
 Over half a page is wasted when ring-page-order is 2.  I'm sure you
 can see where this is going.  :-)

Having looked a little closer on how the wasted space is progressing,
I find myself in the odd position that I can't explain the original (and
still active) definition of BLKIF_MAX_SEGMENTS_PER_REQUEST (11):
With ring-order zero, there's 0x240/0x1c0 bytes (32/64-bit
respectively) are unused. With 32 requests fitting in the ring, and with
each segment occupying 6 bytes (padded to 8), in the 64-bit variant
there's enough space for a 12th segment (32-bit would even have
space for a 13th). Am I missing anything here?

Plus all this assumes a page size of 4k, yet ia64 had always been using
pages of 16k iirc.

 Here are the limits published by our backend to the XenStore:

 max-ring-pages = 113
 max-ring-page-order = 7
 max-requests = 256
 max-request-segments = 129
 max-request-size = 524288

Oh, so this protocol doesn't require ring-pages (and max-ring-pages)
to be a power of two? In which case I think it is a mistake to also
advertise max-ring-page-order, as at least the (Linux) frontend code
I know of interprets this as being able to set up a ring of (using the
numbers above) 128 pages (unless, of course, your backend can deal
with this regardless of the max-ring-pages value it announces).

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-07 Thread Jan Beulich
 On 06.03.12 at 18:20, Konrad Rzeszutek Wilk kon...@darnok.org wrote:
  - the usage of XenbusStateInitWait? Why do we introduce that? Looks
 like a fix to something.

No, this is required to get the negotiation working (the frontend must
not try to read the new nodes until it can be certain that the backend
populated them). However, as already pointed out in an earlier reply
to Santosh, the way this is done here doesn't appear to allow for the
backend to already be in InitWait state when the frontend gets
invoked.

 - XENBUS_MAX_RING_PAGES - why 2? Why not 4? What is the optimal
 default size for SSD usage? 16?

What do SSDs have to do with a XenBus definition? Imo it's wrong (and
unnecessary) to introduce a limit at the XenBus level at all - each driver
can do this for itself.

As to the limit for SSDs in the block interface - I don't think the number
of possibly simultaneous requests has anything to do with this. Instead,
I'd expect the request number/size/segments extension that NetBSD
apparently implements to possibly have an effect.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0001/001] xen: multi page ring support for block devices

2012-03-06 Thread Jan Beulich
 On 05.03.12 at 22:49, Santosh Jodh santosh.j...@citrix.com wrote:

Could this be split up into 3 patches, for easier reviewing:
- one adjusting the xenbus interface to allow for multiple ring pages (and
  maybe even that one should be split into the backend and frontend
  related parts), syncing with the similar netback effort?
- one for the blkback changes
- one for the blkfront changes?

 --- a/drivers/block/xen-blkback/xenbus.c
 +++ b/drivers/block/xen-blkback/xenbus.c
 @@ -122,8 +122,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
 return blkif;
  }
 
 -static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
 -unsigned int evtchn)
 +static int xen_blkif_map(struct xen_blkif *blkif, int ring_ref[],

As you need to touch this anyway, can you please switch this to the
proper type (grant_ref_t) rather than using plain int (not just here)?

 +unsigned int ring_order, unsigned int evtchn)
  {
 int err;
 
 --- a/drivers/block/xen-blkfront.c
 +++ b/drivers/block/xen-blkfront.c
 @@ -135,7 +139,7 @@ static DEFINE_SPINLOCK(minor_lock);
  static int get_id_from_freelist(struct blkfront_info *info)
  {
 unsigned long free = info-shadow_free;
 -   BUG_ON(free = BLK_RING_SIZE);
 +   BUG_ON(free = BLK_MAX_RING_SIZE);

Wouldn't you better check against the actual limit here?

 info-shadow_free = info-shadow[free].req.u.rw.id;
 info-shadow[free].req.u.rw.id = 0x0fee; /* debug */
 return free;
 @@ -698,16 +704,19 @@ static void blkif_free(struct blkfront_info *info, int 
 suspend)
 flush_work_sync(info-work);
 
 /* Free resources associated with old device channel. */
 -   if (info-ring_ref != GRANT_INVALID_REF) {
 -   gnttab_end_foreign_access(info-ring_ref, 0,
 - (unsigned long)info-ring.sring);
 -   info-ring_ref = GRANT_INVALID_REF;
 -   info-ring.sring = NULL;
 +   for (i = 0; i  (1  info-ring_order); i++) {
 +   if (info-ring_ref[i] != GRANT_INVALID_REF) {
 +   gnttab_end_foreign_access(info-ring_ref[i], 0, 0);
 +   info-ring_ref[i] = GRANT_INVALID_REF;
 +   }
 }
 +
 +   free_pages((unsigned long)info-ring.sring, info-ring_order);

No. The freeing must continue happen in gnttab_end_foreign_access()
(with the sole exception when a page was allocated but the grant
didn't get established), since it must be suppressed/delayed when the
grant is still in use (otherwise the kernel will die on the first re-use of
the page). I just happened to fix that problem at the end of last week
in the variant of the patch that we pulled into our tree.

Further, rather than doing a non-zero order allocation here, I'd
suggest allocating individual pages and vmap()-ing them.

 +   info-ring.sring = NULL;
 +
 if (info-irq)
 unbind_from_irqhandler(info-irq, info);
 info-evtchn = info-irq = 0;
 -
  }
 
  static void blkif_completion(struct blk_shadow *s)
 @@ -875,8 +883,27 @@ static int talk_to_blkback(struct xenbus_device *dev,
  {
 const char *message = NULL;
 struct xenbus_transaction xbt;
 +   unsigned int ring_order;
 +   int legacy_backend;
 +   int i;
 int err;
 
 +   for (i = 0; i  (1  info-ring_order); i++)
 +   info-ring_ref[i] = GRANT_INVALID_REF;
 +
 +   err = xenbus_scanf(XBT_NIL, dev-otherend, max-ring-page-order, 
 %u,
 +  ring_order);

At least the frontend should imo also support the alternative interface
(using max-ring-pages etc).

 +
 +   legacy_backend = !(err == 1);
 +
 +   if (legacy_backend) {
 +   info-ring_order = 0;
 +   } else {
 +   info-ring_order = (ring_order = xen_blkif_ring_order) ?
 +  ring_order :
 +  xen_blkif_ring_order;

min()?

 +   }
 +
 /* Create shared ring, alloc event channel. */
 err = setup_blkring(dev, info);
 if (err)
 @@ -889,12 +916,35 @@ again:
 goto destroy_blkring;
 }
 
 -   err = xenbus_printf(xbt, dev-nodename,
 -   ring-ref, %u, info-ring_ref);
 -   if (err) {
 -   message = writing ring-ref;
 -   goto abort_transaction;
 +   if (legacy_backend) {

Why not use the simpler interface always when info-ring_order == 0?

 +   err = xenbus_printf(xbt, dev-nodename,
 +   ring-ref, %d, info-ring_ref[0]);
 +   if (err) {
 +   message = writing ring-ref;
 +   goto abort_transaction;
 +   }
 +   } else {
 +   for (i = 0; i  (1  info-ring_order); i++) {
 +   char key[sizeof(ring-ref) + 2];
 +
 +   

Re: [Xen-devel] [PATCH] blkfront: don't change to closing if we're busy

2012-02-21 Thread Jan Beulich
 On 20.02.12 at 11:35, Andrew Jones drjo...@redhat.com wrote:

 
 - Original Message -
 On Fri, Feb 17, 2012 at 05:52:54PM +0100, Andrew Jones wrote:
  On Thu, Feb 16, 2012 at 12:33:32PM -0500, Andrew Jones wrote:
   Hmm, I should maybe self-nack this. The bug that led me to
   writing
   it is likely only with older tooling, such as RHEL5's. The
   problem
   was if you attempted to detach a mounted disk twice, then the
   second
   time it would succeed. The guest had flipped to Closing on the
   first
   try, and thus didn't issue an error to xenbus on the second. I
   see
  
  Actually, it's even worse than I thought. Just a single attempt to
  detach the device will cause the guest grief (with RHEL5's tools
  anyway). Once xenbus shows the device is in the Closing state, then
  tasks accessing the device may hang.
  
   The reason I only say maybe self-nack though, is because this
   state
   change seemed to be thrown in with another fix[1]. I'm not sure
   if
   the new behavior on legacy hosts was considered or not. If not,
   then
   we can consider it now. Do we want to have deferred asynch
   detaches
   over protecting the guest from multiple detach calls on legacy
   hosts?
   
  
  I guess we can keep the feature and protect the guest with a patch
  like
  I'll send in a moment. It doesn't really work for me with a RHEL5
  host
  though. The deferred close works from the pov of the guest, but the
  state of the block device gets left in Closed after the guest
  unmounts
  it, and then RHEL5's tools can't detach/reattach it. If the
  deferred
  close ever worked on other Xen hosts though, then I don't believe
  this
  patch would break it, and it will also keep the guest safe when run
  on
  hosts that don't treat state=Closing the way it currently expects.
 
 There was another fix that sounds similar to this in the backend.
 6f5986bce558e64fe867bff600a2127a3cb0c006
 
 
 Thanks for the pointer. It doesn't look like the upstream 2.6.18
 tree has that, but it probably would be a good idea there too.

While I had seen the change and considered pulling it in, I wasn't
really convinced this is the right behavior here: After all, if the host
admin requested a resource to be removed from a guest, it shouldn't
depend on the guest whether and when to honor that request, yet
by deferring the disconnect you basically allow the guest to continue
using the disk indefinitely.

Jan

 However, even with that ability to patch backends, we probably
 want the frontends to be more robust on legacy backends for a while
 longer. So, it probably would be best to avoid changing the state to
 Closing while we're still busy, unless it's absolutely necessary.
 
 Drew
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xensource.com 
 http://lists.xensource.com/xen-devel 



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH] blkfront: don't change to closing if we're busy

2012-02-21 Thread Jan Beulich
 On 21.02.12 at 10:23, Andrew Jones drjo...@redhat.com wrote:
  On 20.02.12 at 11:35, Andrew Jones drjo...@redhat.com wrote:
  On Fri, Feb 17, 2012 at 05:52:54PM +0100, Andrew Jones wrote:
  There was another fix that sounds similar to this in the backend.
  6f5986bce558e64fe867bff600a2127a3cb0c006
  
  
  Thanks for the pointer. It doesn't look like the upstream 2.6.18
  tree has that, but it probably would be a good idea there too.
 
 While I had seen the change and considered pulling it in, I wasn't
 really convinced this is the right behavior here: After all, if the
 host
 admin requested a resource to be removed from a guest, it shouldn't
 depend on the guest whether and when to honor that request, yet
 by deferring the disconnect you basically allow the guest to continue
 using the disk indefinitely.
 
 
 I agree. Yesterday I wrote[1] asking if deferred detach is really
 something we want. At the moment, Igor and I are poking through
 xen-blkfront.c, and currently we'd rather see the feature dropped
 in favor of a simplified driver. One that has less release paths,
 and/or release paths with more consistent locking behavior.

I must have missed this, or it's one more instance of delayed mail
delivery via xen-devel.

Konrad - care to revert that original change as having barked up
the wrong tree?

Jan

 [1] http://lists.xen.org/archives/html/xen-devel/2012-02/msg01672.html 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[Xen-devel] Re: [PATCH] xen: drop anti-dependency on X86_VISWS

2011-04-08 Thread Jan Beulich
 On 08.04.11 at 17:25, Jeremy Fitzhardinge jer...@goop.org wrote:
 On 04/07/2011 11:38 PM, Ian Campbell wrote:
 Is there any downside to this patch (is X86_CMPXCHG in the same sort of
 boat?)
 
 Only if we don't use cmpxchg in shared memory with other domains or the
 hypervisor.  (I don't think it will dynamically switch between real and
 emulated cmpxchg depending on availability.)

Actually it does - see the #ifndef CONFIG_X86_CMPXCHG section
in asm/cmpxchg_32.h.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 13/14] x86/ticketlock: add slowpath logic

2011-01-20 Thread Jan Beulich
 On 19.01.11 at 19:55, Jeremy Fitzhardinge jer...@goop.org wrote:
 On 01/19/2011 10:39 AM, Srivatsa Vaddagiri wrote:
 I have tested quite extensively with booting a 16-vcpu guest (on a 16-pcpu 
 host)
 and running kernel compine (with 32-threads). Without this patch, I had
 difficulty booting/shutting-down successfully (it would hang mid-way).
 
 Sounds good.  But I like to test with make -j 100-200 to really give
 things a workout.

Hmm, in my experience, heavily over-committing CPUs (e.g. a
domain with double or more the vCPU-s that the system has
pCPU-s, or the pCPU-s the domain is allowed to run on) is a
much better test for eventual problems in the spin lock code
paths.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks

2010-11-17 Thread Jan Beulich
 On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote:
 +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want)
  {
 - struct xen_spinlock *xl = (struct xen_spinlock *)lock;
 - struct xen_spinlock *prev;
   int irq = __get_cpu_var(lock_kicker_irq);
 - int ret;
 + struct xen_lock_waiting *w = __get_cpu_var(lock_waiting);
 + int cpu = smp_processor_id();
   u64 start;
  
   /* If kicker interrupts not initialized yet, just spin */
   if (irq == -1)
 - return 0;
 + return;
  
   start = spin_time_start();
  
 - /* announce we're spinning */
 - prev = spinning_lock(xl);
 + w-want = want;
 + w-lock = lock;
 +
 + /* This uses set_bit, which atomic and therefore a barrier */
 + cpumask_set_cpu(cpu, waiting_cpus);

Since you don't allow nesting, don't you need to disable
interrupts before you touch per-CPU state?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 13/14] x86/ticketlock: add slowpath logic

2010-11-17 Thread Jan Beulich
 On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote:
 +static inline void __ticket_enter_slowpath(struct arch_spinlock *lock)
 +{
 + if (sizeof(lock-tickets.tail) == sizeof(u8))
 + asm (LOCK_PREFIX orb %1, %0
 +  : +m (lock-tickets.tail)
 +  : i (TICKET_SLOWPATH_FLAG) : memory);
 + else
 + asm (LOCK_PREFIX orw %1, %0
 +  : +m (lock-tickets.tail)
 +  : i (TICKET_SLOWPATH_FLAG) : memory);
 +}

Came only now to mind: Here and elsewhere, did you try using
%z0 to have gcc produce the opcode suffix character, rather
than having these somewhat ugly if()-s?

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 13/14] x86/ticketlock: add slowpath logic

2010-11-17 Thread Jan Beulich
 On 17.11.10 at 10:08, Jeremy Fitzhardinge jer...@goop.org wrote:
 On 11/17/2010 12:56 AM, Jeremy Fitzhardinge wrote:
 On 11/17/2010 12:52 AM, Jeremy Fitzhardinge wrote:
 But, yes, %z0 sounds interesting.  Is it documented anywhere?  I think
 I've tried to use it in the past and run into gcc bugs.
 This one: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39590 

 Should be OK in this case because there's no 64-bit values to be seen...
 Hm, it fails when __ticket_t is 16 bits:
 
 /home/jeremy/git/linux/arch/x86/include/asm/spinlock.h: Assembler messages:
 /home/jeremy/git/linux/arch/x86/include/asm/spinlock.h:73: Error: suffix
 or operands invalid for `or'
 
 lock; ors $1, 2(%rbx)   #,
 
 
 So I don't think that's going to work out...

Indeed, it's only with 4.5 that non-float operands are properly
supported here. Sad.

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: [PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks

2010-11-17 Thread Jan Beulich
 On 17.11.10 at 10:57, Jeremy Fitzhardinge jer...@goop.org wrote:
 On 11/17/2010 12:52 AM, Jeremy Fitzhardinge wrote:
 On 11/17/2010 12:11 AM, Jan Beulich wrote:
 On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote:
 +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want)
  {
 -  struct xen_spinlock *xl = (struct xen_spinlock *)lock;
 -  struct xen_spinlock *prev;
int irq = __get_cpu_var(lock_kicker_irq);
 -  int ret;
 +  struct xen_lock_waiting *w = __get_cpu_var(lock_waiting);
 +  int cpu = smp_processor_id();
u64 start;
  
/* If kicker interrupts not initialized yet, just spin */
if (irq == -1)
 -  return 0;
 +  return;
  
start = spin_time_start();
  
 -  /* announce we're spinning */
 -  prev = spinning_lock(xl);
 +  w-want = want;
 +  w-lock = lock;
 +
 +  /* This uses set_bit, which atomic and therefore a barrier */
 +  cpumask_set_cpu(cpu, waiting_cpus);
 Since you don't allow nesting, don't you need to disable
 interrupts before you touch per-CPU state?
 Yes, I think you're right - interrupts need to be disabled for the bulk
 of this function.
 
 Actually, on second thoughts, maybe it doesn't matter so much.  The main
 issue is making sure that the interrupt will make the VCPU drop out of
 xen_poll_irq() - if it happens before xen_poll_irq(), it should leave
 the event pending, which will cause the poll to return immediately.  I
 hope.  Certainly disabling interrupts for some of the function will make
 it easier to analyze with respect to interrupt nesting.

That's not my main concern. Instead, what if you get interrupted
anywhere here, the interrupt handler tries to acquire another
spinlock and also has to go into the slow path? It'll overwrite part
or all of the outer context's state.

 Another issue may be making sure the writes and reads of w-want and
 w-lock are ordered properly to make sure that xen_unlock_kick() never
 sees an inconsistent view of the (lock,want) tuple.  The risk being that
 xen_unlock_kick() sees a random, spurious (lock,want) pairing and sends
 the kick event to the wrong VCPU, leaving the deserving one hung.

Yes, proper operation sequence (and barriers) is certainly
required here. If you allowed nesting, this may even become
simpler (as you'd have a single write making visible the new
head pointer, after having written all relevant fields of the
new head structure).

Jan

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-27 Thread Jan Beulich
 It breaks with:

 Intel machine check architecture supported.
 (XEN) traps.c:1734:d0 Domain attempted WRMSR 0404 from :0001 
 to
 :.
 Intel machine check reporting enabled on CPU#0.
 general protection fault:  [#1] SMP
 Modules linked in:
   

Hm.  Looks like Xen is getting upset about dom0 trying to disable
caching.  No, wait: 0x:?  That's strange; I wonder if
its just misreporting the value, because the code doesn't look like its
trying to write that.

Either way, the fix is to implement xen_write_cr0, and mask off any bits
that Xen won't want us to set/clear (or if it doesn't allow dom0 to
change cr0, just ignore all updates).

Why do you think that's a CR0 write? The messages clearly indicate an
MSR write, and these writes are clearly visible in intel_p{4,6}_mcheck_init()
and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't
check CPUID bits before trying to touch any registers... (And similarly
amd_mcheck_init() is checking only the MCE bit, not the MCA one.)

But then I just noticed that Xen itself doesn't clear the MCE/MCA bits either
in emulate_forced_invalid_op(), apparently under the assumption that PV
guests wouldn't try to make use of this feature.

A simple workaround would be to force mce_disabled to 1 in early Xen
initialization.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-27 Thread Jan Beulich
The oops and backtrace doesn't suggest it's an MSR write.  Does a crX

Oh, right, the MSR write is being ignored, not failed.

write take the same path through the emulator as an MSR write?

No, the two operations take different paths.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 21/33] xen: Xen SMP guest support

2007-06-06 Thread Jan Beulich
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -105,17 +105,15 @@ static void get_runstate_snapshot(struct
   preempt_enable();
 }
 
-static void setup_runstate_info(void)
+static void setup_runstate_info(int cpu)
 {
   struct vcpu_register_runstate_memory_area area;
 
-  area.addr.v = __get_cpu_var(runstate);
+  area.addr.v = per_cpu(runstate, cpu);
 
   if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
  smp_processor_id(), area))

Shouldn't this be 'cpu' rather than 'smp_processor_id()'?

   BUG();
-
-  get_runstate_snapshot(__get_cpu_var(runstate_snapshot));
 }
 
 static void do_stolen_accounting(void)

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jan Beulich
 Andi Kleen [EMAIL PROTECTED] 06.06.07 14:18 

 
 Yes, this could be an issue. Is there any way to get an interrupt or MCE
 when thermal throttling occurs?

Yes you can get an thermal interrupt from the local APIC. See the Linux
kernel source. Of course there would be still a race window.

On the other hand some timing issues on throttling are probably 
the smallest of the users' problems when it really happens.

Not if this results in your box hanging - I think throttling is exactly intended
to keep the box alive as long as possible (and I've seen throttling in action,
with the box happily recovering from the situation - after having seen it a
few times I checked and found the fan covered with dust).

Standard Linux just ignores it.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO

2007-04-05 Thread Jan Beulich
+static __cpuinit void reloc_dyn(Elf32_Ehdr *ehdr, unsigned offset)
+{
+  Elf32_Dyn *dyn = (void *)ehdr + offset;
+
+  for(; dyn-d_tag != DT_NULL; dyn++)
+  switch(dyn-d_tag) {
+  case DT_PLTGOT:
+  case DT_HASH:
+  case DT_STRTAB:
+  case DT_SYMTAB:
+  case DT_RELA:
+  case DT_INIT:
+  case DT_FINI:
+  case DT_REL:
+  case DT_JMPREL:
+  case DT_VERSYM:
+  case DT_VERDEF:
+  case DT_VERNEED:
+  dyn-d_un.d_val += VDSO_HIGH_BASE;
+  }
+}

While there's a certain level of control on what DT_* may appear in the
vDSO, not even considering other than the above types seems fragile to
me. Since future additions to the set are supposedly following a fixed
scheme (distinguishing pointers and values via the low bit when below
OLD_DT_LOOS, and using sub-ranges when between DT_HIOS and
OLD_DT_HIOS), at least also handling those would seem like a good
idea, as would warning about unrecognized types.

Also, even though it shouldn't matter for the final result, if doing things
spec-conforming here you should use d_un.d_ptr.

In addition to Roland's remarks about missing symbol table relocation, I
would also assume section headers, if present, should be relocated.

Jan
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO

2007-04-05 Thread Jan Beulich
 Jeremy Fitzhardinge [EMAIL PROTECTED] 05.04.07 09:31 
Jan Beulich wrote:
 While there's a certain level of control on what DT_* may appear in the
 vDSO, not even considering other than the above types seems fragile to
 me. Since future additions to the set are supposedly following a fixed
 scheme (distinguishing pointers and values via the low bit when below
 OLD_DT_LOOS, and using sub-ranges when between DT_HIOS and
 OLD_DT_HIOS), at least also handling those would seem like a good
 idea, as would warning about unrecognized types.

I wasn't aware of this scheme.  Where is it documented?

Regarding the low bit part, quoting from working gABI chapter 5: To make
it simpler for tools to interpret the contents of dynamic section entries, the
value of each tag, except for those in two special compatibility ranges, will
determine the interpretation of the d_un union. A tag whose value is an
even number indicates a dynamic section entry that uses d_ptr. A tag
whose value is an odd number indicates a dynamic section entry that uses
d_val or that uses neither d_ptr nor d_val. Tags whose values are less
than the special value DT_ENCODING and tags whose values fall between
DT_HIOS and DT_LOPROC do not follow these rules.

Regarding the OS range, all I can point you to is binutils' 
include/elf/common.h.

Jan
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO

2007-04-05 Thread Jan Beulich
+  for(; dyn-d_tag != DT_NULL; dyn++)
+  switch(dyn-d_tag) {
+  case DT_PLTGOT:
+  case DT_HASH:
+  case DT_STRTAB:
+  case DT_SYMTAB:
+  case DT_RELA:
+  case DT_INIT:
+  case DT_FINI:
+  case DT_REL:
+  case DT_DEBUG:
+  case DT_JMPREL:
+  case DT_VERSYM:
+  case DT_VERDEF:
+  case DT_VERNEED:
+  case DT_ENCODING ... DT_HIOS:
+  /* tags above DT_ENCODING are even if they're
+ a pointer, so skip odd ones */
+  if (dyn-d_tag = DT_ENCODING 
+  (dyn-d_tag  1) == 1)
+  break;
+
+  dyn-d_un.d_ptr += VDSO_HIGH_BASE;
+  }

I'm pretty certain the range OLD_DT_LOOS ... DT_LOOS must be excluded
here (the document version I'm looking at is inconsistent in itself here, saying
in one place to stop at DT_LOOS, in a second to stop at DT_HIOS, and in a
third to include DT_LOOS ... ST_HIOS - the inconsistency goes away if
assuming that the stop at DT_LOOS really means stop at OLD_DT_LOOS,
and the stop at DT_HIOS misses to special-case the OLD_DT_LOOS ...
DT_LOOS range.
Additionally I'm a little worried about excluding DT_ADDRRNGLO ...
DT_ADDRRNGHI in case future binutils ever start defaulting to generate
any of these (namely DT_GNU_HASH).

+#define DT_ENCODING   32

Hmm, I was about to say this ought to be 31 when I realized the discrepancy
between document and binutils. I'll have to ask about this...

Jan
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization