Re: [Qemu-devel] qemu, numa: non-contiguous cpusets

2013-09-30 Thread Borislav Petkov
On Mon, Sep 30, 2013 at 11:05:10AM +0200, Paolo Bonzini wrote:
 I think there are already patches on the list to do that, as part of
 the NUMA memory binding series from Wanlong Gao.

Yeah: https://lists.gnu.org/archive/html/qemu-devel/2013-09/msg02833.html

Although I don't see from it how the syntax for -cpus will look like
from that QAPI magic except that it is an

+   '*cpus':   ['uint16'],

:-)

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 1/6] kvm: Add KVM_GET_EMULATED_CPUID

2013-09-30 Thread Borislav Petkov
On Mon, Sep 30, 2013 at 01:13:34PM -0300, Eduardo Habkost wrote:
 I have added it to my TODO-list.  :-)

Cool, thanks. Let me know if I can test stuff and help out somehow.

  
  Also, there's another aspect, while we're here: now that QEMU emulates
  MOVBE with TCG too, how do we specify on the command line, which
  emulation should be used - kvm.ko or QEMU?
 
 You can use accel={tcg,kvm} option on the -machine argument, e.g.
 -machine pc,accel=kvm. Or the -enable-kvm option.

Ah, ok.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 1/6] kvm: Add KVM_GET_EMULATED_CPUID

2013-09-24 Thread Borislav Petkov
On Mon, September 23, 2013 6:28 pm, Eduardo Habkost wrote:
 On Sun, Sep 22, 2013 at 04:44:50PM +0200, Borislav Petkov wrote:
 From: Borislav Petkov b...@suse.de

 Add a kvm ioctl which states which system functionality kvm emulates.
 The format used is that of CPUID and we return the corresponding CPUID
 bits set for which we do emulate functionality.

 Let me check if I understood the purpose of the new ioctl correctly: the
 only reason for GET_EMULATED_CPUID to exist is to allow userspace to
 differentiate features that are native or that are emulated efficiently
 (GET_SUPPORTED_CPUID) and features that are emulated not very
 efficiently (GET_EMULATED_CPUID)?

Not only that - emulated features are not reported in CPUID so they
can be enabled only when specifically and explicitly requested, i.e.
+movbe. Basically, you want to emulate that feature for the guest but
only for this specific guest - the others shouldn't see it.

 If that's the case, how do we decide how efficient emulation should be,
 to deserve inclusion in GET_SUPPORTED_CPUID? I am guessing that the
 criterion will be: if enabling it doesn't risk making performance worse,
 it can get in GET_SUPPORTED_CPUID.

Well, in the MOVBE case, supported means, the host can execute this
instruction natively. Now, you guys say you can emulate x2apic very
efficiently and I'm guessing emulating x2apic doesn't bring any
emulation overhead, thus SUPPORTED_CPUID.

But for single instructions or group of instructions, the distinction
should be very clear.

At least this is how I see it but Gleb probably can comment too.

Thanks.

-- 
Regards/Gruss,
Boris.




Re: [Qemu-devel] [PATCH 1/6] kvm: Add KVM_GET_EMULATED_CPUID

2013-09-26 Thread Borislav Petkov
On Thu, Sep 26, 2013 at 11:19:15AM -0300, Eduardo Habkost wrote:
 Then we may have a problem: some CPU models already have movbe
 included (e.g. Haswell), and patch 6/6 will make -cpu Haswell get
 movbe enabled even if it is being emulated.

Huh? HSW has MOVBE so we won't #UD on it and MOVBE will get executed in
hardware when executing the guest. IOW, we'll never get to the emulation
path of piggybacking on the #UD.

 So if we really want to avoid enabling emulated features by mistake,
 we may need a new CPU flag in addition to enforce to tell QEMU that
 it is OK to enable emulated features (maybe -cpu ...,emulate?).

EMULATED_CPUID are off by default and only if you request them
specifically, they get enabled. If you start with -cpu Haswell, MOVBE
will be already set in the host CPUID.

Or am I missing something?

 But my question still stands: suppose we had x2apic emulation
 implemented but for some reason it was painfully slow, we wouldn't
 want to enable it by mistake. In this case, it would end up on
 EMULATED_CPUID and not on SUPPORTED_CPUID, right?

IMHO we want to enable emulation only when explicitly requested...
regardless of the emulation performance.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 1/6] kvm: Add KVM_GET_EMULATED_CPUID

2013-09-26 Thread Borislav Petkov
On Thu, Sep 26, 2013 at 04:20:59PM -0300, Eduardo Habkost wrote:
 Please point me to the code that does this, because I don't see it on
 patch 6/6.

@@ -1850,7 +1850,14 @@ static void filter_features_for_kvm(X86CPU *cpu)
  wi-cpuid_ecx,
  wi-cpuid_reg);
 uint32_t requested_features = env-features[w];
+
+uint32_t emul_features = kvm_arch_get_emulated_cpuid(s, wi-cpuid_eax,
+wi-cpuid_ecx,
+wi-cpuid_reg);
+
 env-features[w] = host_feat;
+env-features[w] |= (requested_features  emul_features);

Basically we give the requested_features a second chance here.

If we don't request an emulated feature, it won't get enabled.

  If you start with -cpu Haswell, MOVBE
  will be already set in the host CPUID.
  
  Or am I missing something?
 
 In the Haswell example, it is unlikely but possible in theory: you would
 need a CPU that supported all features from Haswell except movbe. But
 what will happen if you are using -cpu n270,enforce on a SandyBridge
 host?

That's an interesting question: AFAICT, it will fail because MOVBE is
not available on the host, right?

And if so, then this is correct behavior IMHO, or how exactly is the
enforce thing supposed to work? Enforce host CPUID?

 Also, we don't know anything about future CPUs or future features
 that will end up on EMULATED_CPUID. The current code doesn't have
 anything to differentiate features that were already included in the
 CPU definition and ones explicitly enabled in the command-line (and I
 would like to keep it that way).

Ok.

 And just because a feature was explicitly enabled in the command-line,
 that doesn't mean the user believe it is acceptable to get it running
 in emulated mode. That's why I propose a new emulate flag, to allow
 features to be enabled in emulated mode.

And I think, saying -cpu ...,+movbe is an explicit statement enough to
say that yes, I am starting this guest and I want MOVBE emulation.

 Well, x2apic is emulated by KVM, and it is on SUPPORTED_CPUID. Ditto
 for tsc-deadline. Or are you talking specifically about instruction
 emulation?

Basically, I'm viewing this from a very practical standpoint - if I
build a kernel which requires MOVBE support but I cannot boot it in kvm
because it doesn't emulate MOVBE (TCG does now but it didn't before)
I'd like to be able to address that shortcoming by emulating that
instruction, if possible.

And the whole discussion grew out from the standpoint of being able to
emulate stuff so that you can do quick and dirty booting of kernels but
not show that emulation capability to the wide audience since it is slow
and it shouldn't be used and then migration has issues, etc, etc.

But hey, I don't really care all that much if I have to also say
-emulate in order to get my functionality.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 1/6] kvm: Add KVM_GET_EMULATED_CPUID

2013-09-28 Thread Borislav Petkov
On Fri, Sep 27, 2013 at 11:21:34AM -0300, Eduardo Habkost wrote:
 The problem here is that requested_features doesn't include just
 the explicit +flag flags, but any flag included in the CPU model
 definition. See the -cpu n270 example below.

Oh, you mean if requested_features would contain a flag included from
the CPU model definition - a flag which we haven't requested explicitly
- and if kvm emulates that flag, then it will get enabled?

Hmm.

 It should, but your patch will make it stop failing because of MOVBE, as
 now it can be emulated[1].

Right.

 enforce makes sure all features are really being enabled. It makes
 QEMU abort if there's any feature that can't be enabled on that host.

Ok.

 [1] Maybe one source of confusion is that the existing code have two
 feature-filtering functions doing basically the same thing:
 filter_features_for_kvm() and kvm_check_features_against_host().  That's

Yes, and the first gets executed unconditionally and does the feature
filtering,  right after the second has run in the kvm_enabled() branch.

 something we must clean up, and they should be unified. enforce should
 become synonymous to make sure filtered_features is all zeroes.  This
 way, libvirt can emulate what 'enforce does while being able to collect
 detailed error information (which is not easy to do if QEMU simply
 aborts).

Ok, maybe someone who's more knowledgeable with this code should do it -
not me :)

Also, there's another aspect, while we're here: now that QEMU emulates
MOVBE with TCG too, how do we specify on the command line, which
emulation should be used - kvm.ko or QEMU?

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



[Qemu-devel] qemu, numa: non-contiguous cpusets

2013-09-29 Thread Borislav Petkov
Btw,

while I got your attention, on a not-really related topic: how do we
feel about adding support for specifying a non-contiguous set of cpus
for a numa node in qemu with the -numa option? I.e., like this, for
example:

x86_64-softmmu/qemu-system-x86_64 -smp 8 -numa node,nodeid=0,cpus=0\;2\;4-5 
-numa node,nodeid=1,cpus=1\;3\;6-7

The ';' needs to be escaped from the shell but I'm open for better
suggestions.

Here's a diff:

---
diff --git a/vl.c b/vl.c
index 4e709d5c1c20..82a6c8451fb0 100644
--- a/vl.c
+++ b/vl.c
@@ -1261,9 +1261,27 @@ char *get_boot_devices_list(size_t *size)
 return list;
 }
 
+static int __numa_set_cpus(unsigned long *map, int start, int end)
+{
+if (end = MAX_CPUMASK_BITS) {
+end = MAX_CPUMASK_BITS - 1;
+fprintf(stderr,
+qemu: NUMA: A max of %d VCPUs are supported\n,
+ MAX_CPUMASK_BITS);
+return -EINVAL;
+}
+
+if (end  start) {
+return -EINVAL;
+}
+
+bitmap_set(map, start, end-start+1);
+return 0;
+}
+
 static void numa_node_parse_cpus(int nodenr, const char *cpus)
 {
-char *endptr;
+char *endptr, *ptr = (char *)cpus;
 unsigned long long value, endvalue;
 
 /* Empty CPU range strings will be considered valid, they will simply
@@ -1273,7 +1291,8 @@ static void numa_node_parse_cpus(int nodenr, const char 
*cpus)
 return;
 }
 
-if (parse_uint(cpus, value, endptr, 10)  0) {
+fromthetop:
+if (parse_uint(ptr, value, endptr, 10)  0) {
 goto error;
 }
 if (*endptr == '-') {
@@ -1282,22 +1301,22 @@ static void numa_node_parse_cpus(int nodenr, const char 
*cpus)
 }
 } else if (*endptr == '\0') {
 endvalue = value;
-} else {
-goto error;
-}
+} else if (*endptr == ';') {
+   if (__numa_set_cpus(node_cpumask[nodenr], value, value)  0) {
+   goto error;
+   }
+   endptr++;
+if (*endptr == '\0')
+   return;
 
-if (endvalue = MAX_CPUMASK_BITS) {
-endvalue = MAX_CPUMASK_BITS - 1;
-fprintf(stderr,
-qemu: NUMA: A max of %d VCPUs are supported\n,
- MAX_CPUMASK_BITS);
-}
+   ptr = endptr;
 
-if (endvalue  value) {
+   goto fromthetop;
+} else {
 goto error;
 }
 
-bitmap_set(node_cpumask[nodenr], value, endvalue-value+1);
+__numa_set_cpus(node_cpumask[nodenr], value, endvalue);
 return;
 
 error:
--

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [edk2] SetVirtualAddressMap and NX bit

2013-08-01 Thread Borislav Petkov
+ Matt.

On Wed, Jul 31, 2013 at 02:10:04PM +0200, Laszlo Ersek wrote:
 Just random ideas...

First of all, thanks for looking. You made me look too and find the fun
:-)

The fact that you guys didn't say Oh yeah, we do this because...  but
simply shruggingly suggested ideas should've been enough to give me the
hint to look in our own backyard and maybe to permit the possibility of
the kernel doing something funny. And it does, indeed!

And for that you need to look at SetVirtualAddressMap() itself or
rather, how we call it:

phys_efi_set_virtual_address_map
|- efi_call_phys_prelog
|- efi_call_phys4(efi_phys.set_virtual_address_map
|- efi_call_phys_epilog

Now guess what those pre- and epi- things do. Right:

efi_call_phys_prelog does early_code_mapping_set_exec(1) and
efi_call_phys_epilog does early_code_mapping_set_exec(0) and we end up
with that PTE's NX bit set:

before:
 [   47.379000] __lookup_address_in_pgd: pte: 0x7fb12063 
 (0x88007c823b68)

after:
 [   47.393000] __lookup_address_in_pgd: pte: 0x80007fb12163 
 (0x88007c823b68)

What is still missing from the big picture is why the PTE in my
pagetable (not the kernel's pagetable) gets that bit set??

I mean, the EFI code is using pgd_offset_k() which looks at init_mm and
my PGD is a different one. And I guess the explanation for that would
also clarify why this doesn't happen on baremetal so probably it has
something to do with the nested page table thingy.

Oh well...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 1/3] x86, mpx: add documentation on Intel MPX

2013-12-06 Thread Borislav Petkov
On Sat, Dec 07, 2013 at 02:52:54AM +0800, Qiaowei Ren wrote:
 This patch adds the Documentation/intel_mpx.txt file with some
 information about Intel MPX.
 
 Signed-off-by: Qiaowei Ren qiaowei@intel.com
 Signed-off-by: Xudong Hao xudong@intel.com
 Signed-off-by: Liu Jinsong jinsong@intel.com
 ---
  Documentation/intel_mpx.txt |   77 
 +++

Documentation/x86/ is probably a more fitting place for this.

  1 files changed, 77 insertions(+), 0 deletions(-)
  create mode 100644 Documentation/intel_mpx.txt
 
 diff --git a/Documentation/intel_mpx.txt b/Documentation/intel_mpx.txt
 new file mode 100644
 index 000..3d947d0
 --- /dev/null
 +++ b/Documentation/intel_mpx.txt
 @@ -0,0 +1,77 @@
 +Intel(R) MPX Overview:
 +=
 +
 +Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new
 +capability introduced into Intel Architecture. Intel MPX can
 +increase the robustness of software when it is used in conjunction
 +with compiler changes to check memory references, for those
 +references whose compile-time normal intentions are usurped

That's a strange formulation, what does it actually mean? The intentions
of references??

 +at runtime due to buffer overflow or underflow.
 +
 +Two of the most important goals of Intel MPX are to provide
 +this capability at very low performance overhead for newly
 +compiled code, and to provide compatibility mechanisms with
 +legacy software components. A direct benefit Intel MPX provides
 +is hardening software against malicious attacks designed to
 +cause or exploit buffer overruns.
 +
 +For details about the Intel MPX instructions, see Intel(R)
 +Architecture Instruction Set Extensions Programming Reference.
 +
 +Intel(R) MPX Programming Model
 +--
 +
 +Intel MPX introduces new registers and new instructions that
 +operate on these registers. Some of the registers added are
 +bounds registers which store a pointer's lower bound and upper
 +bound limits. Whenever the pointer is used, the requested
 +reference is checked against the pointer's associated bounds,
 +thereby preventing out-of-bound memory access (such as buffer
 +overflows and overruns). Out-of-bounds memory references
 +initiate a #BR exception which can then be handled in an
 +appropriate manner.
 +
 +Loading and Storing Bounds using Translation
 +
 +
 +Intel MPX defines two instructions for load/store of the linear
 +address of a pointer to a buffer, along with the bounds of the
 +buffer into a paging structure of extended bounds. Specifically
 +when storing extended bounds, the processor will perform address
 +translation of the address where the pointer is stored to an
 +address in the Bound Table (BT) to determine the store location
 +of extended bounds. Loading of an extended bounds performs the

s/an//

 +reverse sequence.
 +

...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 2/3] X86, mpx: Intel MPX definition

2013-12-06 Thread Borislav Petkov
On Sat, Dec 07, 2013 at 02:52:55AM +0800, Qiaowei Ren wrote:
 
 Signed-off-by: Qiaowei Ren qiaowei@intel.com
 Signed-off-by: Xudong Hao xudong@intel.com
 Signed-off-by: Liu Jinsong jinsong@intel.com
 ---
  arch/x86/include/asm/cpufeature.h |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)

This patch should probably be merged with the next one...

 
 diff --git a/arch/x86/include/asm/cpufeature.h 
 b/arch/x86/include/asm/cpufeature.h
 index d3f5c63..6c2738d 100644
 --- a/arch/x86/include/asm/cpufeature.h
 +++ b/arch/x86/include/asm/cpufeature.h
 @@ -216,6 +216,7 @@
  #define X86_FEATURE_ERMS (9*32+ 9) /* Enhanced REP MOVSB/STOSB */
  #define X86_FEATURE_INVPCID  (9*32+10) /* Invalidate Processor Context ID */
  #define X86_FEATURE_RTM  (9*32+11) /* Restricted Transactional 
 Memory */
 +#define X86_FEATURE_MPX  (9*32+14) /* Memory Protection 
 Extension */
  #define X86_FEATURE_RDSEED   (9*32+18) /* The RDSEED instruction */
  #define X86_FEATURE_ADX  (9*32+19) /* The ADCX and ADOX 
 instructions */
  #define X86_FEATURE_SMAP (9*32+20) /* Supervisor Mode Access Prevention 
 */
 @@ -330,6 +331,7 @@ extern const char * const x86_power_flags[32];
  #define cpu_has_perfctr_l2   boot_cpu_has(X86_FEATURE_PERFCTR_L2)
  #define cpu_has_cx8  boot_cpu_has(X86_FEATURE_CX8)
  #define cpu_has_cx16 boot_cpu_has(X86_FEATURE_CX16)
 +#define cpu_has_mpx  boot_cpu_has(X86_FEATURE_MPX)

... and we're trying to not have more of those macros so people should be simply
using boot_cpu_has(X86_FEATURE_YYY).

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 3/3] X86, mpx: Intel MPX xstate feature definition

2013-12-06 Thread Borislav Petkov
On Sat, Dec 07, 2013 at 02:52:56AM +0800, Qiaowei Ren wrote:

Commit message please.

 Signed-off-by: Qiaowei Ren qiaowei@intel.com
 Signed-off-by: Xudong Hao xudong@intel.com
 Signed-off-by: Liu Jinsong jinsong@intel.com
 ---
  arch/x86/include/asm/processor.h |   23 +++
  arch/x86/include/asm/xsave.h |6 +-
  2 files changed, 28 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/include/asm/processor.h 
 b/arch/x86/include/asm/processor.h
 index 987c75e..43be6f6 100644
 --- a/arch/x86/include/asm/processor.h
 +++ b/arch/x86/include/asm/processor.h
 @@ -370,6 +370,26 @@ struct ymmh_struct {
   u32 ymmh_space[64];
  };
  
 +struct lwp_struct {
 + u64 lwpcb_addr;
 + u32 flags;
 + u32 buf_head_offset;
 + u64 buf_base;
 + u32 buf_size;
 + u32 filters;
 + u64 saved_event_record[4];
 + u32 event_counter[16];
 +};
 +
 +struct bndregs_struct {
 + u64 bndregs[8];
 +} __packed;
 +
 +struct bndcsr_struct {
 + u64 cfg_reg_u;
 + u64 status_reg;
 +} __packed;
 +
  struct xsave_hdr_struct {
   u64 xstate_bv;
   u64 reserved1[2];
 @@ -380,6 +400,9 @@ struct xsave_struct {
   struct i387_fxsave_struct i387;
   struct xsave_hdr_struct xsave_hdr;
   struct ymmh_struct ymmh;
 + struct lwp_struct lwp;

I'm guessing this and the struct lwp_struct above is being added so that
you can have the LWP XSAVE area size? If so, you don't need it: LWP
XSAVE area is 128 bytes at offset 832 according to my manuals so I'd
guess having a u8 lwp_area[128] should be fine.


 + struct bndregs_struct bndregs;
 + struct bndcsr_struct bndcsr;
   /* new processor state extensions will go here */
  } __attribute__ ((packed, aligned (64)));
  
 diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
 index 0415cda..5cd9de3 100644
 --- a/arch/x86/include/asm/xsave.h
 +++ b/arch/x86/include/asm/xsave.h
 @@ -9,6 +9,8 @@
  #define XSTATE_FP0x1
  #define XSTATE_SSE   0x2
  #define XSTATE_YMM   0x4
 +#define XSTATE_BNDREGS   0x8
 +#define XSTATE_BNDCSR0x10
  
  #define XSTATE_FPSSE (XSTATE_FP | XSTATE_SSE)
  
 @@ -20,10 +22,12 @@
  #define XSAVE_YMM_SIZE   256
  #define XSAVE_YMM_OFFSET(XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET)
  
 +#define XSTATE_FLEXIBLE (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)

What's the use of that macro if it is used only once?

 +#define XSTATE_EAGER (XSTATE_BNDREGS | XSTATE_BNDCSR)
  /*
   * These are the features that the OS can handle currently.
   */
 -#define XCNTXT_MASK  (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
 +#define XCNTXT_MASK  (XSTATE_FLEXIBLE | XSTATE_EAGER)
  
  #ifdef CONFIG_X86_64
  #define REX_PREFIX   0x48, 

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 1/3] x86, mpx: add documentation on Intel MPX

2013-12-06 Thread Borislav Petkov
On Fri, Dec 06, 2013 at 03:55:10PM +, Ren, Qiaowei wrote:
 It is from public introduction and specification, you can refer to
http://software.intel.com/en-us/articles/introduction-to-intel-memory-protection-extensions

Yep, saw it there too. Which doesn't make it any less strange :)

Btw, if you're going to quote the public documentation, why even add the
text file here? You can simply add the link above as a comment to the
code or as a oneliner somewhere in Documentation/x86/.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH 3/3] X86, mpx: Intel MPX xstate feature definition

2013-12-06 Thread Borislav Petkov
On Fri, Dec 06, 2013 at 09:23:22AM -0800, H. Peter Anvin wrote:
 On 12/06/2013 05:46 AM, Borislav Petkov wrote:
  I'm guessing this and the struct lwp_struct above is being added so that
  you can have the LWP XSAVE area size? If so, you don't need it: LWP
  XSAVE area is 128 bytes at offset 832 according to my manuals so I'd
  guess having a u8 lwp_area[128] should be fine.
  
 Sure, but any reason to *not* document the internal structure?

Only that you might start getting remove-this-unused-struct patches. :-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [RFC PATCH 2/6] virtio/console: Add a failback for unstealable pipe buffer

2012-08-09 Thread Borislav Petkov
On Thu, Aug 09, 2012 at 02:33:12PM +0530, Amit Shah wrote:
  @@ -807,9 +807,31 @@ static int pipe_to_sg(struct pipe_inode_info *pipe, 
  struct pipe_buffer *buf,
   
  len = min(buf-len, sd-len);
  sg_set_page((sgl-sg[sgl-n]), buf-page, len, buf-offset);
  -   sgl-n++;
  -   sgl-len += len;
  +   } else {
  +   /* Failback to copying a page */
  +   struct page *page = alloc_page(GFP_KERNEL);
 
 I prefer zeroing out the page.  If there's not enough data to be
 filled in the page, the remaining data can be leaked to the host.

get_zeroed_page()?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551



Re: [Qemu-devel] [RFC PATCH 1/2] ivring: Add a ring-buffer driver on IVShmem

2012-06-05 Thread Borislav Petkov
On Tue, Jun 05, 2012 at 10:01:17PM +0900, Yoshihiro YUNOMAE wrote:
 This patch adds a ring-buffer driver for IVShmem device, a virtual RAM device 
 in
 QEMU. This driver can be used as a ring-buffer for kernel logging or tracing 
 of
 a guest OS by recording kernel programing or SystemTap.
 
 This ring-buffer driver is implemented very simple. First 4kB of shared memory
 region is control structure of a ring-buffer. In this region, some values for
 managing the ring-buffer is stored such as bits and mask of whole memory size,
 writing position, threshold value for notification to a reader on a host OS.
 This region is used by the reader to know writing position. Then, total
 memory size - 4kB equals to usable memory region for recording data.
 This ring-buffer driver records any data from start to end of the writable
 memory region.
 
 When writing size exceeds a threshold value, this driver can notify a reader
 to read data by using writel(). As this later patch, reader does not have any
 function for receiving the notification. This notification feature will be 
 used
 near the future.
 
 As a writer records data in this ring-buffer, spinlock function is used to
 avoid competing by some writers in multi CPU environment. Not to use spinlock,
 lockless ring-buffer like as ftrace and one ring-buffer one CPU will be
 implemented near the future.

Yet another ring buffer?

We already have an ftrace and perf ring buffer, can't you use one of those?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551



Re: [Qemu-devel] x86, nops settings result in kernel crash

2012-08-16 Thread Borislav Petkov
On Thu, Aug 16, 2012 at 09:35:12AM -0400, Tomas Racek wrote:
 Hi,
 
 I am writing a file system test which I execute in qemu with kernel compiled 
 from latest git sources and running it causes this error:
 
 https://bugzilla.kernel.org/show_bug.cgi?id=45971
 
 It works with v3.5, so I ran git bisect which pointed me to:
 
 d6250a3f12edb3a86db9598ffeca3de8b4a219e9 x86, nops: Missing break resulting 
 in incorrect selection on Intel
 
 To be quite honest, I don't understand this stuff much but I tried to do some 
 debugging and I figured out (I hope) that the crash is caused by setting 
 ideal_nops to p6_nops (k8_nops was used before the break statement was added).

Maybe I overlooked it or maybe it was implied but did you try reverting
the patch and rerunning your test? Does it work ok then?

Thanks.

-- 
Regards/Gruss,
Boris.



Re: [Qemu-devel] x86, nops settings result in kernel crash

2012-08-17 Thread Borislav Petkov
On Fri, Aug 17, 2012 at 03:43:56AM -0400, Tomas Racek wrote:
 Well, I've added some debug statements to the code:
 
 void __init arch_init_ideal_nops(void)
 {
 switch (boot_cpu_data.x86_vendor) {
 case X86_VENDOR_INTEL:
 /*
  * Due to a decoder implementation quirk, some
  * specific Intel CPUs actually perform better with
  * the k8_nops than with the SDM-recommended NOPs.
  */
 if (boot_cpu_data.x86 == 6 
 boot_cpu_data.x86_model = 0x0f 
 boot_cpu_data.x86_model != 0x1c 
 boot_cpu_data.x86_model != 0x26 
 boot_cpu_data.x86_model != 0x27 
 boot_cpu_data.x86_model  0x30) {
 printk(NOPS: Option 1\n);
 ideal_nops = k8_nops;
 } else if (boot_cpu_has(X86_FEATURE_NOPL)) {
 printk(NOPS: Option 2\n);
ideal_nops = p6_nops;
 } else {
 printk(NOPS: Option 3\n);
 #ifdef CONFIG_X86_64
 ideal_nops = k8_nops;
 #else
 ideal_nops = intel_nops;
 #endif
 }
 break;
 default:
 #ifdef CONFIG_X86_64
 ideal_nops = k8_nops;
 #else
 if (boot_cpu_has(X86_FEATURE_K8))
 ideal_nops = k8_nops;
 else if (boot_cpu_has(X86_FEATURE_K7))
 ideal_nops = k7_nops;
 else
 ideal_nops = intel_nops;
 #endif
 }
 }
 
 This gives me Option 1 with -cpu host and Option 2 without.

This looks like an emulation bug. The interesting thing is that your
both traces from the bugzilla point to generic_make_request_checks but
it could also be due to timing.

Decoding the instruction stream in the second trace in the bugzilla gives:

[ 278.595106] Code: 03 48 89 03 48 8b 57 70 48 89 53 10 48 2b 01 8b 3f 48 89 45 
98 48 8b 82 90 00 00 00 89 7d 94 48 8b 80 60 02 00 00 48 89 45 88 ac 17 00 00 
c8 45 85 e4 74 30 48 8b 43 10 48 8b 40 08 48 8b 40 48 
All code

   0:   03 48 89add-0x77(%rax),%ecx
   3:   03 48 8badd-0x75(%rax),%ecx
   6:   57  push   %rdi
   7:   70 48   jo 0x51
   9:   89 53 10mov%edx,0x10(%rbx)
   c:   48 2b 01sub(%rcx),%rax
   f:   8b 3f   mov(%rdi),%edi
  11:   48 89 45 98 mov%rax,-0x68(%rbp)
  15:   48 8b 82 90 00 00 00mov0x90(%rdx),%rax
  1c:   89 7d 94mov%edi,-0x6c(%rbp)
  1f:   48 8b 80 60 02 00 00mov0x260(%rax),%rax
  26:   48 89 45 88 mov%rax,-0x78(%rbp)
  2a:   ac  lods   %ds:(%rsi),%al
  2b:*  17  (bad)   -- trapping instruction
  2c:   00 00   add%al,(%rax)
  2e:   c8 45 85 e4 enterq $0x8545,$0xe4
  32:   74 30   je 0x64
  34:   48 8b 43 10 mov0x10(%rbx),%rax
  38:   48 8b 40 08 mov0x8(%rax),%rax
  3c:   48 8b 40 48 mov0x48(%rax),%rax
...

Code starting with the faulting instruction
===
   0:   17  (bad)  
   1:   00 00   add%al,(%rax)
   3:   c8 45 85 e4 enterq $0x8545,$0xe4
   7:   74 30   je 0x39
   9:   48 8b 43 10 mov0x10(%rbx),%rax
   d:   48 8b 40 08 mov0x8(%rax),%rax
  11:   48 8b 40 48 mov0x48(%rax),%rax


and an instruction with opcode 0x17 in 64-bit mode is, AFAICT,
invalid (on 32-bit it is pop %ss according to this thing:
http://www.onlinedisassembler.com).

-- 
Regards/Gruss,
Boris.



[Qemu-devel] [PATCH] target-i386: n270 can MOVBE

2013-02-08 Thread Borislav Petkov
From: Borislav Petkov b...@suse.de

The Atom core (cpu name n270 in QEMU speak) supports MOVBE. This is
needed when booting 3.8 and later linux kernels built with the MATOM
target because we require MOVBE in order to boot properly now.

Cc: H. Peter Anvin h...@zytor.com
Cc: Richard Henderson r...@twiddle.net
Signed-off-by: Borislav Petkov b...@suse.de
---

Right, so I was playing with booting MATOM kernels in QEMU.
As it turns out, QEMU's n270 model doesn't advertize MOVBE
although the real hardware supports it. Quick search pointed me to
http://lists.nongnu.org/archive/html/qemu-devel/2013-01/msg04317.html
which adds that support, among other things. I've merged Richard's
patchset with qemu's current master and after applying this patch below,
I can report success booting an MATOM kernel with QEMU. The same kernel
boots on the real n270 hardware, btw.

 target-i386/cpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 9f38e4435e53..83816edd8410 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -610,7 +610,8 @@ static x86_def_t builtin_x86_defs[] = {
 CPUID_ACPI | CPUID_SS | CPUID_HT | CPUID_TM | CPUID_PBE,
 /* Some CPUs got no CPUID_SEP */
 .ext_features = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 |
-CPUID_EXT_DSCPL | CPUID_EXT_EST | CPUID_EXT_TM2 | CPUID_EXT_XTPR,
+CPUID_EXT_DSCPL | CPUID_EXT_EST | CPUID_EXT_TM2 | CPUID_EXT_XTPR |
+   CPUID_EXT_MOVBE,
 .ext2_features = (PPRO_FEATURES  CPUID_EXT2_AMD_ALIASES) |
 CPUID_EXT2_NX,
 .ext3_features = CPUID_EXT3_LAHF_LM,
-- 
1.8.1.2.422.g08c0e7f




Re: [Qemu-devel] [PATCH] target-i386: n270 can MOVBE

2013-02-08 Thread Borislav Petkov
Hi Andreas,

On Fri, Feb 08, 2013 at 12:38:03PM +0100, Andreas Färber wrote:
 Am 08.02.2013 10:30, schrieb Borislav Petkov:
  From: Borislav Petkov b...@suse.de
  
  The Atom core (cpu name n270 in QEMU speak) supports MOVBE. This is
  needed when booting 3.8 and later linux kernels built with the MATOM
  target because we require MOVBE in order to boot properly now.
  
  Cc: H. Peter Anvin h...@zytor.com
  Cc: Richard Henderson r...@twiddle.net
  Signed-off-by: Borislav Petkov b...@suse.de
 
 Please CC me on cpu.c changes (cf. MAINTAINERS).

Wow, qemu has get_maintainer.pl too:

$ git show HEAD | ./scripts/get_maintainer.pl
Andreas Färber afaer...@suse.de (supporter:CPU)
qemu-devel@nongnu.org (odd fixer:X86

:-)

Ok, fixed.

   .ext_features = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | 
  CPUID_EXT_SSSE3 |
  -CPUID_EXT_DSCPL | CPUID_EXT_EST | CPUID_EXT_TM2 | 
  CPUID_EXT_XTPR,
  +CPUID_EXT_DSCPL | CPUID_EXT_EST | CPUID_EXT_TM2 | 
  CPUID_EXT_XTPR |
  +   CPUID_EXT_MOVBE,
 
 Tab. Please use scripts/checkpatch.pl to verify before sending.

Done.

   .ext2_features = (PPRO_FEATURES  CPUID_EXT2_AMD_ALIASES) |
   CPUID_EXT2_NX,
   .ext3_features = CPUID_EXT3_LAHF_LM,
 
 Otherwise if someone can ack (or if you can point me to a manual), this
 looks like a good bugfix for v1.4.

Right, I don't know what v1.4 is but this still needs Richard's patchset
enabling MOVBE dynamic translation in qemu to go in first before
enabling MOVBE for the n270 model.

Anyway, fixed patch is below.

 CC'ing some more CPU'ish people.

Well, if you do cat /proc/cpuinfo on an Atom, you can see movbe there.

Here it is from my atom box:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 28
model name  : Intel(R) Atom(TM) CPU N270   @ 1.60GHz
stepping: 2
microcode   : 0x208
cpu MHz : 800.000
cache size  : 512 KB
physical id : 0
siblings: 1
core id : 0
cpu cores   : 1
apicid  : 0
initial apicid  : 0
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat 
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc arch_perfmon pebs 
bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm movbe lahf_lm 
dtherm
bogomips: 3192.01
clflush size: 64
cache_alignment : 64
address sizes   : 32 bits physical, 32 bits virtual
power management:

Or, more cpu vendor-y, do CPUID.EAX(1) on an n270 and bit 22 in ECX is
set. Here is a full base cpuid leafs dump:

$ ./dump_cpuid
max base function: 0x000a
CPUID.EAX=0x
EAX=0x000a, EBX=0x756e6547, ECX=0x6c65746e, EDX=0x49656e69
CPUID.EAX=0x0001
EAX=0x000106c2, EBX=0x01020800, ECX=0x0040c39d, EDX=0xbfe9fbff
^
bit 22 here is MOVBE presence.

CPUID.EAX=0x0002
EAX=0x4fba5901, EBX=0x0e3080c0, ECX=0x, EDX=0x
CPUID.EAX=0x0003
EAX=0x, EBX=0x, ECX=0x, EDX=0x
CPUID.EAX=0x0004
EAX=0x4121, EBX=0x0140003f, ECX=0x003f, EDX=0x0001
CPUID.EAX=0x0005
EAX=0x0040, EBX=0x0040, ECX=0x0003, EDX=0x00020220
CPUID.EAX=0x0006
EAX=0x0001, EBX=0x0002, ECX=0x0001, EDX=0x
CPUID.EAX=0x0007
EAX=0x, EBX=0x, ECX=0x, EDX=0x
CPUID.EAX=0x0008
EAX=0x, EBX=0x, ECX=0x, EDX=0x
CPUID.EAX=0x0009
EAX=0x, EBX=0x, ECX=0x, EDX=0x
CPUID.EAX=0x000a
EAX=0x07280203, EBX=0x, ECX=0x, EDX=0x2501

And the official document is this one here:

http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html

Table 5-5 Feature Flags Reported in the ECX Register (Sheet 1 of 2).

HTH.

--
From 5371732b65fbb9a873b52c67a260ffcaf7caf88d Mon Sep 17 00:00:00 2001
From: Borislav Petkov b...@suse.de
Date: Fri, 8 Feb 2013 10:19:44 +0100
Subject: [PATCH] target-i386: n270 can MOVBE
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Atom core (cpu name n270 in QEMU speak) supports MOVBE. This is
needed when booting 3.8 and later linux kernels built with the MATOM
target because we require MOVBE in order to boot properly now.

Cc: H. Peter Anvin h...@zytor.com
Cc: Richard Henderson r...@twiddle.net
Cc: Andreas Färber afaer...@suse.de
Signed-off-by: Borislav Petkov b...@suse.de
---
 target-i386/cpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 4a516e1f9e25..36a133462a8d 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -620,7 +620,8 @@ static x86_def_t builtin_x86_defs[] = {
 CPUID_ACPI | CPUID_SS | CPUID_HT

Re: [Qemu-devel] [PATCH] target-i386: n270 can MOVBE

2013-02-08 Thread Borislav Petkov
On Fri, Feb 08, 2013 at 01:19:02PM -0200, Eduardo Habkost wrote:
 -machine pc-1.3 -cpu n270 (and older machine-types) needs to keep
 MOVBE disabled, or you will break live migration.

 Personally I wouldn't mind declaring n270 as a non-migratable CPU
 model. But libvirt supports n270, so it already expects n270 to expose
 a stable ABI on each machine-type.

From looking at the code - and I've never looked at qemu code before
last night - we could filter migration features in x86_cpu_realize()
when called down the pc_init1() path.

We simply need to look the QEMUMachine.is_default flag and leave the
feature on only if pc-1.4 which sets the is_default thing.

Or is there a more preferred way to do that?

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH] target-i386: n270 can MOVBE

2013-02-08 Thread Borislav Petkov
On Fri, Feb 08, 2013 at 01:58:58PM -0200, Eduardo Habkost wrote:

[ … ]

 As we don't have a decent method to do that today, we are using static
 variables and compatibility-setup functions called from the machine init
 function. See disable_kvm_pv_eoi() for example.
 
 One day we will be able to do that using properties on the machine-type
 compat_props tables, but we can't do that yet.

Ok, understood.

 People can easily work around the lack of the feature today, using
 -cpu n270,+movbe,

Are you sure?

$ qemu-system-i386 -snapshot ... -cpu n270,+movbe ...

from latest qemu master doesn't seem to work here. We still don't see
bit 22 in ECX of CPUID.EAX(1) advertized to the guest.

But that's not the big problem - we still need the actual implementation
of MOVBE in qemu otherwise the guest kernel #GPs when trying to execute
that instruction.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [PATCH] target-i386: n270 can MOVBE

2013-02-08 Thread Borislav Petkov
On Fri, Feb 08, 2013 at 04:23:04PM -0200, Eduardo Habkost wrote:
 If you don't have the implementation of MOVBE working (and included on
 TCG_EXT_FEATURES), neither your patch or +movbe can help you.

Yes, running -cpu n270,+movbe with Richard's patchset merged works.

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [visorchipset] invalid opcode: 0000 [#1] PREEMPT SMP

2014-04-13 Thread Borislav Petkov
Should we perhaps CC qemu-devel here for an opinion.

Guys, this mail should explain the issue but in case there are
questions, the whole thread starts here:

http://lkml.kernel.org/r/20140407111725.GC25152@localhost

Thanks.

On Sat, Apr 12, 2014 at 01:35:49AM +0800, Jet Chen wrote:
 On 04/12/2014 12:33 AM, H. Peter Anvin wrote:
  On 04/11/2014 06:51 AM, Romer, Benjamin M wrote:
 
  I'm still confused where KVM comes into the picture.  Are you actually
  using KVM (and thus talking about nested virtualization) or are you
  using Qemu in JIT mode and running another hypervisor underneath?
 
  The test that Fengguang used to find the problem was running the linux
  kernel directly using KVM. When the kernel was run with -cpu Haswell,
  +smep,+smap set, the vmcall failed with invalid op, but when the kernel
  is run with -cpu qemu64, the vmcall causes a vmexit, as it should.
  
  As far as I know, Fengguang's test doesn't use KVM at all, it runs Qemu
  as a JIT.  Completely different thing.  In that case Qemu probably
  should *not* set the hypervisor bit.  However, the only thing that the
  hypervisor bit means is that you can look for specific hypervisor APIs
  in CPUID level 0x4000+.
  
  My point is, the vmcall was made because the hypervisor bit was set. If
  this bit had been turned off, as it would be on a real processor, the
  vmcall wouldn't have happened.
  
  And my point is that that is a bug.  In the driver.  A very serious one.
   You cannot call VMCALL until you know *which* hypervisor API(s) you
  have available, period.
  
  The hypervisor bit is a complete red herring. If the guest CPU is
  running in VT-x mode, then VMCALL should VMEXIT inside the guest
  (invoking the guest root VT-x), 
 
  The CPU is running in VT-X. That was my point, the kernel is running in
  the KVM guest, and KVM is setting the CPU feature bits such that bit 31
  is enabled.
  
  Which it is because it wants to export the KVM hypercall interface.
  However, keying VMCALL *only* on the HYPERVISOR bit is wrong in the extreme.
  
  I don't think it's a red herring because the kernel uses this bit
  elsewhere - it is reported as X86_FEATURE_HYPERVISOR in the CPU
  features, and can be checked with the cpu_has_hypervisor macro (which
  was not used by the original author of the code in the driver, but
  should have been). VMWare and KVM support in the kernel also check for
  this bit before checking their hypervisor leaves for an ID. If it's not
  properly set it affects more than just the s-Par drivers.
 
  but the fact still remains that you
  should never, ever, invoke VMCALL unless you know what hypervisor you
  have underneath.
 
  From the standpoint of the s-Par drivers, yes, I agree (as I already
  said). However, VMCALL is not a privileged instruction, so anyone could
  use it from user space and go right past the OS straight to the
  hypervisor. IMHO, making it *lethal* to the guest is a bad idea, since
  any user could hard-stop the guest with a couple of lines of C.
  
  Typically the hypervisor wants to generate a #UD inside of the guest for
  that case.  The guest OS will intercept it and SIGILL the user space
  process.
  
  -hpa
  
 
 Hi Ben,
 
 I re-tested this case with/without option -enable-kvm.
 
 qemu-system-x86_64 -cpu Haswell,+smep,+smap   invalid op
 qemu-system-x86_64 -cpu kvm64 invalid op
 qemu-system-x86_64 -cpu Haswell,+smep,+smap -enable-kvm   everything OK
 qemu-system-x86_64 -cpu kvm64 -enable-kvm everything OK
 
 I think this is probably a bug in QEMU.
 Sorry for misleading you. I am not experienced in QEMU usage. I don't realize 
 I need try this case with different options Until read Peter's reply.
 
 As Peter said, QEMU probably should *not* set the hypervisor bit. But based 
 on my testing, I think KVM works properly in this case.
 
 Thanks,
 Jet
 

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [RFC 0/2] GET_EMULATED_CPUID support with allow-emulation option

2014-06-05 Thread Borislav Petkov
On Thu, Jun 05, 2014 at 04:12:08PM -0300, Eduardo Habkost wrote:
 In the meantime, we could:
 
  * Include the less fine-tuned allow-emulation (or
allow-experimental-features) option, which is implemented by this
series, for people who use enforce and/or don't care too much about
getting other experimental features enabled.
  * Wait until somebody implements feature=force.

What are you going to do with allow-emulation after this? It will
become an API and you'll have to support it.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



Re: [Qemu-devel] [RFC 0/2] GET_EMULATED_CPUID support with allow-emulation option

2014-06-05 Thread Borislav Petkov
On Fri, Jun 06, 2014 at 12:24:26AM +0200, Alexander Graf wrote:
 But can we drop the EMULATED name somehow? Can we rename [1] the ioctl
 to say GET_UNSUPPORTED_CPUID or something along those lines? The name
 is just a really really bad pick.

What do you mean, a bad pick :-P? I added extra care in naming that
functionality what it is - bitfield in CPUID format of *emulated*
features. Unsupported is wrong too - we do support them if we enable
them explicitly. :-)

How about GET_NOT_REALLY_FAST_BUT_STILL_NOT_FAST_ENOUGH_AS_IN_HW_FAST_CPUID?

:-P

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--



[Qemu-devel] [PATCH] memsave: Add a space after address in error message

2015-02-07 Thread Borislav Petkov
From: Borislav Petkov b...@suse.de

Add the missing space to separate address from specified.

Cc: Anthony Liguori aligu...@amazon.com
Cc: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Borislav Petkov b...@suse.de
---
 cpus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cpus.c b/cpus.c
index 0cdd1d71560b..4fa196d48207 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1489,7 +1489,7 @@ void qmp_memsave(int64_t addr, int64_t size, const char 
*filename,
 if (l  size)
 l = size;
 if (cpu_memory_rw_debug(cpu, addr, buf, l, 0) != 0) {
-error_setg(errp, Invalid addr 0x%016 PRIx64 specified, addr);
+error_setg(errp, Invalid addr 0x%016 PRIx64  specified, addr);
 goto exit;
 }
 if (fwrite(buf, 1, l, f) != l) {
-- 
2.2.0.33.gc18b867




[Qemu-devel] [PATCH] memsave: Improve and disambiguate error message

2015-02-08 Thread Borislav Petkov
On Sun, Feb 08, 2015 at 11:09:24AM +0100, Paolo Bonzini wrote:
 Cc: qemu-triv...@nongnu.org

Thanks.

But, there's more b0rked with this error message so you might wanna take
the patch below instead.

Btw, what are the vim settings when doing patches for qemu, this must be
documented somewhere as coding style differs significantly from that of
the kernel?

:-)

---
From: Borislav Petkov b...@suse.de
Subject: [PATCH] memsave: Improve and disambiguate error message

When requesting a size which cannot be read, the error message shows
a different address which is misleading to the user and it looks like
something's wrong with the address parsing. This is because the input
@addr variable is incremented in the memory dumping loop:

(qemu) memsave 0x8418069c 0xb0 mem
Invalid addr 0x849ffe9c specified

Fix that by saving the original address and size and use them in the
error message:

(qemu) memsave 0x8418069c 0xb0 mem
Invalid addr 0x8418069c/size 11534336 specified

Cc: Anthony Liguori aligu...@amazon.com
Cc: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Borislav Petkov b...@suse.de
---
 cpus.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/cpus.c b/cpus.c
index 0cdd1d71560b..a5c7cad00fc9 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1466,6 +1466,7 @@ void qmp_memsave(int64_t addr, int64_t size, const char 
*filename,
 uint32_t l;
 CPUState *cpu;
 uint8_t buf[1024];
+int64_t orig_addr = addr, orig_size = size;
 
 if (!has_cpu) {
 cpu_index = 0;
@@ -1489,7 +1490,8 @@ void qmp_memsave(int64_t addr, int64_t size, const char 
*filename,
 if (l  size)
 l = size;
 if (cpu_memory_rw_debug(cpu, addr, buf, l, 0) != 0) {
-error_setg(errp, Invalid addr 0x%016 PRIx64 specified, addr);
+error_setg(errp, Invalid addr 0x%016 PRIx64 /size % PRId64
+  specified, orig_addr, orig_size);
 goto exit;
 }
 if (fwrite(buf, 1, l, f) != l) {
-- 
2.2.0.33.gc18b867

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--



Re: [Qemu-devel] [PATCH] memsave: Improve and disambiguate error message

2015-02-10 Thread Borislav Petkov
On Sun, Feb 08, 2015 at 10:46:05PM +0100, Paolo Bonzini wrote:
 I use :set shiftwidth=4 expandtab and c-indent mode.
 
 Quick googling for vim four spaces suggests the following alternatives:
 
 * :set tabstop=8 expandtab shiftwidth=4 softtabstop=4
 
 * set tabstop=8 softtabstop=0 expandtab shiftwidth=4 smarttab.

Thanks Paolo!

:-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--



Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-15 Thread Borislav Petkov
On Mon, Dec 14, 2015 at 07:17:27PM -0500, Raj, Ashok wrote:
> I can see how this hurts.. since the poller isn't doing cpu model
> specific stuff..?

The poller sees mca_cfg.ser set on an AMD guest and then the whole
handling/decoding goes wrong.

> in the LMCE case, even if you advertise MCG_LMCE_P in MCG_CAP, the
> guest kernel wont call intel_init_lmce() only from mce_intel.c.. so
> the same problem won't happen.

You shouldn't advertise MCG_LMCE_P if the guest is not Intel. Those MCG
bits should be in the CPU model descriptor X86CPUDefinition.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 



Re: [Qemu-devel] freeze host when injecting NMIs in the guest, at least in 4.4-rc4+

2015-12-10 Thread Borislav Petkov
Yap,

this is clearly a qemu/kvm issue. Lemme remove ext4 folks from CC. So
here's what happens:

I boot a kvm guest, connect to its monitor (qemu is started with
"-monitor pty") and on the monitor I issue a couple of times the "nmi"
command. It doesn't explode immediately but it happens pretty often and
when it happens, the *host*(!) freezes with some nasty corruption, see
below.

Thoughts, suggestions, ideas?

...
[   13.564192] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  111.524485] kvm: zapping shadow pages for mmio generation wraparound
[  166.414836] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
(null)
[  172.408126] kvm: zapping shadow pages for mmio generation wraparound
[  200.912827] BUG: unable to handle kernel NULL pointer dereference at 
  (null)
[  200.920716] IP: [<  (null)>]   (null)
[  200.925796] PGD 0 
[  200.927833] Oops: 0010 [#1] PREEMPT SMP 
[  200.931258] Modules linked in: binfmt_misc ipv6 vfat fat fuse dm_crypt 
dm_mod kvm_amd kvm irqq
[  200.931274] CPU: 0 PID: 3936 Comm: qemu-system-x86 Not tainted 4.4.0-rc4+ #1
[  200.931275] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./M5A97 EVO R2.0, BIOS3
[  200.931276] task: 880037ba ti: 8804227d8000 task.ti: 
8804227d8000
[  200.931277] RIP: 0010:[<>]  [<  (null)>]   
(null)
[  200.931278] RSP: 0018:8804227dbcb0  EFLAGS: 00010292
[  200.931278] RAX: 880037ba RBX: 0292 RCX: 0003
[  200.931279] RDX: 880037ba0750 RSI:  RDI: 880037ba0001
[  200.931280] RBP: 8804227dbd08 R08: 0001 R09: 0011f89b6000
[  200.931280] R10: 825c4fd0 R11:  R12: 
[  200.931281] R13:  R14: 0001 R15: 
[  200.931282] FS:  7fec0faba700() GS:88042c60() 
knlGS:
[  200.931283] CS:  0010 DS:  ES:  CR0: 8005003b
[  200.931283] CR2:  CR3: b6566000 CR4: 000406f0
[  200.931283] Stack:
[  200.931285]  8121004e 8804  
0292
[  200.931287]  a0305d3c 880414612280 880414612210 
8804227dbe90
[  200.931288]  880414612130 880037ba 8121004e 
8804227dbd90
[  200.931288] Call Trace:
[  200.931293]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931295]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931298]  [] mutex_lock_nested+0x66/0x3d0
[  200.931299]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931301]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931303]  [] ? __lock_acquire+0x55a/0x1f80
[  200.931305]  [] ext4_file_write_iter+0x8e/0x470
[  200.931307]  [] ? __fget+0x11e/0x210
[  200.931309]  [] ? percpu_down_read+0x44/0x90
[  200.931311]  [] __vfs_write+0xad/0xe0
[  200.931313]  [] vfs_write+0xb5/0x1a0
[  200.931315]  [] SyS_write+0x52/0xc0
[  200.931317]  [] entry_SYSCALL_64_fastpath+0x16/0x6f
[  200.931319] Code:  Bad RIP value.
[  200.931320] RIP  [<  (null)>]   (null)
[  200.931321]  RSP 
[  200.931321] CR2: 
[  200.939353] ---[ end trace 70675f54590d7755 ]---
[  200.939370] note: qemu-system-x86[3936] exited with preempt_count 1


-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--



Re: [Qemu-devel] freeze host when injecting NMIs in the guest, at least in 4.4-rc4+

2015-12-10 Thread Borislav Petkov
On Thu, Dec 10, 2015 at 05:49:10PM +0100, Paolo Bonzini wrote:
> Can you try it on Intel?

Just did, there it splats even when booting the guest, without even
injecting NMIs:

[  113.233992] ===
[  113.238192] [ INFO: suspicious RCU usage. ]
[  113.242393] 4.4.0-rc4+ #1 Not tainted
[  113.246056] ---
[  113.250257] arch/x86/kvm/trace.h:29 suspicious rcu_dereference_check() usage!
[  113.257400] 
   other info that might help us debug this:

[  113.265432] 
   RCU used illegally from idle CPU!
   rcu_scheduler_active = 1, debug_locks = 0
[  113.276321] RCU used illegally from extended quiescent state!
[  113.282083] 1 lock held by qemu-system-x86/2275:
[  113.286711]  #0:  (>mutex){+.+.+.}, at: [] 
vcpu_load+0x1c/0x80 [kvm]
[  113.295270] 
   stack backtrace:
[  113.299644] CPU: 4 PID: 2275 Comm: qemu-system-x86 Not tainted 4.4.0-rc4+ #1
[  113.306706] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 
05/11/2014
[  113.314122]  0001 88042f263d28 813c2cfc 
880432d53000
[  113.321575]  88042f263d58 810c5157 88042f268000 

[  113.329027]   0001 88042f263e10 
a01ee7c8
[  113.336483] Call Trace:
[  113.338939]  [] dump_stack+0x4e/0x82
[  113.344098]  [] lockdep_rcu_suspicious+0xe7/0x120
[  113.350385]  [] kvm_arch_vcpu_ioctl_run+0x16f8/0x19c0 [kvm]
[  113.357544]  [] ? kvm_arch_vcpu_ioctl_run+0xd0/0x19c0 [kvm]
[  113.364695]  [] kvm_vcpu_ioctl+0x342/0x700 [kvm]
[  113.370896]  [] ? __lock_is_held+0x4d/0x70
[  113.376583]  [] ? __fget+0xfe/0x200
[  113.381662]  [] do_vfs_ioctl+0x301/0x550
[  113.387156]  [] ? __fget_light+0x2a/0x90
[  113.392654]  [] SyS_ioctl+0x41/0x70
[  113.397739]  [] entry_SYSCALL_64_fastpath+0x16/0x7a
[  113.755008] kvm: zapping shadow pages for mmio generation wraparound

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Borislav Petkov
On Mon, Dec 14, 2015 at 02:23:56PM -0200, Eduardo Habkost wrote:
> > -#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> > +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P|MCG_LMCE_P)
> 
> This makes mcg_cap change when upgrading QEMU.
> 
> VMs with MCG_LMCE_P enabled shouldn't be migratable to hosts
> running older kernels, or the guest may try to read or write
> MSR_IA32_MCG_EXT_CTL after miration and get a #GP. That means:
> 
> 1) Older machine-types (pc-2.5 and older) should keep the
>old (MCG_CTL_P|MCG_SER_P) default.
> 2) We can't make pc-2.6 enable LMCE by default, either, because
>QEMU guarantees that just changing the machine-type shouldn't
>introduce new host requirements (see:
>http://article.gmane.org/gmane.comp.emulators.qemu/346651)
> 
> It looks like we need a new -cpu option to enable the feature,
> then. At least until we raise the minimum kernel version
> requirements of QEMU.

... and obviously LMCE is vendor-specific so it cannot be enabled on
!Intel guests with a define like that. mce_init() in qemu should check
vendor too.

The same mistake was done with SER_P but that's much harder to change,
as we discussed previously.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 



Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Borislav Petkov
On Mon, Dec 14, 2015 at 02:11:46PM -0500, Raj, Ashok wrote:
> This is mostly harmless.. since the MCG_CAP space is shared and has no
> conflict between vendors. Also just the CAP being set has no effect.

Of course it does - we check SER_P in machine_check_poll() and when
I emulate an AMD guest and inject errors into it, error handling is
obviously wrong, see:

https://lkml.kernel.org/r/20151123150355.ge5...@pd.tnic

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 



Re: [Qemu-devel] [PATCH 3/3] target-i386: kvm: Print warning when clearing mcg_cap bits

2015-11-25 Thread Borislav Petkov
On Wed, Nov 25, 2015 at 01:49:49PM -0200, Eduardo Habkost wrote:
> Instead of silently clearing mcg_cap bits when the host doesn't
> support them, print a warning when doing that.

Why the host? Why would we want there to be any relation between the MCA
capabilities of the host and what qemu is emulating?

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



Re: [Qemu-devel] [PATCH 3/3] target-i386: kvm: Print warning when clearing mcg_cap bits

2015-11-25 Thread Borislav Petkov
On Wed, Nov 25, 2015 at 06:29:25PM +0100, Paolo Bonzini wrote:
> On 25/11/2015 18:21, Borislav Petkov wrote:
> >> Instead of silently clearing mcg_cap bits when the host doesn't
> >> > support them, print a warning when doing that.
> > Why the host? Why would we want there to be any relation between the MCA
> > capabilities of the host and what qemu is emulating?
> 
> He means the hypervisor. :)

Ah, ok. :)

Then they look good to me, a step in the right direction.

Acked-by: Borislav Petkov <b...@suse.de>

Thanks!

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



Re: [Qemu-devel] MCG_CAP ABI breakage (was Re: [PATCH] target-i386: Do not set MCG_SER_P by default)

2015-11-24 Thread Borislav Petkov
On Tue, Nov 24, 2015 at 02:36:20PM -0200, Eduardo Habkost wrote:
> KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It
> calls kvm_vcpu_ioctl_x86_set_mce(), which stores the
> IA32_MCi_{STATUS,ADDR,MISC} register contents at
> vcpu->arch.mce_banks.

Ah, correct. I've mistakenly followed KVM_X86_SETUP_MCE and not
KVM_X86_SET_MCE, sorry.

Ok, so this makes more sense now - there's kvm_inject_mce_oldstyle() in
qemu and kvm_arch_on_sigbus_vcpu() which is on the SIGBUS handler path
actually does:

if ((env->mcg_cap & MCG_SER_P) && addr
&& (code == BUS_MCEERR_AR || code == BUS_MCEERR_AO)) {
...

I betcha that MCG_SER_P is set on every guest, even !Intel ones. I need
to go stare more at that code.

> I didn't check the QEMU MCE code to confirm that, but I assume it
> is implemented there. In that case, MCG_SER_P in
> KVM_MCE_CAP_SUPPORTED just indicates it can be implemented by
> userspace, as long as it makes the appropriate KVM_X86_SET_MCE
> (or maybe KVM_SET_MSRS?) calls.

I think it is that kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus()
which handles SIGBUS with BUS_MCEERR_AR/BUS_MCEERR_AO si_code. See
mm/memory-failure.c:kill_proc() in the kernel where we do send those
signals to processes.

However, I still think the MCG_SER_P bit being set on
!Intel is wrong even though the recovery action done by
kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus() is correct.

Why, you're asking. :-)

Well, what happens above is that the qemu process gets the signal that
there was an uncorrectable error detected in its memory and it is either
required to do something: BUS_MCEERR_AR == Action Required or its action
is optional: BUS_MCEERR_AO == Action Optional.

The SER_P text in the SDM describes those two:

"SRAO errors indicate that some data in the system is corrupt, but the
data has not been consumed and the processor state is valid. SRAO errors
provide the additional error information for system software to perform
a recovery action. An SRAO error is indicated with UC=1, PCC=0, S=1,
EN=1 and AR=0 in the IA32_MCi_STATUS register."

and

"Software recoverable action required (SRAR) - a UCR error that requires
system software to take a recovery action on this processor before
scheduling another stream of execution on this processor. SRAR errors
indicate that the error was detected and raised at the point of the
consumption in the execution flow. An SRAR error is indicated with UC=1,
PCC=0, S=1, EN=1 and AR=1 in the IA32_MCi_STATUS register."

And for that we don't need to look at SER_P in qemu - we only need to
know what the error severity of the error is and then we go and handle
accordingly.

Because those two si_codes are purely software-defined. And the
application which gets that SIGBUS type doesn't need to care about
SER_P.

For example, AMD has similar error severities and they can be injected
into qemu too. And qemu can do the exact same recovery actions based on
the severity without even looking at the SER_P bit.

So here's the problem:

* SER_P is set on all guests and it puzzles kernels running on !Intel
guests.

* Hardware error recovery actions can be done regardless of that bit.

The only case where that bit makes sense is if the emulated hardware
itself is generating accurate MCEs and then, as a result, wants to make
generate accurate error signatures:

SRAO:   UC=1, PCC=0, S=1, EN=1 and AR=0
SRAR:   UC=1, PCC=0, S=1, EN=1 and AR=1

Those bits should have these settings only when the emulated hw actually
implements SER_P. Otherwise, you'd get those old crude MCEs which are
either uncorrectable and generate an #MC or are correctable errors.

But ok, let me go do some staring at the examples you sent me
previously first. I might get a better idea after I sleep on it.

:-)

Thanks!

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



Re: [Qemu-devel] [PATCH] target-i386: Do not set MCG_SER_P by default

2015-11-20 Thread Borislav Petkov
On Sat, Nov 21, 2015 at 12:11:35AM +0100, Andreas Färber wrote:
> Hi,
> 
> CC'ing qemu-devel.

Ah, thanks.

> Am 21.11.2015 um 00:01 schrieb Borislav Petkov:
> > From: Borislav Petkov <b...@suse.de>
> > 
> > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > shouldn't be set by default. Enable it only on Intel.
> 
> Is this new in 2.5? Otherwise we would probably need compatibility code
> in pc*.[ch] for incoming live migration from older versions.

It looks it is really old, AFAIK from 2010:

c0532a76b407 ("MCE: Relay UCR MCE to guest")

You'd need to be more verbose about pc*.[ch]. An example perhaps...?

> >  mcg_cap &= MCE_CAP_DEF;
> >  mcg_cap |= banks;
> > +
> > +   if (IS_INTEL_CPU(env))
> > +   mcg_cap |= MCG_SER_P;
> 
> Tabs and missing braces.

Ok.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



Re: [Qemu-devel] MCG_CAP ABI breakage (was Re: [PATCH] target-i386: Do not set MCG_SER_P by default)

2015-11-23 Thread Borislav Petkov
On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote:
> On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
> [...]
> > In the case of this code, it looks like it's already broken
> > because the resulting mcg_cap depends on host kernel capabilities
> > (the ones reported by kvm_get_mce_cap_supported()), and the data
> > initialized by target-i386/cpu.c:mce_init() is silently
> > overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> > before implementing a proper compatibility mechanism for
> > mcg_cap.
> 
> Fortunately, when running Linux v2.6.37 and later,
> kvm_arch_init_vcpu() won't actually change mcg_cap (see details
> below).
> 
> But the code is broken if running on Linux between v2.6.32 and
> v2.6.36: it will clear MCG_SER_P silently (and silently enable
> MCG_SER_P when migrating to a newer host).
> 
> But I don't know what we should do on those cases. If we abort
> initialization when the host doesn't support MCG_SER_P, all CPU
> models with MCE and MCA enabled will become unrunnable on Linux
> between v2.6.32 and v2.6.36. Should we do that, and simply ask
> people to upgrade their kernels (or explicitly disable MCE) if
> they want to run latest QEMU?
> 
> For reference, these are the capabilities returned by Linux:
> * KVM_MAX_MCE_BANKS is 32 since
>   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
>   5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)

The commit message of that one says that there is MCG_SER_P support in
the kernel.

The previous commit talks about MCE injection with KVM_X86_SET_MCE
ioctl but frankly, I don't see that. From looking at the current code,
KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets
MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is

#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)

So it basically sets those two supported bits. But how is

supported == actually present

?!?!

That soo doesn't make any sense.

> * KVM_MCE_CAP_SUPPORTED is MCG_CTL_P between
>   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
>   and 5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
> 
> The current definitions in QEMU are:
> #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> #define MCE_BANKS_DEF   10

That's also wrong. The number of banks is, of course,
generation-specific. The qemu commit adding that is

79c4f6b08009 ("QEMU: MCE: Add MCE simulation to qemu/tcg")

and I really don't get what the reasoning behind it is? Is this supposed
to mimick a certain default CPU which is not really resembling a real
CPU or some generic CPU or what is it?

Because the moment you start qemu with -cpu , all that
MCA information is totally wrong.

> The target-i386/cpu.c:mce_init() code sets mcg_cap to:
> env->mcg_cap == MCE_CAP_DEF | MCE_BANKS_DEF;
>  == (MCG_CTL_P|MCG_SER_P) | 10;
> 
> The kvm_arch_init_vcpu() code that changes mcg_cap does the following:
> kvm_get_mce_cap_supported(cs->kvm_state, _cap, );
> if (banks > MCE_BANKS_DEF) {
> banks = MCE_BANKS_DEF;
> }
> mcg_cap &= MCE_CAP_DEF;
> mcg_cap |= banks;
> env->mcg_cap = mcg_cap;
>   * Therefore, if running Linux v2.6.37 or newer, this will be
> the result:
> banks == 10;
> mcg_cap == (MCG_CTL_P|MCG_SER_P) | banks
> == (MCG_CTL_P|MCG_SER_P) | 10;
> * That's the same value set by mce_init(), so fortunately
>   kvm_arch_init_vcpu() isn't actually changing mcg_cap
>   if running Linux v.2.6.37 and newer.
>   * However, if running Linux between v2.6.32 and v2.6.37,
> kvm_arch_init_vcpu() will silently clear MCG_SER_P.

I don't understand - you want that cleared when emulating a !Intel CPU
or an Intel CPU which doesn't support SER. This bit should be clear
there. Why should be set at all there?

Hmmm?

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



Re: [Qemu-devel] [PATCH] target-i386: Do not set MCG_SER_P by default

2015-11-23 Thread Borislav Petkov
+ Tony.

On Mon, Nov 23, 2015 at 03:47:44PM +0100, Paolo Bonzini wrote:
> On 23/11/2015 14:22, Eduardo Habkost wrote:
> > > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > > shouldn't be set by default. Enable it only on Intel.
> > 
> > What happens when SER is enabled on an AMD CPU? If it really
> > should't be enabled, why is KVM returning it on
> > KVM_X86_GET_MCE_CAP_SUPPORTED?
> 
> Indeed... is it a problem if our frankenstein AMD CPU can recover from
> memory errors?

The problem stems from the fact that the guest kernel looks at SER and
does different handling depending on that bit:

machine_check_poll:

...

if (!(flags & MCP_UC) &&
(m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
continue;

so when the guest kernel gets a correctable error (injected..., for
example) it sees that bit set. Even though kvm/qemu emulates an AMD
CPU. So on AMD with that bit set, it would puzzle the whole error
handling/reporting in the guest kernel.

Oh, btw, I'm using a kvm guest to inject MCEs. In case you were
wondering why is he even doing that. :-)

And I'm not sure it makes sense to set that bit for an Intel guest too.
For the simple reason that I don't know how much of the Software Error
Recovery stuff is actually implemented there. If stuff is missing, you
probably don't want to advertize it there too. And by "stuff" I mean
all that fun in section "15.6 RECOVERY OF UNCORRECTED RECOVERABLE (UCR)
ERRORS" of the SDM.

It's a whole another question how/whether to do UCR error handling in
the guest or maybe leave it to the host...

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



Re: [Qemu-devel] MCG_CAP ABI breakage (was Re: [PATCH] target-i386: Do not set MCG_SER_P by default)

2015-11-23 Thread Borislav Petkov
On Mon, Nov 23, 2015 at 05:42:08PM -0200, Eduardo Habkost wrote:
> I will let the people working on the actual MCE emulation in KVM
> answer that. I am assuming that KVM_MCE_CAP_SUPPORTED is set to
> something that makes sense.

Well, that should be, IMHO, the same like all those feature bits
assigned to the ->feature arrays of the different cpu types in qemu's
X86CPUDefinition descriptors.

> Note that we don't mimick every single detail of real CPUs out
> there, and this isn't necessarily a problem (although sometimes
> we choose bad defaults). Do you see real world scenarios when
> choosing 10 as the default causes problems for guest OSes, or you
> just worry that this might cause problems because it doesn't
> match any real-world CPU?

Well, the problems would come when the guests start using the MCA
infrastructure bits. That's why I asked how exactly do people imagine of
doing all the hardware errors handling in the guest.

I know we do something with poisoning pages, i.e.
kvm_send_hwpoison_signal() and all that machinery around it but in that
particular case it is the hypervisor which marks the pages as poison
and kvm notices that on the __get_user_pages() path and the error is
injected into the guest. AFAICT, of course.

In my case, I'm injecting a HW error in the guest kernel by writing into
the *guest* MSRs and the *guest* kernel MCA code is supposed to handle
the error.

And the problem here is that I'm emulating an AMD guest. But a guest
which sports an Intel-only feature and that puzzles the guest kernel.

Does that make more sense? I hope...

> If we really care about matching the number of banks of real
> CPUs, we can make it configurable, defined by the CPU model,
> and/or have better defaults in future machine-types. That won't
> be a problem.

I think we should try to do that if we're striving for accurate
emulation of guest CPUs. But then there's the migration use-case which
has different focus...

> But I still don't know what we should do when somebody runs:
>   -machine pc-i440fx-2.4 -cpu Penryn
> on a host kernel that doesn't report MCG_SER_P on
> KVM_MCE_CAP_SUPPORTED.

Right, before we ask that question we should ask the more generic one:
how do people want to do error handling in the guest? Do they even want
to? More importantly, does it even make sense to handle hardware errors
in the guest? If so, which and if not, why not?

I mean, no one would've noticed the MCG_SER_P issue if no one would've
tried to use it and what it implies. So it all comes down to whether the
guest uses the emulated feature.

It seems to me this issue remained unnoticed for such a long time now
for the simple reason that nothing used it. So nothing in the guest
cared whether SER_P is set or not, or how many MCA banks are there.

So if you run "-machine pc-i440fx-2.4 -cpu Penryn" it wouldn't matter
because, AFAIK - and correct me if I'm wrong here - the guest never
got to see the Action Required and Action Optional MCEs which are the
result from SER_P support. So the guest didn't care.

Yes, no, am I missing something completely here?

> I am just saying we already clear it when running on Linux
> v2.6.32-v2.6.36, it doesn't matter the host CPU or the -cpu
> options we have. And we do not clear it when running Linux
> v2.6.37 or newer. That's the behavior of pc-*-2.4 and older, even
> if we change it on future machine-types.

Right, ok. So the fact that it was clear in the v2.6.32-v2.6.36 frame
and set later and nothing complained, *probably* confirms my theory that
the guest didn't really care about that setting and it probably doesn't
do now either... Unless you try to use it, like I did :-)

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.



[Qemu-devel] qemu x86 CPUID leafs override

2017-11-23 Thread Borislav Petkov
Hi guys,

I'm using the hack below to do some quick kernel testing by setting
arbitrary feature bits and then make it execute the code for that
feature.

For example, boot with:

-cpu EPYC,cpuid-leaf=0x8007,ebx=0xf

to set some RAS feature bits and test newer RAS code.

Would something like that be of interest to a wider audience?

It is rough and ugly but if deemed useful, I could try to clean it up.

Thx.

---
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 045d66191f28..249fb23be696 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -2732,6 +2732,13 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 uint32_t limit;
 uint32_t signature[3];
 
+/*
+ * Pull up max xlevel in case the one we've specified on the cmdline is
+ * higher.
+ */
+if (cpu->cpuid_leaf && env->cpuid_xlevel < cpu->cpuid_leaf)
+   env->cpuid_xlevel = cpu->cpuid_leaf;
+
 /* Calculate & apply limits for different index ranges */
 if (index >= 0xC000) {
 limit = env->cpuid_xlevel2;
@@ -3140,6 +3147,22 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *edx = 0;
 break;
 }
+
+/* Do CPUID overrides: */
+if (cpu->cpuid_leaf && cpu->cpuid_leaf == index) {
+
+   if (cpu->eax)
+   *eax = cpu->eax;
+
+   if (cpu->ebx)
+   *ebx = cpu->ebx;
+
+   if (cpu->ecx)
+   *ecx = cpu->ecx;
+
+   if (cpu->edx)
+   *edx = cpu->edx;
+}
 }
 
 /* CPUClass::reset() */
@@ -4173,6 +4196,11 @@ static Property x86_cpu_properties[] = {
  * to the specific Windows version being used."
  */
 DEFINE_PROP_INT32("x-hv-max-vps", X86CPU, hv_max_vps, -1),
+DEFINE_PROP_UINT32("cpuid-leaf", X86CPU, cpuid_leaf, UINT32_MAX),
+DEFINE_PROP_UINT32("eax", X86CPU, eax, UINT32_MAX),
+DEFINE_PROP_UINT32("ebx", X86CPU, ebx, UINT32_MAX),
+DEFINE_PROP_UINT32("ecx", X86CPU, ecx, UINT32_MAX),
+DEFINE_PROP_UINT32("edx", X86CPU, edx, UINT32_MAX),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index b086b1528b89..b336b0849456 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1284,6 +1284,13 @@ struct X86CPU {
 int32_t thread_id;
 
 int32_t hv_max_vps;
+
+/*
+ * CPUID overrides:
+ */
+uint32_t cpuid_leaf;
+uint32_t eax, ebx, ecx, edx;
+
 };
 
 static inline X86CPU *x86_env_get_cpu(CPUX86State *env)

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.



Re: [Qemu-devel] [PATCH v8 05/28] target/i386: add memory encryption feature cpuid support

2018-02-12 Thread Borislav Petkov
On Mon, Feb 12, 2018 at 03:07:26PM -0600, Brijesh Singh wrote:
> In current implementation, when -cpu ...,+sev is passed without
> appropriate SEV configuration then we populate the Fn8000_001F CPUID but
> VMCB will not have SEV bit set hence a guest will be launched as
> non-SEV.

I think we should fail the guest init if sev is not really supported by
the host. Otherwise people might get mislead.

> > Changing existing CPU models is possible only on very specific
> > circumstances (where VMs that are currently runnable would always
> > stay runnable), and would require compat_props entries to keep
> > compatibility on existing machine-types.
> 
> Ah I didn't consider that case. What is recommendation, should we create
> a new CPU Model (EPYC-SEV) ?

Can we please stop creating a new CPU model with every new CPUID feature
support added? This is just ridiculous.

If this is about live migration, then by all means, fail the migration
if the feature bits are not compatible. But replicating CPU models and
then adding one new differing feature doesn't make any sense.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 



Re: [Qemu-devel] [PATCH v8 05/28] target/i386: add memory encryption feature cpuid support

2018-02-13 Thread Borislav Petkov
On Tue, Feb 13, 2018 at 09:39:01AM -0600, Brijesh Singh wrote:
> Yes, I think we should be able to avoid creating new CPU model to
> handle this case. I am leaning towards dropping this patch and
> implement logic to populate the CPUID 0x8000_001F only when SEV is
> enabled. This should not require any changes in existing CPU model
> feature flag and live migration of existing guest should work fine.

That sounds even nicer.

Thx.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 



Re: [Qemu-devel] [RFC PATCH v1 01/10] KVM: SVM: Add KVM_SEV SEND_START command

2019-04-29 Thread Borislav Petkov
On Mon, Apr 29, 2019 at 03:01:24PM +, Singh, Brijesh wrote:
> Practically I don't see any reason why caller would do that but
> theoretically it can. If we cache the len then we also need to consider
> adding another flag to hint whether userspace ever requested length.
> e.g an application can compute the length of session blob by looking at
> the API version and spec and may never query the length.
> 
> > I mean I'm still thinking defensively here but maybe the only thing that
> > would happen here with a bigger buffer is if the kmalloc() would fail,
> > leading to eventual failure of the migration.
> > 
> > If the code limits the allocation to some sane max length, the migration
> > won't fail even if userspace gives it too big values...

So what about this? Limiting to a sane length...

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.



Re: [Qemu-devel] [RFC PATCH v1 01/10] KVM: SVM: Add KVM_SEV SEND_START command

2019-04-26 Thread Borislav Petkov
On Fri, Apr 26, 2019 at 02:29:31PM +, Singh, Brijesh wrote:
> Yes that's doable but I am afraid that caching the value may lead us to
> wrong path and also divergence from the SEV API spec. The spec says the
> returned length is a minimum length but its possible that caller can
> give a bigger buffer and FW will still work with it.

Does the caller even have a valid reason to give a bigger buffer len?

I mean I'm still thinking defensively here but maybe the only thing that
would happen here with a bigger buffer is if the kmalloc() would fail,
leading to eventual failure of the migration.

If the code limits the allocation to some sane max length, the migration
won't fail even if userspace gives it too big values...

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.



Re: [Qemu-devel] [RFC PATCH v1 01/10] KVM: SVM: Add KVM_SEV SEND_START command

2019-04-26 Thread Borislav Petkov
On Wed, Apr 24, 2019 at 04:09:59PM +, Singh, Brijesh wrote:
> The command is used to create an outgoing SEV guest encryption context.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Paolo Bonzini 
> Cc: "Radim Krčmář" 
> Cc: Joerg Roedel 
> Cc: Borislav Petkov 
> Cc: Tom Lendacky 
> Cc: x...@kernel.org
> Cc: k...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Brijesh Singh 
> ---
>  .../virtual/kvm/amd-memory-encryption.rst |  24 +
>  arch/x86/kvm/svm.c| 101 ++
>  include/uapi/linux/kvm.h  |  12 +++
>  3 files changed, 137 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/amd-memory-encryption.rst 
> b/Documentation/virtual/kvm/amd-memory-encryption.rst
> index 659bbc093b52..340ac4f87321 100644
> --- a/Documentation/virtual/kvm/amd-memory-encryption.rst
> +++ b/Documentation/virtual/kvm/amd-memory-encryption.rst
> @@ -238,6 +238,30 @@ Returns: 0 on success, -negative on error
>  __u32 trans_len;
>  };
>  
> +10. KVM_SEV_SEND_START
> +--
> +
> +The KVM_SEV_SEND_START command can be used by the hypervisor to create an
> +outgoing guest encryption context.
> +
> +Parameters (in): struct kvm_sev_send_start
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +struct kvm_sev_send_start {
> +__u32 policy; /* guest policy */
> +
> +__u64 pdh_cert_uaddr; /* platform Diffie-Hellman 
> certificate */
> +__u32 pdh_cert_len;
> +
> +__u64 plat_cert_uaddr;/* platform certificate chain 
> */
> +__u32 plat_cert_len;
> +
> +__u64 amd_cert_uaddr; /* AMD certificate */
> +__u32 amd_cert_len;

__u64 session_uaddr;
__u32 session_len;

too, right?

> +};
> +
>  References
>  ==
>  
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 406b558abfef..4c2a225ba546 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -6955,6 +6955,104 @@ static int sev_launch_secret(struct kvm *kvm, struct 
> kvm_sev_cmd *argp)
>   return ret;
>  }
>  
> +static int sev_send_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = _kvm_svm(kvm)->sev_info;
> + void *amd_cert = NULL, *session_data = NULL;
> + void *pdh_cert = NULL, *plat_cert = NULL;
> + struct sev_data_send_start *data = NULL;
> + struct kvm_sev_send_start params;
> + int ret;
> +
> + if (!sev_guest(kvm))
> + return -ENOTTY;
> +
> + if (copy_from_user(, (void __user *)(uintptr_t)argp->data,
> + sizeof(struct kvm_sev_send_start)))
> + return -EFAULT;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL);
> + if (!data)
> + return -ENOMEM;
> +
> + /* userspace wants to query the session length */
> + if (!params.session_len)
> + goto cmd;
> +
> + if (!params.pdh_cert_uaddr || !params.pdh_cert_len ||
> + !params.session_uaddr)
> + return -EINVAL;
> +
> + /* copy the certificate blobs from userspace */
> + pdh_cert = psp_copy_user_blob(params.pdh_cert_uaddr, 
> params.pdh_cert_len);
> + if (IS_ERR(pdh_cert)) {
> + ret = PTR_ERR(pdh_cert);
> + goto e_free;
> + }
> +
> + data->pdh_cert_address = __psp_pa(pdh_cert);
> + data->pdh_cert_len = params.pdh_cert_len;
> +
> + plat_cert = psp_copy_user_blob(params.plat_cert_uaddr, 
> params.plat_cert_len);
> + if (IS_ERR(plat_cert)) {
> + ret = PTR_ERR(plat_cert);
> + goto e_free_pdh;
> + }
> +
> + data->plat_cert_address = __psp_pa(plat_cert);
> + data->plat_cert_len = params.plat_cert_len;
> +
> + amd_cert = psp_copy_user_blob(params.amd_cert_uaddr, 
> params.amd_cert_len);
> + if (IS_ERR(amd_cert)) {
> + ret = PTR_ERR(amd_cert);
> + goto e_free_plat_cert;
> + }
> +
> + data->amd_cert_address = __psp_pa(amd_cert);
> + data->amd_cert_len = params.amd_cert_len;
> +
> + ret = -ENOMEM;
> + session_data = kmalloc(params.session_len, GFP_KERNEL);

If the user is supposed to query the session length first, you could
save it in a global variable perhaps and use that value instead of
trusting the user to give you the correct one in params.session_len for
the allocation...

> + if (!session_data)
> + go

Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-30 Thread Borislav Petkov
On Fri, Dec 30, 2022 at 04:54:27PM +0100, Jason A. Donenfeld wrote:
> > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:
> > 
> > early console in extract_kernel
> > input_data: 0x0be073a8
> > input_len: 0x008cfc43
> > output: 0x0100
> > output_len: 0x0b600a98
> > kernel_total_size: 0x0ac26000
> > needed_size: 0x0b80
> > trampoline_32bit: 0x0009d000
> > 
> > so that's a ~9M kernel which gets decompressed at 0x100 and the
> > output len is, what, ~180M which looks like plenty to me...
> 
> I think you might have misunderstood the thread.

Maybe...

I've been trying to make sense of all the decompression dancing we're doing and
the diagrams you've drawn and there's a comment over extract_kernel() which
basically says that the compressed image is at the back of the memory buffer

input_data: 0x0be073a8

and we decompress to a smaller address

output: 0x0100

And, it *looks* like setup_data is at an address somewhere >= 0x100 so when
we start decompressing, we overwrite it.

I guess extract_kernel() could also dump the setup_data addresses so that we can
verify that...

> First, to reproduce the bug that this patch fixes, you need a kernel with a
> compressed size of around 16 megs, not 9.

Not only that - you need setup_data to be placed somewhere just so that it gets
overwritten during decompression. Also, see below.
 
> Secondly, that crash is well understood and doesn't need to be reproduced;

Is it?

Where do you get the 0x10 as the starting address of the kernel image?

Because input_data above says that the input address is somewhere higher...

Btw, here's an allyesconfig:

early console in setup code
No EFI environment detected.
early console in extract_kernel
input_data: 0x1bd003a8
input_len: 0x0cde2963
output: 0x0100
output_len: 0x27849d74
kernel_total_size: 0x2583
needed_size: 0x27a0
trampoline_32bit: 0x0009d000
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

That kernel has compressed size of ~205M and the output buffer is of size ~632M.

> this patch fixes it. Rather, the question now is how to improve this patch
> to remove the 62 meg limit.

Yeah, I'm wondering what that 62M limit is too, actually.

But maybe I'm misunderstanding...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-30 Thread Borislav Petkov
On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote:
> Look closer at the boot process. The compressed image is initially at
> 0x10, but it gets relocated to a safer area at the end of
> startup_64:

That is the address we're executing here from, rip here looks like 0x100xxx.

> /*
>  * Copy the compressed kernel to the end of our buffer
>  * where decompression in place becomes safe.
>  */
> pushq   %rsi
> leaq(_bss-8)(%rip), %rsi
> leaqrva(_bss-8)(%rbx), %rdi

when you get to here, it looks something like this:

leaq(_bss-8)(%rip), %rsi# 0x9e7ff8
leaqrva(_bss-8)(%rbx), %rdi # 0xc6eeff8

so the source address is that _bss thing and we copy...

> movl$(_bss - startup_32), %ecx
> shrl$3, %ecx
> std

... backwards since DF=1.

Up to:

# rsi = 0x8
# rdi = 0xbe06ff8

Ok, so the source address is 0x10. Good.

> HOWEVER, qemu currently appends setup_data to the end of the
> compressed kernel image,

Yeah, you mean the kernel which starts executing at 0x10, i.e., that part
which is compressed/head_64.S and which does the above and the relocation etc.

> and this part isn't moved, and setup_data links aren't walked/relocated. So
> that means the original address remains, of 0x10.

See above: when it starts copying the kernel image backwards to a higher
address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data
*after* that address. And that doesn't get copied ofc.

So far, so good.

Now later, we extract the compressed kernel created with the mkpiggy magic:

input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:

by doing

/*
 * Do the extraction, and jump to the new kernel..
 */

pushq   %rsi/* Save the real mode argument */   
0x13d00
movq%rsi, %rdi  /* real mode address */ 
0x13d00
leaqboot_heap(%rip), %rsi   /* malloc area for uncompression */ 
0xc6ef000
leaqinput_data(%rip), %rdx  /* input_data */
0xbe073a8
movlinput_len(%rip), %ecx   /* input_len */ 
0x8cfe13
movq%rbp, %r8   /* output target address */ 
0x100
movloutput_len(%rip), %r9d  /* decompressed length, end of relocs */
callextract_kernel  /* returns kernel location in %rax */
popq%rsi

(actual addresses at the end.)

Now, when you say you triplefault somewhere in initialize_identity_maps() when
trying to access setup_data, then if you look a couple of lines before that call
we do

call load_stage2_idt

which sets up a boottime #PF handler do_boot_page_fault() and it actually does
call kernel_add_identity_map() so *actually* it should map any unmapped
setup_data addresses.

So why doesn't it do that and why do you triplefault?

Hmmm.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-31 Thread Borislav Petkov
On Fri, Dec 30, 2022 at 04:59:30PM +0100, Jason A. Donenfeld wrote:
> I'll attach a .config file. Apply the patch at the top of this thread to
> qemu,

Hmm, so the patch at the top of this thread is fixing the clobbering of
setup_data.

But I tried latest qemu with your .config and it boots ok here. So how do I
repro the original issue you're trying to fix?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-31 Thread Borislav Petkov
On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote:
> This needs to be something like:
> 
> kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
> kernel_add_identity_map(sd_addr + sizeof(*sd),
>   sd_addr + sizeof(*sd) + sd->len);

It still #PFs with that:

(gdb) bt
#0  0x84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
#1  halt () at ./arch/x86/include/asm/irqflags.h:98
#2  early_fixup_exception (regs=regs@entry=0x84007dc8, 
trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
#3  0x846ff465 in do_early_exception (regs=0x84007dc8, 
trapnr=14) at arch/x86/kernel/head64.c:424
#4  0x846ff14f in early_idt_handler_common () at 
arch/x86/kernel/head_64.S:483
#5  0xc149f9894908788d in ?? ()
#6  0xff2003fc in ?? ()
#7  0x0010 in fixed_percpu_data ()
#8  0xdc00 in ?? ()
#9  0x84007ea8 in init_thread_union ()
#10 0xff20088d in ?? ()
#11 0x in ?? ()

/me goes to dig more.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-31 Thread Borislav Petkov
On Sat, Dec 31, 2022 at 01:54:50PM +0100, Jason A. Donenfeld wrote:
> Nothing special... `-kernel bzImage` should be enough to do it. Eric
> reported it, and then I was able to repro trivially. Sure you got the
> right version?

Yeah, qemu executables confusion here - had wrongly something older of the
version 7.1...

Now made sure I'm actually booting with the latest qemu:

QEMU emulator version 7.2.50 (v7.2.0-333-g222059a0fccf)

With that the kernel with your config hangs early during boot and the stack
trace is below.

Seeing how it says trapnr 14, then that looks like something you are seeing.

But lemme poke at it more.

#0  0x84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
#1  halt () at ./arch/x86/include/asm/irqflags.h:98
#2  early_fixup_exception (regs=regs@entry=0x84007dc8, 
trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
#3  0x846ff465 in do_early_exception (regs=0x84007dc8, 
trapnr=14) at arch/x86/kernel/head64.c:424
#4  0x846ff14f in early_idt_handler_common () at 
arch/x86/kernel/head_64.S:483
#5  0x2404c7410cd0 in ?? ()
#6  0xff20073c in ?? ()
#7  0x0010 in fixed_percpu_data ()
#8  0xdc00 in ?? ()
#9  0x84007ea8 in init_thread_union ()
#10 0xff200cd0 in ?? ()
#11 0x in ?? ()

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-31 Thread Borislav Petkov
On Sat, Dec 31, 2022 at 02:44:08PM +0100, Jason A. Donenfeld wrote:
> Are you using patch v1 minus the 62 MiB thing?

No, I want to see the original failure - the one that prompted you to send

https://lore.kernel.org/r/20221228143831.396245-1-ja...@zx2c4.com

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-31 Thread Borislav Petkov
On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
> That failure is unrelated to the ident mapping issue Peter and
> I discussed. The original failure is described in the commit message:
> decompression clobbers the data, so sd->next points to garbage.

Right, and the fact that the kernel overwrites it still feels kinda wrong: the
kernel knows where setup_data is - the address is in the setup header so
*actually*, it should take care of not to clobber it.

But hpa would correct me if I'm missing an aspect about whose responsibility it
is to do what here...

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-31 Thread Borislav Petkov
On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote:
> So with that understanding confirmed, I'm confused at your surprise that
> hpa's unrelated fix to the different issue didn't fix this issue.

No surprise there - I used a qemu variant without your patch to prevent the
setup_data clobbering so hpa's fix can't help there.

> But since the kernel doesn't do this now, and the 62MiB bug also seems
> to apply to existing kernels, for the purposes of QEMU for now, I think
> the v3 patch is probably best, since it'll handle existing kernels.

Right, we can't fix all those guests which are out there.

> Alternatively, setup_data could be relocated, the boot param protocol
> could be bumped, and then QEMU could conditionalized it's use of
> setup_data based on that protocol version. That'd work, but seems a bit
> more involved.

I think this is at least worth considering because the kernel overwriting
setup_data after having been told where that setup_data is located is not really
nice.

Well not explicitly at least - it gets the pointer to the first element and
something needs to traverse that list to know which addresses are live. But
still, that info is there so perhaps we need to take setup_data into
consideration too before decompressing...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2022-12-29 Thread Borislav Petkov
On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
> As far as a crash... that sounds like a big and a pretty serious one at that.
> 
> Could you let me know what kernel you are using and how *exactly* you are 
> booting it?

Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:

early console in extract_kernel
input_data: 0x0be073a8
input_len: 0x008cfc43
output: 0x0100
output_len: 0x0b600a98
kernel_total_size: 0x0ac26000
needed_size: 0x0b80
trampoline_32bit: 0x0009d000

so that's a ~9M kernel which gets decompressed at 0x100 and the
output len is, what, ~180M which looks like plenty to me...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2023-01-01 Thread Borislav Petkov
On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> > It would probably be a good idea to add a "maximum physical address for
> > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> > right now that those fields are being identity-mapped in the decompressor,
> > and that means that if 48-bit addressing is used, physical memory may extend
> > past the addressable range.
> 
> Yeah, we will probably need that too.
> 
> Btw, looka here - it can't get any more obvious than that after dumping
> setup_data too:
> 
> early console in setup code
> early console in extract_kernel
> input_data: 0x040f92bf
> input_len: 0x00f1c325
> output: 0x0100
> output_len: 0x03c5e7d8
> kernel_total_size: 0x04428000
> needed_size: 0x0460
> boot_params->hdr.setup_data: 0x010203b0
> trampoline_32bit: 0x0009d000
> 
> Decompressing Linux... Parsing ELF... done.
> Booting the kernel.
> 
> 
> Aligning them vertically:
> 
> output:   0x0100
> output_len:   0x03c5e7d8
> kernel_total_size:0x04428000
> needed_size:  0x0460
> boot_params->hdr.setup_data:  0x010203b0

Ok, waait a minute:


Field name: pref_address
Type:   read (reloc)
Offset/size:0x258/8
Protocol:   2.10+


  This field, if nonzero, represents a preferred load address for the
  kernel.  A relocating bootloader should attempt to load at this
  address if possible.

  A non-relocatable kernel will unconditionally move itself and to run
  at this address.

so a kernel loader (qemu in this case) already knows where the kernel goes:

boot_params->hdr.setup_data: 0x01020450
boot_params->hdr.pref_address: 0x0100
^

now, considering that same kernel loader (qemu) knows how big that kernel is:

kernel_total_size: 0x04428000

should that loader *not* put anything that the kernel will use in the range

pref_addr + kernel_total_size

?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2023-01-01 Thread Borislav Petkov
On Sat, Dec 31, 2022 at 07:21:06PM -0800, H. Peter Anvin wrote:
> As far as the decompression itself goes, it should only a problem if we are
> using physical KASLR since otherwise the kernel has a guaranteed safe zone
> already allocated by the boot loader. However, if physical KASLR is in use,

No KASLR in Jason's config AFAICT:

$ grep RANDOMIZE .config
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set

> then the decompressor needs to know everything there is to know about the
> memory map.

Yeah, we do have that but as you folks establish later in the thread, those
setup_data regions would need to be avoided too. ;-\
 
> However, there also seems to be some kind of interaction with AMD SEV-SNP.
> 
> 
> The bug appears to have been introduced by:
> 
> b57feed2cc2622ae14b2fa62f19e973e5e0a60cf
> x86/compressed/64: Add identity mappings for setup_data entries
> https://lore.kernel.org/r/tycpr01mb694815cd815e98945f63c99183...@tycpr01mb6948.jpnprd01.prod.outlook.com
> 
> ... which was included in version 5.19, so it is relatively recent.

Right. We need that for the CC blob:

b190a043c49a ("x86/sev: Add SEV-SNP feature detection/setup")

> For a small amount of setup_data, the solution of just putting it next to
> the command line makes a lot of sense, and should be safe indefinitely.

Ok.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2023-01-01 Thread Borislav Petkov
On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> It would probably be a good idea to add a "maximum physical address for
> initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> right now that those fields are being identity-mapped in the decompressor,
> and that means that if 48-bit addressing is used, physical memory may extend
> past the addressable range.

Yeah, we will probably need that too.

Btw, looka here - it can't get any more obvious than that after dumping
setup_data too:

early console in setup code
early console in extract_kernel
input_data: 0x040f92bf
input_len: 0x00f1c325
output: 0x0100
output_len: 0x03c5e7d8
kernel_total_size: 0x04428000
needed_size: 0x0460
boot_params->hdr.setup_data: 0x010203b0
trampoline_32bit: 0x0009d000

Decompressing Linux... Parsing ELF... done.
Booting the kernel.


Aligning them vertically:

output: 0x0100
output_len: 0x03c5e7d8
kernel_total_size:  0x04428000
needed_size:0x0460
boot_params->hdr.setup_data:0x010203b0

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

2023-01-02 Thread Borislav Petkov
On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote:
> So instead of appending data to the compressed image and assuming that
> it will stay in place, create or extend a memory reservation
> elsewhere, and refer to its absolute address in setup_data.

>From my limited experience with all those boot protocols, I'd say hardcoding
stuff is always a bad idea. But, we already more or less hardcode, or rather
codify through the setup header contract how stuff needs to get accessed.

And yeah, maybe specifying an absolute address and size for a blob of data and
putting that address and size in the setup header so that all the parties
involved are where what is, is probably better.

But WTH do I know...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-16 Thread Borislav Petkov
On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, 
> const char *fdname)
>   spin_lock_init(>mn_invalidate_lock);
>   rcuwait_init(>mn_memslots_update_rcuwait);
>   xa_init(>vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_init(>mem_attr_array);
> +#endif

if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
...

would at least remove the ugly ifdeffery.

Or you could create wrapper functions for that xa_init() and
xa_destroy() and put the ifdeffery in there.

> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm 
> *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)

I guess that function should have a verb in the name:

kvm_get_supported_mem_attributes()

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)
> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;

Dunno, shouldn't those issue some sort of an error message so that the
caller knows where it failed? Or at least return different retvals which
signal what the problem is?

> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> + mutex_lock(>lock);
> + for (i = start; i < end; i++)
> + if (xa_err(xa_store(>mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT)))
> + break;
> + mutex_unlock(>lock);
> +
> + attrs->address = i << PAGE_SHIFT;
> + attrs->size = (end - i) << PAGE_SHIFT;
> +
> + return 0;
> +}

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-19 Thread Borislav Petkov
On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> Tamping down with error number a bit:
> 
> if (attrs->flags)
> return -ENXIO;
> if (attrs->attributes & ~supported_attrs)
> return -EOPNOTSUPP;
> if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
> attrs->size == 0)
> return -EINVAL;
> if (attrs->address + attrs->size < attrs->address)
> return -E2BIG;

Yap, better.

I guess you should add those to the documentation of the ioctl too
so that people can find out why it fails. Or, well, they can look
at the code directly too but still... imagine some blurb about
user-friendliness here...

:-)

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-19 Thread Borislav Petkov
On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow

valueless?

I can't parse that.

> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.

bookmarked?

> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
> 
> The extended memslot can still have the userspace_addr(hva). When use, a

"When un use, ..."

...

> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
>   select INTERVAL_TREE
>   select HAVE_KVM_PM_NOTIFIER if PM
>   select HAVE_KVM_MEMORY_ATTRIBUTES
> + select HAVE_KVM_RESTRICTED_MEM if X86_64
> + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM

Those deps here look weird.

RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
it.

Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
X86_64 - you need that functionality when the respective guest support
is enabled in KVM.

Then, looking forward into your patchset, I'm not sure you even
need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
less Kconfig items because we have waay too many.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-20 Thread Borislav Petkov
On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

Which basically means that RESTRICTEDMEM should simply depend on KVM.
Because you can't know upfront whether KVM will run a TDX guest or a SNP
guest and so on.

Which then means that RESTRICTEDMEM will practically end up always
enabled in KVM HV configs.

> The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> that.

This is what I mean with "we have too many Kconfig items". :-\

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



[Qemu-devel] [PATCH] target-i386: Reenable RDTSCP support on Opteron_G[345] CPU models

2018-12-11 Thread Borislav Petkov via Qemu-devel
+ qemu-devel.

On Tue, Dec 11, 2018 at 03:30:17PM +, Daniel P. Berrangé wrote:
> Great, then, this is a non-issue - we just need to mention that fact
> in the commit that sets the min version for the kernel

Ok, here's a first draft ontop of Eduardo's machine-next branch from
http://github.com/ehabkost/qemu

Also, thanks for the help Eduardo! :-)

---
From: Borislav Petkov 
Date: Tue, 11 Dec 2018 17:01:00 +0100

The missing functionality was added ~3 years ago with the Linux commit

  46896c73c1a4 ("KVM: svm: add support for RDTSCP")

so reenable RDTSCP support on those CPU models.

Opteron_G2 - being family 15, model 6, doesn't have RDTSCP support
(the real hardware doesn't have it. K8 got RDTSCP support with the NPT
models, i.e., models >= 0x40).

Document the host's minimum required kernel version, while at it.

Signed-off-by: Borislav Petkov 
---
 include/hw/i386/pc.h | 17 +
 qemu-doc.texi| 13 +
 target/i386/cpu.c| 11 ---
 3 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 9d29c4b1df2a..ebc28e816b04 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -296,6 +296,19 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
 #define PC_COMPAT_3_1 \
 HW_COMPAT_3_1 \
+{\
+.driver   = "Opteron_G3" "-" TYPE_X86_CPU,\
+.property = "rdtscp",\
+.value= "off",\
+},{\
+.driver   = "Opteron_G4" "-" TYPE_X86_CPU,\
+.property = "rdtscp",\
+.value= "off",\
+},{\
+.driver   = "Opteron_G5" "-" TYPE_X86_CPU,\
+.property = "rdtscp",\
+.value= "off",\
+},
 
 #define PC_COMPAT_3_0 \
 HW_COMPAT_3_0 \
@@ -527,10 +540,6 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 .driver   = "qemu32" "-" TYPE_X86_CPU,\
 .property = "popcnt",\
 .value= "on",\
-},{\
-.driver   = "Opteron_G2" "-" TYPE_X86_CPU,\
-.property = "rdtscp",\
-.value= "on",\
 },{\
 .driver   = "Opteron_G3" "-" TYPE_X86_CPU,\
 .property = "rdtscp",\
diff --git a/qemu-doc.texi b/qemu-doc.texi
index f7ad1dfe4b69..16b955cbf985 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -37,6 +37,7 @@
 * QEMU System emulator for non PC targets::
 * QEMU Guest Agent::
 * QEMU User space emulator::
+* System requirements::
 * Implementation notes::
 * Deprecated features::
 * Supported build platforms::
@@ -2813,6 +2814,18 @@ Act as if the host page size was 'pagesize' bytes
 Run the emulation in single step mode.
 @end table
 
+@node System requirements
+@chapter System requirements
+
+@section KVM kernel module
+
+On x86_64 hosts, the default set of CPU features enabled by the KVM accelerator
+require the host to be running Linux v4.5 or newer.
+
+The OpteronG[345] CPU models require KVM support for RDTSCP, which was
+added with Linux 4.5 which is supported by the major distros. And even
+if RHEL7 has kernel 3.10, KVM there has the required functionality there
+to make it close to a 4.5 or newer kernel.
 
 @include qemu-tech.texi
 
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index f81d35e1f914..a7def11b27cd 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -2706,7 +2706,6 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_DE | CPUID_FP87,
 .features[FEAT_1_ECX] =
 CPUID_EXT_CX16 | CPUID_EXT_SSE3,
-/* Missing: CPUID_EXT2_RDTSCP */
 .features[FEAT_8000_0001_EDX] =
 CPUID_EXT2_LM | CPUID_EXT2_NX | CPUID_EXT2_SYSCALL,
 .features[FEAT_8000_0001_ECX] =
@@ -2730,9 +2729,9 @@ static X86CPUDefinition builtin_x86_defs[] = {
 .features[FEAT_1_ECX] =
 CPUID_EXT_POPCNT | CPUID_EXT_CX16 | CPUID_EXT_MONITOR |
 CPUID_EXT_SSE3,
-/* Missing: CPUID_EXT2_RDTSCP */
 .features[FEAT_8000_0001_EDX] =
-CPUID_EXT2_LM | CPUID_EXT2_NX | CPUID_EXT2_SYSCALL,
+CPUID_EXT2_LM | CPUID_EXT2_NX | CPUID_EXT2_SYSCALL |
+CPUID_EXT2_RDTSCP,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_MISALIGNSSE | CPUID_EXT3_SSE4A |
 CPUID_EXT3_ABM | CPUID_EXT3_SVM | CPUID_EXT3_LAHF_LM,
@@ -2757,10 +2756,9 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT_POPCNT | CPUID_EXT_SSE42 | CPUID_EXT_SSE41 |
 CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_PCLMULQDQ |
 CPUID_EXT_SSE3,
-/* Missing: CPUID_EXT2_RDTSCP */
 .features[FEAT_8000_0001_EDX] =
 CPUID_EXT2_LM | CPUID_EXT2_PDPE1GB | CPUID_EXT2_NX |
-CPUID_EXT2_SYSCALL,
+CPUID_EXT2_SYSCALL | CPUID_EXT2_RDTSCP,
 

[Qemu-devel] [PATCH -v2] target-i386: Reenable RDTSCP support on Opteron_G[345] CPU models CPU models

2018-12-12 Thread Borislav Petkov via Qemu-devel
On Wed, Dec 12, 2018 at 05:52:35PM -0200, Eduardo Habkost wrote:
> Why did you remove this entry from PC_COMPAT_2_4?
> 
> We must keep compatibility with old behavior of Opteron_G2 on
> pc-2.4, even if the old behavior was incorrect.

Ok, hunk reverted. v2 below.

Thx.

---
From: Borislav Petkov 

The missing functionality was added ~3 years ago with the Linux commit

  46896c73c1a4 ("KVM: svm: add support for RDTSCP")

so reenable RDTSCP support on those CPU models.

Opteron_G2 - being family 15, model 6, doesn't have RDTSCP support
(the real hardware doesn't have it. K8 got RDTSCP support with the NPT
models, i.e., models >= 0x40).

Document the host's minimum required kernel version, while at it.

Signed-off-by: Borislav Petkov 
---
v2: Keep Opteron_G2 in PC_COMPAT_2_4 unchanged.

 include/hw/i386/pc.h | 13 +
 qemu-doc.texi| 13 +
 target/i386/cpu.c| 11 ---
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 9d29c4b1df2a..236d962d2547 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -296,6 +296,19 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
 #define PC_COMPAT_3_1 \
 HW_COMPAT_3_1 \
+{\
+.driver   = "Opteron_G3" "-" TYPE_X86_CPU,\
+.property = "rdtscp",\
+.value= "off",\
+},{\
+.driver   = "Opteron_G4" "-" TYPE_X86_CPU,\
+.property = "rdtscp",\
+.value= "off",\
+},{\
+.driver   = "Opteron_G5" "-" TYPE_X86_CPU,\
+.property = "rdtscp",\
+.value= "off",\
+},
 
 #define PC_COMPAT_3_0 \
 HW_COMPAT_3_0 \
diff --git a/qemu-doc.texi b/qemu-doc.texi
index f7ad1dfe4b69..16b955cbf985 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -37,6 +37,7 @@
 * QEMU System emulator for non PC targets::
 * QEMU Guest Agent::
 * QEMU User space emulator::
+* System requirements::
 * Implementation notes::
 * Deprecated features::
 * Supported build platforms::
@@ -2813,6 +2814,18 @@ Act as if the host page size was 'pagesize' bytes
 Run the emulation in single step mode.
 @end table
 
+@node System requirements
+@chapter System requirements
+
+@section KVM kernel module
+
+On x86_64 hosts, the default set of CPU features enabled by the KVM accelerator
+require the host to be running Linux v4.5 or newer.
+
+The OpteronG[345] CPU models require KVM support for RDTSCP, which was
+added with Linux 4.5 which is supported by the major distros. And even
+if RHEL7 has kernel 3.10, KVM there has the required functionality there
+to make it close to a 4.5 or newer kernel.
 
 @include qemu-tech.texi
 
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index f81d35e1f914..a7def11b27cd 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -2706,7 +2706,6 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_DE | CPUID_FP87,
 .features[FEAT_1_ECX] =
 CPUID_EXT_CX16 | CPUID_EXT_SSE3,
-/* Missing: CPUID_EXT2_RDTSCP */
 .features[FEAT_8000_0001_EDX] =
 CPUID_EXT2_LM | CPUID_EXT2_NX | CPUID_EXT2_SYSCALL,
 .features[FEAT_8000_0001_ECX] =
@@ -2730,9 +2729,9 @@ static X86CPUDefinition builtin_x86_defs[] = {
 .features[FEAT_1_ECX] =
 CPUID_EXT_POPCNT | CPUID_EXT_CX16 | CPUID_EXT_MONITOR |
 CPUID_EXT_SSE3,
-/* Missing: CPUID_EXT2_RDTSCP */
 .features[FEAT_8000_0001_EDX] =
-CPUID_EXT2_LM | CPUID_EXT2_NX | CPUID_EXT2_SYSCALL,
+CPUID_EXT2_LM | CPUID_EXT2_NX | CPUID_EXT2_SYSCALL |
+CPUID_EXT2_RDTSCP,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_MISALIGNSSE | CPUID_EXT3_SSE4A |
 CPUID_EXT3_ABM | CPUID_EXT3_SVM | CPUID_EXT3_LAHF_LM,
@@ -2757,10 +2756,9 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT_POPCNT | CPUID_EXT_SSE42 | CPUID_EXT_SSE41 |
 CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_PCLMULQDQ |
 CPUID_EXT_SSE3,
-/* Missing: CPUID_EXT2_RDTSCP */
 .features[FEAT_8000_0001_EDX] =
 CPUID_EXT2_LM | CPUID_EXT2_PDPE1GB | CPUID_EXT2_NX |
-CPUID_EXT2_SYSCALL,
+CPUID_EXT2_SYSCALL | CPUID_EXT2_RDTSCP,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_FMA4 | CPUID_EXT3_XOP |
 CPUID_EXT3_3DNOWPREFETCH | CPUID_EXT3_MISALIGNSSE |
@@ -2788,10 +2786,9 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT_AES | CPUID_EXT_POPCNT | CPUID_EXT_SSE42 |
 CPUID_EXT_SSE41 | CPUID_EXT_CX16 | CPUID_EXT_FMA |
 CPUID_EXT_SSSE3 | CPUID_EXT_PCLMULQDQ | CPUID_EXT_SSE3,
-/* Missing: CPUID_EXT2_RDTSCP */
 .features[FEAT_8000_0001_EDX] =
 CPUID_EXT2_LM | CPUID_EXT2_PDPE1GB | CPUID_EXT2_NX |
-