Re: How to reserve guest physical region for ACPI

2016-01-06 Thread Laszlo Ersek
On 01/06/16 14:39, Igor Mammedov wrote:
> On Tue, 5 Jan 2016 18:22:33 +0100
> Laszlo Ersek <ler...@redhat.com> wrote:
> 
>> On 01/05/16 18:08, Igor Mammedov wrote:
>>> On Mon, 4 Jan 2016 21:17:31 +0100
>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>   
>>>> Michael CC'd me on the grandparent of the email below. I'll try to add
>>>> my thoughts in a single go, with regard to OVMF.
>>>>
>>>> On 12/30/15 20:52, Michael S. Tsirkin wrote:  
>>>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
>>>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>>>> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>>>>>>
>>>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
>>>>>>>>
>>>>>>>> Hi Michael, Paolo,
>>>>>>>>
>>>>>>>> Now it is the time to return to the challenge that how to reserve guest
>>>>>>>> physical region internally used by ACPI.
>>>>>>>>
>>>>>>>> Igor suggested that:
>>>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
>>>>>>>> 
>>>>
>>>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
>>>> reason is that nobody wrote that patch, nor asked for the patch to be
>>>> written. (Not implying that just requesting the patch would be
>>>> sufficient for the patch to be written.)
>>>>  
>>>>>>> I don't want to tie things to reserved-memory-end because this
>>>>>>> does not scale: next time we need to reserve memory,
>>>>>>> we'll need to find yet another way to figure out what is where.
>>>>>> Could you elaborate a bit more on a problem you're seeing?
>>>>>>
>>>>>> To me it looks like it scales rather well.
>>>>>> For example lets imagine that we adding a device
>>>>>> that has some on device memory that should be mapped into GPA
>>>>>> code to do so would look like:
>>>>>>
>>>>>>   pc_machine_device_plug_cb(dev)
>>>>>>   {
>>>>>>...
>>>>>>if (dev == OUR_NEW_DEVICE_TYPE) {
>>>>>>memory_region_add_subregion(as, current_reserved_end, >mr);
>>>>>>set_new_reserved_end(current_reserved_end + 
>>>>>> memory_region_size(>mr));
>>>>>>}
>>>>>>   }
>>>>>>
>>>>>> we can practically add any number of new devices that way.
>>>>>
>>>>> Yes but we'll have to build a host side allocator for these, and that's
>>>>> nasty. We'll also have to maintain these addresses indefinitely (at
>>>>> least per machine version) as they are guest visible.
>>>>> Not only that, there's no way for guest to know if we move things
>>>>> around, so basically we'll never be able to change addresses.
>>>>>
>>>>> 
>>>>>>  
>>>>>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
>>>>>>> support 64 bit RAM instead
>>>>
>>>> This looks quite doable in OVMF, as long as the blob to allocate from
>>>> high memory contains *zero* ACPI tables.
>>>>
>>>> (
>>>> Namely, each ACPI table is installed from the containing fw_cfg blob
>>>> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
>>>> own allocation policy for the *copies* of ACPI tables it installs.
>>>>
>>>> This allocation policy is left unspecified in the section of the UEFI
>>>> spec that governs EFI_ACPI_TABLE_PROTOCOL.
>>>>
>>>> The current policy in edk2 (= the reference implementation) seems to be
>>>> "allocate from under 4GB". It is currently being changed to "try to
>>>> allocate from under 4GB, and if that fails, retry from high memory". (It
>>>> is motivated b

Re: How to reserve guest physical region for ACPI

2016-01-05 Thread Laszlo Ersek
On 01/05/16 18:08, Igor Mammedov wrote:
> On Mon, 4 Jan 2016 21:17:31 +0100
> Laszlo Ersek <ler...@redhat.com> wrote:
> 
>> Michael CC'd me on the grandparent of the email below. I'll try to add
>> my thoughts in a single go, with regard to OVMF.
>>
>> On 12/30/15 20:52, Michael S. Tsirkin wrote:
>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>>>>  
>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
>>>>>>
>>>>>> Hi Michael, Paolo,
>>>>>>
>>>>>> Now it is the time to return to the challenge that how to reserve guest
>>>>>> physical region internally used by ACPI.
>>>>>>
>>>>>> Igor suggested that:
>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html) 
>>>>>>  
>>
>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
>> reason is that nobody wrote that patch, nor asked for the patch to be
>> written. (Not implying that just requesting the patch would be
>> sufficient for the patch to be written.)
>>
>>>>> I don't want to tie things to reserved-memory-end because this
>>>>> does not scale: next time we need to reserve memory,
>>>>> we'll need to find yet another way to figure out what is where.  
>>>> Could you elaborate a bit more on a problem you're seeing?
>>>>
>>>> To me it looks like it scales rather well.
>>>> For example lets imagine that we adding a device
>>>> that has some on device memory that should be mapped into GPA
>>>> code to do so would look like:
>>>>
>>>>   pc_machine_device_plug_cb(dev)
>>>>   {
>>>>...
>>>>if (dev == OUR_NEW_DEVICE_TYPE) {
>>>>memory_region_add_subregion(as, current_reserved_end, >mr);
>>>>set_new_reserved_end(current_reserved_end + 
>>>> memory_region_size(>mr));
>>>>}
>>>>   }
>>>>
>>>> we can practically add any number of new devices that way.  
>>>
>>> Yes but we'll have to build a host side allocator for these, and that's
>>> nasty. We'll also have to maintain these addresses indefinitely (at
>>> least per machine version) as they are guest visible.
>>> Not only that, there's no way for guest to know if we move things
>>> around, so basically we'll never be able to change addresses.
>>>
>>>   
>>>>
>>>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
>>>>> support 64 bit RAM instead  
>>
>> This looks quite doable in OVMF, as long as the blob to allocate from
>> high memory contains *zero* ACPI tables.
>>
>> (
>> Namely, each ACPI table is installed from the containing fw_cfg blob
>> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
>> own allocation policy for the *copies* of ACPI tables it installs.
>>
>> This allocation policy is left unspecified in the section of the UEFI
>> spec that governs EFI_ACPI_TABLE_PROTOCOL.
>>
>> The current policy in edk2 (= the reference implementation) seems to be
>> "allocate from under 4GB". It is currently being changed to "try to
>> allocate from under 4GB, and if that fails, retry from high memory". (It
>> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
>> )
>>
>>>>> (and maybe a way to allocate and
>>>>> zero-initialize buffer without loading it through fwcfg),  
>>
>> Sounds reasonable.
>>
>>>>> this way bios
>>>>> does the allocation, and addresses can be patched into acpi.  
>>>> and then guest side needs to parse/execute some AML that would
>>>> initialize QEMU side so it would know where to write data.  
>>>
>>> Well not really - we can put it in a data table, by itself
>>> so it's easy to find.  
>>
>> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
>> acpi_get_table_with_size()?
>>
>>&

Re: How to reserve guest physical region for ACPI

2016-01-05 Thread Laszlo Ersek
On 01/05/16 17:43, Michael S. Tsirkin wrote:
> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
 bios-linker-loader is a great interface for initializing some
 guest owned data and linking it together but I think it adds
 unnecessary complexity and is misused if it's used to handle
 device owned data/on device memory in this and VMGID cases.  
>>>
>>> I want a generic interface for guest to enumerate these things.  linker
>>> seems quite reasonable but if you see a reason why it won't do, or want
>>> to propose a better interface, fine.
>>>
>>> PCI would do, too - though windows guys had concerns about
>>> returning PCI BARs from ACPI.
>> There were potential issues with pSeries bootloader that treated
>> PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
>> Could you point out to discussion about windows issues?
>>
>> What VMGEN patches that used PCI for mapping purposes were
>> stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
>> class id but we couldn't agree on it.
>>
>> VMGEN v13 with full discussion is here
>> https://patchwork.ozlabs.org/patch/443554/
>> So to continue with this route we would need to pick some other
>> driver less class id so windows won't prompt for driver or
>> maybe supply our own driver stub to guarantee that no one
>> would touch it. Any suggestions?
> 
> Pick any device/vendor id pair for which windows specifies no driver.
> There's a small risk that this will conflict with some
> guest but I think it's minimal.
> 
> 
>>>
>>>
 There was RFC on list to make BIOS boot from NVDIMM already
 doing some ACPI table lookup/parsing. Now if they were forced
 to also parse and execute AML to initialize QEMU with guest
 allocated address that would complicate them quite a bit.  
>>>
>>> If they just need to find a table by name, it won't be
>>> too bad, would it?
>> that's what they were doing scanning memory for static NVDIMM table.
>> However if it were DataTable, BIOS side would have to execute
>> AML so that the table address could be told to QEMU.
> 
> Not at all. You can find any table by its signature without
> parsing AML.
> 
> 
>> In case of direct mapping or PCI BAR there is no need to initialize
>> QEMU side from AML.
>> That also saves us IO port where this address should be written
>> if bios-linker-loader approach is used.
>>
>>>
 While with NVDIMM control memory region mapped directly by QEMU,
 respective patches don't need in any way to initialize QEMU,
 all they would need just read necessary data from control region.

 Also using bios-linker-loader takes away some usable RAM
 from guest and in the end that doesn't scale,
 the more devices I add the less usable RAM is left for guest OS
 while all the device needs is a piece of GPA address space
 that would belong to it.  
>>>
>>> I don't get this comment. I don't think it's MMIO that is wanted.
>>> If it's backed by qemu virtual memory then it's RAM.
>> Then why don't allocate video card VRAM the same way and try to explain
>> user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
>> only has 64Mb of available RAM because of we think that on device VRAM
>> is also RAM.
>>
>> Maybe I've used MMIO term wrongly here but it roughly reflects the idea
>> that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
>> area) is not allocated from guest's usable RAM (as described in E820)
>> but rather directly mapped in guest's GPA and doesn't consume available
>> RAM as guest sees it. That's also the way it's done on real hardware.
>>
>> What we need in case of VMGEN ID and NVDIMM is on device memory
>> that could be directly accessed by guest.
>> Both direct mapping or PCI BAR do that job and we could use simple
>> static AML without any patching.
> 
> At least with VMGEN the issue is that there's an AML method
> that returns the physical address.
> Then if guest OS moves the BAR (which is legal), it will break
> since caller has no way to know it's related to the BAR.
> 
> 
>
> See patch at the bottom that might be handy.
>   
>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
>> | when writing ASL one shall make sure that only XP supported
>> | features are in global scope, which is evaluated when tables
>> | are loaded and features of rev2 and higher are inside methods.
>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
>> | opcode and one can guard those opcodes checking _REV object if 
>> neccesary.
>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html) 
>>  
>
> Yes, this technique works.
>
> An alternative is to add an XSDT, XP ignores that.
> XSDT at the moment breaks OVMF (because it loads both
> the RSDT and the XSDT, which is wrong), but I think
> Laszlo was working on a fix for that.  
 Using XSDT would increase ACPI tables occupied RAM
 

Re: How to reserve guest physical region for ACPI

2016-01-04 Thread Laszlo Ersek
Michael CC'd me on the grandparent of the email below. I'll try to add
my thoughts in a single go, with regard to OVMF.

On 12/30/15 20:52, Michael S. Tsirkin wrote:
> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
>> On Mon, 28 Dec 2015 14:50:15 +0200
>> "Michael S. Tsirkin"  wrote:
>>
>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:

 Hi Michael, Paolo,

 Now it is the time to return to the challenge that how to reserve guest
 physical region internally used by ACPI.

 Igor suggested that:
 | An alternative place to allocate reserve from could be high memory.
 | For pc we have "reserved-memory-end" which currently makes sure
 | that hotpluggable memory range isn't used by firmware
 (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)

OVMF has no support for the "reserved-memory-end" fw_cfg file. The
reason is that nobody wrote that patch, nor asked for the patch to be
written. (Not implying that just requesting the patch would be
sufficient for the patch to be written.)

>>> I don't want to tie things to reserved-memory-end because this
>>> does not scale: next time we need to reserve memory,
>>> we'll need to find yet another way to figure out what is where.
>> Could you elaborate a bit more on a problem you're seeing?
>>
>> To me it looks like it scales rather well.
>> For example lets imagine that we adding a device
>> that has some on device memory that should be mapped into GPA
>> code to do so would look like:
>>
>>   pc_machine_device_plug_cb(dev)
>>   {
>>...
>>if (dev == OUR_NEW_DEVICE_TYPE) {
>>memory_region_add_subregion(as, current_reserved_end, >mr);
>>set_new_reserved_end(current_reserved_end + 
>> memory_region_size(>mr));
>>}
>>   }
>>
>> we can practically add any number of new devices that way.
> 
> Yes but we'll have to build a host side allocator for these, and that's
> nasty. We'll also have to maintain these addresses indefinitely (at
> least per machine version) as they are guest visible.
> Not only that, there's no way for guest to know if we move things
> around, so basically we'll never be able to change addresses.
> 
> 
>>  
>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
>>> support 64 bit RAM instead

This looks quite doable in OVMF, as long as the blob to allocate from
high memory contains *zero* ACPI tables.

(
Namely, each ACPI table is installed from the containing fw_cfg blob
with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
own allocation policy for the *copies* of ACPI tables it installs.

This allocation policy is left unspecified in the section of the UEFI
spec that governs EFI_ACPI_TABLE_PROTOCOL.

The current policy in edk2 (= the reference implementation) seems to be
"allocate from under 4GB". It is currently being changed to "try to
allocate from under 4GB, and if that fails, retry from high memory". (It
is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
)

>>> (and maybe a way to allocate and
>>> zero-initialize buffer without loading it through fwcfg),

Sounds reasonable.

>>> this way bios
>>> does the allocation, and addresses can be patched into acpi.
>> and then guest side needs to parse/execute some AML that would
>> initialize QEMU side so it would know where to write data.
> 
> Well not really - we can put it in a data table, by itself
> so it's easy to find.

Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
acpi_get_table_with_size()?

> 
> AML is only needed if access from ACPI is desired.
> 
> 
>> bios-linker-loader is a great interface for initializing some
>> guest owned data and linking it together but I think it adds
>> unnecessary complexity and is misused if it's used to handle
>> device owned data/on device memory in this and VMGID cases.
> 
> I want a generic interface for guest to enumerate these things.  linker
> seems quite reasonable but if you see a reason why it won't do, or want
> to propose a better interface, fine.

* The guest could do the following:
- while processing the ALLOCATE commands, it would make a note where in
GPA space each fw_cfg blob gets allocated
- at the end the guest would prepare a temporary array with a predefined
record format, that associates each fw_cfg blob's name with the concrete
allocation address
- it would create an FWCfgDmaAccess stucture pointing at this array,
with a new "control" bit set (or something similar)
- the guest could write the address of the FWCfgDmaAccess struct to the
appropriate register, as always.

* Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command,
specifying:
- the fw_cfg blob's name, for which to retrieve the guest-allocated
  address (this command could only follow the matching ALLOCATE
  command, never precede it)
- a flag whether the address should be written to IO or MMIO space
  (would be likely IO on x86, MMIO on ARM)
- a unique 

Re: [PATCH 0/3] KVM: x86: simplify RSM into 64-bit protected mode

2015-11-03 Thread Laszlo Ersek
On 11/02/15 10:32, Paolo Bonzini wrote:
> 
> 
> On 31/10/2015 20:50, Laszlo Ersek wrote:
>> Tested-by: Laszlo Ersek <ler...@redhat.com>
> 
> Thanks Laszlo, I applied patches 1 and 2 (since your "part 2" never was :)).
> 
> Paolo
> 

Thanks.

Since you can rebase the queue freely, can you please also add:

Reported-by: Laszlo Ersek <ler...@redhat.com>

to Radim's patch "KVM: x86: handle SMBASE as physical address in RSM"?

Thanks
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: allow RSM from 64-bit mode

2015-11-03 Thread Laszlo Ersek
On 11/03/15 15:04, Paolo Bonzini wrote:
> 
> 
> On 03/11/2015 15:02, Laszlo Ersek wrote:
>> On 11/03/15 14:46, Paolo Bonzini wrote:
>>>
>>>
>>> On 03/11/2015 14:40, Laszlo Ersek wrote:
>>>> On 11/03/15 14:29, Paolo Bonzini wrote:
>>>>> The SDM says that exiting system management mode from 64-bit mode
>>>>> is invalid, but that would be too good to be true.  But actually,
>>>>> most of the code is already there to support exiting from compat
>>>>> mode (EFER.LME=1, EFER.LMA=0).  Getting all the way from 64-bit
>>>>> mode to real mode only requires clearing CS.L and CR4.PCIDE.
>>>>>
>>>>> Cc: sta...@vger.kernel.org
>>>>> Fixes: 660a5d517aaab9187f93854425c4c63f4a09195c
>>>>> Cc: Laszlo Ersek <ler...@redhat.com>
>>>>> Cc: Radim Krčmář <rkrc...@redhat.com>
>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>>>>> ---
>>>>>  arch/x86/kvm/emulate.c | 30 +-
>>>>>  1 file changed, 25 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>>>>> index b60fed56671b..1505587d06e9 100644
>>>>> --- a/arch/x86/kvm/emulate.c
>>>>> +++ b/arch/x86/kvm/emulate.c
>>>>> @@ -2484,16 +2484,36 @@ static int em_rsm(struct x86_emulate_ctxt *ctxt)
>>>>>  
>>>>>   /*
>>>>>* Get back to real mode, to prepare a safe state in which to load
>>>>> -  * CR0/CR3/CR4/EFER.
>>>>> -  *
>>>>> -  * CR4.PCIDE must be zero, because it is a 64-bit mode only feature.
>>>>> +  * CR0/CR3/CR4/EFER.  It's all a bit more complicated if the vCPU
>>>>> +  * supports long mode.
>>>>>*/
>>>>> + cr4 = ctxt->ops->get_cr(ctxt, 4);
>>>>> + if (emulator_has_longmode(ctxt)) {
>>>>> + struct desc_struct cs_desc;
>>>>> +
>>>>> + /* Zero CR4.PCIDE before CR0.PG.  */
>>>>> + if (cr4 & X86_CR4_PCIDE) {
>>>>> + ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
>>>>> + cr4 &= ~X86_CR4_PCIDE;
>>>>> + }
>>>>> +
>>>>> + /* A 32-bit code segment is required to clear EFER.LMA.  */
>>>>> + memset(_desc, 0, sizeof(cs_desc));
>>>>> + cs_desc.type = 0xb;
>>>>> + cs_desc.s = cs_desc.g = cs_desc.p = 1;
>>>>> + ctxt->ops->set_segment(ctxt, 0, _desc, 0, VCPU_SREG_CS);
>>>>> + }
>>>>> +
>>>>> + /* For the 64-bit case, this will clear EFER.LMA.  */
>>>>>   cr0 = ctxt->ops->get_cr(ctxt, 0);
>>>>>   if (cr0 & X86_CR0_PE)
>>>>>   ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
>>>>> - cr4 = ctxt->ops->get_cr(ctxt, 4);
>>>>> +
>>>>> + /* Now clear CR4.PAE (which must be done before clearing EFER.LME).  */
>>>>>   if (cr4 & X86_CR4_PAE)
>>>>>   ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PAE);
>>>>> +
>>>>> + /* And finally go back to 32-bit mode.  */
>>>>>   efer = 0;
>>>>>   ctxt->ops->set_msr(ctxt, MSR_EFER, efer);
>>>>>  
>>>>> @@ -4454,7 +4474,7 @@ static const struct opcode twobyte_table[256] = {
>>>>>   F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N,
>>>>>   /* 0xA8 - 0xAF */
>>>>>   I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
>>>>> - II(No64 | EmulateOnUD | ImplicitOps, em_rsm, rsm),
>>>>> + II(EmulateOnUD | ImplicitOps, em_rsm, rsm),
>>>>>   F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
>>>>>   F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
>>>>>   F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
>>>>>
>>>>
>>>> What branch should I test this on top of?
>>>
>>> Just use whatever you were using before, and revert commit c9db607
>>> ("UefiCpuPkg: PiSmmCpuDxeSmm: do not execute RSM from 64-bit mode",
>>> 2015-10-14) from your OVMF branch.
>>
>> Right, I planned to do that OVMF-side revert; I just wasn't sure if e.g.
>> kvm/queue had some prerequisite patches for this.
> 
> Indeed, you can use either your "part 2" series or Radim's patches from
> kvm/queue, it's the same.

I noticed that you applied this patch to kvm/queue, at
9f64d2c75fa6a5aac0a5657400f3473f8144c3be. So I simply tested kvm/queue
in that state. (I did not forget to drop the workaround OVMF patch first.)

Tested-by: Laszlo Ersek <ler...@redhat.com>

Thank you!
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()

2015-11-03 Thread Laszlo Ersek
On 11/03/15 19:34, Laszlo Ersek wrote:
> Commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization") was
> technically correct, but it broke OVMF guests by slowing down various
> parts of the firmware.
> 
> Commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") quirked the
> first function modified by b18d5431acc7, vmx_get_mt_mask(), for OVMF's
> sake. This restored the speed of the OVMF code that runs before
> PlatformPei (including the memory intensive LZMA decompression in SEC).
> 
> This patch extends the quirk to the second function modified by
> b18d5431acc7, kvm_set_cr0(). It eliminates the intrusive slowdown that
> hits the EFI_MP_SERVICES_PROTOCOL implementation of edk2's
> UefiCpuPkg/CpuDxe -- which is built into OVMF --, when CpuDxe starts up
> all APs at once for initialization, in order to count them.
> 
> We also carry over the kvm_arch_has_noncoherent_dma() sub-condition from
> the other half of the original commit b18d5431acc7.
> 
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Jordan Justen <jordan.l.jus...@intel.com>
> Cc: Janusz Mocek <janusz...@gmail.com>
> Cc: Alex Williamson <alex.william...@redhat.com>
> Cc: Xiao Guangrong <guangrong.x...@linux.intel.com>
> Signed-off-by: Laszlo Ersek <ler...@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a24bae0..30723a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -625,7 +625,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>   if ((cr0 ^ old_cr0) & update_bits)
>   kvm_mmu_reset_context(vcpu);
>  
> - if ((cr0 ^ old_cr0) & X86_CR0_CD)
> + if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> + kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> + !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>   kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>  
>   return 0;
> 


I had notes on this patch, but I forgot to format it with --notes. They
were:

- People on the CC list, please reply with your Tested-by, Reported-by,
  etc tags as appropriate; it's getting blurry who participated in what
  and how.

- This patch is *not* necessary for the OVMF SMM work; instead it
  addresses an independent OVMF boot regression seen by users.

Thanks
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()

2015-11-03 Thread Laszlo Ersek
Commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization") was
technically correct, but it broke OVMF guests by slowing down various
parts of the firmware.

Commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") quirked the
first function modified by b18d5431acc7, vmx_get_mt_mask(), for OVMF's
sake. This restored the speed of the OVMF code that runs before
PlatformPei (including the memory intensive LZMA decompression in SEC).

This patch extends the quirk to the second function modified by
b18d5431acc7, kvm_set_cr0(). It eliminates the intrusive slowdown that
hits the EFI_MP_SERVICES_PROTOCOL implementation of edk2's
UefiCpuPkg/CpuDxe -- which is built into OVMF --, when CpuDxe starts up
all APs at once for initialization, in order to count them.

We also carry over the kvm_arch_has_noncoherent_dma() sub-condition from
the other half of the original commit b18d5431acc7.

Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Jordan Justen <jordan.l.jus...@intel.com>
Cc: Janusz Mocek <janusz...@gmail.com>
Cc: Alex Williamson <alex.william...@redhat.com>
Cc: Xiao Guangrong <guangrong.x...@linux.intel.com>
Signed-off-by: Laszlo Ersek <ler...@redhat.com>
---
 arch/x86/kvm/x86.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a24bae0..30723a4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -625,7 +625,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
if ((cr0 ^ old_cr0) & update_bits)
kvm_mmu_reset_context(vcpu);
 
-   if ((cr0 ^ old_cr0) & X86_CR0_CD)
+   if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
+   kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
+   !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 
return 0;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: allow RSM from 64-bit mode

2015-11-03 Thread Laszlo Ersek
On 11/03/15 14:46, Paolo Bonzini wrote:
> 
> 
> On 03/11/2015 14:40, Laszlo Ersek wrote:
>> On 11/03/15 14:29, Paolo Bonzini wrote:
>>> The SDM says that exiting system management mode from 64-bit mode
>>> is invalid, but that would be too good to be true.  But actually,
>>> most of the code is already there to support exiting from compat
>>> mode (EFER.LME=1, EFER.LMA=0).  Getting all the way from 64-bit
>>> mode to real mode only requires clearing CS.L and CR4.PCIDE.
>>>
>>> Cc: sta...@vger.kernel.org
>>> Fixes: 660a5d517aaab9187f93854425c4c63f4a09195c
>>> Cc: Laszlo Ersek <ler...@redhat.com>
>>> Cc: Radim Krčmář <rkrc...@redhat.com>
>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>>> ---
>>>  arch/x86/kvm/emulate.c | 30 +-
>>>  1 file changed, 25 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>>> index b60fed56671b..1505587d06e9 100644
>>> --- a/arch/x86/kvm/emulate.c
>>> +++ b/arch/x86/kvm/emulate.c
>>> @@ -2484,16 +2484,36 @@ static int em_rsm(struct x86_emulate_ctxt *ctxt)
>>>  
>>> /*
>>>  * Get back to real mode, to prepare a safe state in which to load
>>> -* CR0/CR3/CR4/EFER.
>>> -*
>>> -* CR4.PCIDE must be zero, because it is a 64-bit mode only feature.
>>> +* CR0/CR3/CR4/EFER.  It's all a bit more complicated if the vCPU
>>> +* supports long mode.
>>>  */
>>> +   cr4 = ctxt->ops->get_cr(ctxt, 4);
>>> +   if (emulator_has_longmode(ctxt)) {
>>> +   struct desc_struct cs_desc;
>>> +
>>> +   /* Zero CR4.PCIDE before CR0.PG.  */
>>> +   if (cr4 & X86_CR4_PCIDE) {
>>> +   ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
>>> +   cr4 &= ~X86_CR4_PCIDE;
>>> +   }
>>> +
>>> +   /* A 32-bit code segment is required to clear EFER.LMA.  */
>>> +   memset(_desc, 0, sizeof(cs_desc));
>>> +   cs_desc.type = 0xb;
>>> +   cs_desc.s = cs_desc.g = cs_desc.p = 1;
>>> +   ctxt->ops->set_segment(ctxt, 0, _desc, 0, VCPU_SREG_CS);
>>> +   }
>>> +
>>> +   /* For the 64-bit case, this will clear EFER.LMA.  */
>>> cr0 = ctxt->ops->get_cr(ctxt, 0);
>>> if (cr0 & X86_CR0_PE)
>>> ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
>>> -   cr4 = ctxt->ops->get_cr(ctxt, 4);
>>> +
>>> +   /* Now clear CR4.PAE (which must be done before clearing EFER.LME).  */
>>> if (cr4 & X86_CR4_PAE)
>>> ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PAE);
>>> +
>>> +   /* And finally go back to 32-bit mode.  */
>>> efer = 0;
>>> ctxt->ops->set_msr(ctxt, MSR_EFER, efer);
>>>  
>>> @@ -4454,7 +4474,7 @@ static const struct opcode twobyte_table[256] = {
>>> F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N,
>>> /* 0xA8 - 0xAF */
>>> I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
>>> -   II(No64 | EmulateOnUD | ImplicitOps, em_rsm, rsm),
>>> +   II(EmulateOnUD | ImplicitOps, em_rsm, rsm),
>>> F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
>>> F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
>>> F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
>>>
>>
>> What branch should I test this on top of?
> 
> Just use whatever you were using before, and revert commit c9db607
> ("UefiCpuPkg: PiSmmCpuDxeSmm: do not execute RSM from 64-bit mode",
> 2015-10-14) from your OVMF branch.

Right, I planned to do that OVMF-side revert; I just wasn't sure if e.g.
kvm/queue had some prerequisite patches for this.

> This is how I tested it, in fact.

I'll try to report back soon.

Thanks!
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: allow RSM from 64-bit mode

2015-11-03 Thread Laszlo Ersek
On 11/03/15 14:29, Paolo Bonzini wrote:
> The SDM says that exiting system management mode from 64-bit mode
> is invalid, but that would be too good to be true.  But actually,
> most of the code is already there to support exiting from compat
> mode (EFER.LME=1, EFER.LMA=0).  Getting all the way from 64-bit
> mode to real mode only requires clearing CS.L and CR4.PCIDE.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 660a5d517aaab9187f93854425c4c63f4a09195c
> Cc: Laszlo Ersek <ler...@redhat.com>
> Cc: Radim Krčmář <rkrc...@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  arch/x86/kvm/emulate.c | 30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index b60fed56671b..1505587d06e9 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -2484,16 +2484,36 @@ static int em_rsm(struct x86_emulate_ctxt *ctxt)
>  
>   /*
>* Get back to real mode, to prepare a safe state in which to load
> -  * CR0/CR3/CR4/EFER.
> -  *
> -  * CR4.PCIDE must be zero, because it is a 64-bit mode only feature.
> +  * CR0/CR3/CR4/EFER.  It's all a bit more complicated if the vCPU
> +  * supports long mode.
>*/
> + cr4 = ctxt->ops->get_cr(ctxt, 4);
> + if (emulator_has_longmode(ctxt)) {
> + struct desc_struct cs_desc;
> +
> + /* Zero CR4.PCIDE before CR0.PG.  */
> + if (cr4 & X86_CR4_PCIDE) {
> + ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
> + cr4 &= ~X86_CR4_PCIDE;
> + }
> +
> + /* A 32-bit code segment is required to clear EFER.LMA.  */
> + memset(_desc, 0, sizeof(cs_desc));
> + cs_desc.type = 0xb;
> + cs_desc.s = cs_desc.g = cs_desc.p = 1;
> + ctxt->ops->set_segment(ctxt, 0, _desc, 0, VCPU_SREG_CS);
> + }
> +
> + /* For the 64-bit case, this will clear EFER.LMA.  */
>   cr0 = ctxt->ops->get_cr(ctxt, 0);
>   if (cr0 & X86_CR0_PE)
>   ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
> - cr4 = ctxt->ops->get_cr(ctxt, 4);
> +
> + /* Now clear CR4.PAE (which must be done before clearing EFER.LME).  */
>   if (cr4 & X86_CR4_PAE)
>   ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PAE);
> +
> + /* And finally go back to 32-bit mode.  */
>   efer = 0;
>   ctxt->ops->set_msr(ctxt, MSR_EFER, efer);
>  
> @@ -4454,7 +4474,7 @@ static const struct opcode twobyte_table[256] = {
>   F(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N,
>   /* 0xA8 - 0xAF */
>   I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
> - II(No64 | EmulateOnUD | ImplicitOps, em_rsm, rsm),
> + II(EmulateOnUD | ImplicitOps, em_rsm, rsm),
>   F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
>   F(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
>   F(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
> 

What branch should I test this on top of?

Thank you!
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] KVM: x86: simplify RSM into 64-bit protected mode

2015-10-31 Thread Laszlo Ersek
On 10/30/15 16:36, Radim Krčmář wrote:
> This series bases on "KVM: x86: fix RSM into 64-bit protected mode,
> round 2" and reverts it in [3/3].  To avoid regressions after doing so,
> [1/2] introduces a helper that is used in [2/2] to hopefully get the
> same behavior.
> 
> I'll set up test environment next week, unless a random act of kindness
> allows me not to.
> 
> 
> Radim Krčmář (3):
>   KVM: x86: add read_phys to x86_emulate_ops
>   KVM: x86: handle SMBASE as physical address in RSM
>   KVM: x86: simplify RSM into 64-bit protected mode
> 
>  arch/x86/include/asm/kvm_emulate.h | 10 +
>  arch/x86/kvm/emulate.c | 44 
> +-
>  arch/x86/kvm/x86.c | 10 +
>  3 files changed, 30 insertions(+), 34 deletions(-)
> 

Tested-by: Laszlo Ersek <ler...@redhat.com>

Thanks, Radim.
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: fix RSM into 64-bit protected mode, round 2

2015-10-31 Thread Laszlo Ersek
On 10/30/15 16:40, Radim Krčmář wrote:
> 2015-10-26 17:32+0100, Paolo Bonzini:
>> On 26/10/2015 16:43, Laszlo Ersek wrote:
>>>> The code would be cleaner if we had a different approach, but this works
>>>> too and is safer for stable. In case you prefer to leave the rewrite for
>>>> a future victim,
>>>
>>> It's hard to express how much I prefer that.
>>
>> Radim, if you want to have a try go ahead since I cannot apply the patch
>> until next Monday.
> 
> The future I originally had in mind was more hoverboardy, but a series
> just landed, "KVM: x86: simplify RSM into 64-bit protected mode".
> 
> Laszlo, I'd be grateful if you could check that it works.
> 

I'm tagging it for next week, thanks.
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: fix RSM into 64-bit protected mode, round 2

2015-10-26 Thread Laszlo Ersek
On 10/26/15 16:37, Radim Krčmář wrote:
> 2015-10-23 23:43+0200, Laszlo Ersek:
>> Commit b10d92a54dac ("KVM: x86: fix RSM into 64-bit protected mode")
>> reordered the rsm_load_seg_64() and rsm_enter_protected_mode() calls,
>> relative to each other. The argument that said commit made was correct,
>> however putting rsm_enter_protected_mode() first whole-sale violated the
>> following (correct) invariant from em_rsm():
>>
>>  * Get back to real mode, to prepare a safe state in which to load
>>  * CR0/CR3/CR4/EFER.  Also this will ensure that addresses passed
>>  * to read_std/write_std are not virtual.
> 
> Nice catch.
> 
>> Namely, rsm_enter_protected_mode() may re-enable paging, *after* which
>>
>>   rsm_load_seg_64()
>> GET_SMSTATE()
>>   read_std()
>>
>> will try to interpret the (smbase + offset) address as a virtual one. This
>> will result in unexpected page faults being injected to the guest in
>> response to the RSM instruction.
> 
> I think this is a good time to introduce the read_phys helper, which we
> wanted to avoid with that assumption.
> 
>> Split rsm_load_seg_64() in two parts:
>>
>> - The first part, rsm_stash_seg_64(), shall call GET_SMSTATE() while in
>>   real mode, and save the relevant state off SMRAM into an array local to
>>   rsm_load_state_64().
>>
>> - The second part, rsm_load_seg_64(), shall occur after entering protected
>>   mode, but the segment details shall come from the local array, not the
>>   guest's SMRAM.
>>
>> Fixes: b10d92a54dac25a6152f1aa1ffc95c12908035ce
>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>> Cc: Radim Krčmář <rkrc...@redhat.com>
>> Cc: Jordan Justen <jordan.l.jus...@intel.com>
>> Cc: Michael Kinney <michael.d.kin...@intel.com>
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Laszlo Ersek <ler...@redhat.com>
>> ---
> 
> The code would be cleaner if we had a different approach, but this works
> too and is safer for stable. In case you prefer to leave the rewrite for
> a future victim,

It's hard to express how much I prefer that.

> 
> Reviewed-by: Radim Krčmář <rkrc...@redhat.com>
> 

Thank you!
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: x86: fix RSM into 64-bit protected mode, round 2

2015-10-23 Thread Laszlo Ersek
Commit b10d92a54dac ("KVM: x86: fix RSM into 64-bit protected mode")
reordered the rsm_load_seg_64() and rsm_enter_protected_mode() calls,
relative to each other. The argument that said commit made was correct,
however putting rsm_enter_protected_mode() first whole-sale violated the
following (correct) invariant from em_rsm():

 * Get back to real mode, to prepare a safe state in which to load
 * CR0/CR3/CR4/EFER.  Also this will ensure that addresses passed
 * to read_std/write_std are not virtual.

Namely, rsm_enter_protected_mode() may re-enable paging, *after* which

  rsm_load_seg_64()
GET_SMSTATE()
  read_std()

will try to interpret the (smbase + offset) address as a virtual one. This
will result in unexpected page faults being injected to the guest in
response to the RSM instruction.

Split rsm_load_seg_64() in two parts:

- The first part, rsm_stash_seg_64(), shall call GET_SMSTATE() while in
  real mode, and save the relevant state off SMRAM into an array local to
  rsm_load_state_64().

- The second part, rsm_load_seg_64(), shall occur after entering protected
  mode, but the segment details shall come from the local array, not the
  guest's SMRAM.

Fixes: b10d92a54dac25a6152f1aa1ffc95c12908035ce
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Radim Krčmář <rkrc...@redhat.com>
Cc: Jordan Justen <jordan.l.jus...@intel.com>
Cc: Michael Kinney <michael.d.kin...@intel.com>
Cc: sta...@vger.kernel.org
Signed-off-by: Laszlo Ersek <ler...@redhat.com>
---
 arch/x86/kvm/emulate.c | 37 ++---
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 9da95b9..25e16b6 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2311,7 +2311,16 @@ static int rsm_load_seg_32(struct x86_emulate_ctxt 
*ctxt, u64 smbase, int n)
return X86EMUL_CONTINUE;
 }
 
-static int rsm_load_seg_64(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
+struct rsm_stashed_seg_64 {
+   u16 selector;
+   struct desc_struct desc;
+   u32 base3;
+};
+
+static int rsm_stash_seg_64(struct x86_emulate_ctxt *ctxt,
+   struct rsm_stashed_seg_64 *stash,
+   u64 smbase,
+   int n)
 {
struct desc_struct desc;
int offset;
@@ -2326,10 +2335,20 @@ static int rsm_load_seg_64(struct x86_emulate_ctxt 
*ctxt, u64 smbase, int n)
set_desc_base(,  GET_SMSTATE(u32, smbase, offset + 8));
base3 =   GET_SMSTATE(u32, smbase, offset + 12);
 
-   ctxt->ops->set_segment(ctxt, selector, , base3, n);
+   stash[n].selector = selector;
+   stash[n].desc = desc;
+   stash[n].base3 = base3;
return X86EMUL_CONTINUE;
 }
 
+static inline void rsm_load_seg_64(struct x86_emulate_ctxt *ctxt,
+  struct rsm_stashed_seg_64 *stash,
+  int n)
+{
+   ctxt->ops->set_segment(ctxt, stash[n].selector, [n].desc,
+  stash[n].base3, n);
+}
+
 static int rsm_enter_protected_mode(struct x86_emulate_ctxt *ctxt,
 u64 cr0, u64 cr4)
 {
@@ -2419,6 +2438,7 @@ static int rsm_load_state_64(struct x86_emulate_ctxt 
*ctxt, u64 smbase)
u32 base3;
u16 selector;
int i, r;
+   struct rsm_stashed_seg_64 stash[6];
 
for (i = 0; i < 16; i++)
*reg_write(ctxt, i) = GET_SMSTATE(u64, smbase, 0x7ff8 - i * 8);
@@ -2460,15 +2480,18 @@ static int rsm_load_state_64(struct x86_emulate_ctxt 
*ctxt, u64 smbase)
dt.address =GET_SMSTATE(u64, smbase, 0x7e68);
ctxt->ops->set_gdt(ctxt, );
 
+   for (i = 0; i < ARRAY_SIZE(stash); i++) {
+   r = rsm_stash_seg_64(ctxt, stash, smbase, i);
+   if (r != X86EMUL_CONTINUE)
+   return r;
+   }
+
r = rsm_enter_protected_mode(ctxt, cr0, cr4);
if (r != X86EMUL_CONTINUE)
return r;
 
-   for (i = 0; i < 6; i++) {
-   r = rsm_load_seg_64(ctxt, smbase, i);
-   if (r != X86EMUL_CONTINUE)
-   return r;
-   }
+   for (i = 0; i < ARRAY_SIZE(stash); i++)
+   rsm_load_seg_64(ctxt, stash, i);
 
return X86EMUL_CONTINUE;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-20 Thread Laszlo Ersek
Hi,

On 10/20/15 19:27, Janusz wrote:
> W dniu 15.10.2015 o 20:46, Laszlo Ersek pisze:
>> On 10/15/15 18:53, Kinney, Michael D wrote:
>>> Laszlo,
>>>
>>> There is already a PCD for this timeout that is used by CpuMpPei.
>>>
>>> gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds
>>>
>>> I noticed that CpuDxe is using a hard coded AP timeout.  I think we should 
>>> just use this same PCD for both the PEI and DXE CPU module and then set it 
>>> for OVMF to the compatible value.
>> Perfect, thank you!
>>
>> (I notice the default in the DEC file is 5, which is half of what
>> the DXE driver hardcodes.)
>>
>> Now we only need a recommended (or experimental) value for it, and an
>> explanation why 100*1000 is no longer sufficient on KVM :)
>>
>> Thanks!
>> Laszlo
>>
>>
>>
> Laszlo,
> 
> I saw that there is already some change in ovmf for MicroSecondDelay
> https://github.com/tianocore/edk2/commit/1e410eadd80c328e66868263b3006a274ce81ae0
> Is that a fix for it? Because I tried it and it still doesn't work for
> me: https://bpaste.net/show/2514b51bf41f
> I still get internal error

I think you guys are now "mature enough OVMF users" to start employing
the correct terminology.

"edk2" (also spelled as "EDK II") is: "a modern, feature-rich,
cross-platform firmware development environment for the UEFI and PI
specifications".

The source tree contains a whole bunch of modules (drivers,
applications, libraries), organized into packages.

"OVMF" usually denotes a firmware binary built from one of the
OvmfPkg/OvmfPkg*.dsc "platform description files". Think of them as "top
level makefiles". The difference between them is the target architecture
(there's Ia32, X64, and Ia32X64 -- the last one means that the SEC and
PEI phases are 32-bit, whereas the DXE and later phases are 64-bit.) In
practice you'll only care about full X64.

Now, each of OvmfPkg/OvmfPkg*.dsc builds the following three kinds of
modules into the final binary:
- platform-independent modules from various top-level packages
- platform- (ie. Ia32/X64-) dependent modules from various top-level
  packages
- modules from under OvmfPkg that are specific to QEMU/KVM (and Xen, if
  you happen to use OVMF with Xen)

Now, when you reference a commit like 1e410ead above, you can look at
the diffstat, and decide if it is OvmfPkg-specific (third category
above) or not. Here you see UefiCpuPkg, which happens to be the second
category.

The important point is: please do *not* call any and all edk2 patches
"OVMF changes", indiscriminately. That's super confusing for people who
understand the above distinctions. Which now you do too. :)

Let me add that in edk2, patches that straddle top level packages are
generally forbidden -- you can't have a patch that modifies OvmfPkg and
UefiCpuPkg at the same time, modulo *very* rare exceptions. If a feature
or bugfix needs to touch several top-level packages, the series must be
built up carefully in stages.

Knowing all of the above, you can tell that the patch you referenced had
only *enabled* OvmfPkg to customize UefiCpuPkg, via
"PcdCpuApInitTimeOutInMicroSeconds". But for that customization to occur
actually, a small patch for OvmfPkg will be necessary too, in order to
set "PcdCpuApInitTimeOutInMicroSeconds" differently from the default.

I plan to send that patch soon. If you'd like to be CC'd, that's great
(reporting back with a Tested-by is even better!), but I'll need your
real name for that. (Or any name that looks like a real name.)

Thanks!
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-16 Thread Laszlo Ersek
On 10/16/15 05:05, Xiao Guangrong wrote:
> 
> 
> On 10/16/2015 12:18 AM, Laszlo Ersek wrote:
>> CC'ing Jordan and Chen Fan.
>>
>> On 10/15/15 09:10, Xiao Guangrong wrote:
>>>
>>>
>>> On 10/15/2015 02:58 PM, Janusz wrote:
>>>> W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:
>>>>>
>>>>>
>>>>> On 10/15/2015 02:19 PM, Janusz wrote:
>>>>>> W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Well, the bug may be not in KVM. When this bug happened, i saw OVMF
>>>>>>> only checked 1 CPU out, there is the log from OVMF's debug input:
>>>>>>>
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCD
>>>>>>>  Flushing GCDs
>>>>>>> Detect CPU count: 1
>>>>>>>
>>>>>>> So that the startup code has been freed however the APs are still
>>>>>>> running,
>>>>>>> i think that why we saw the vCPUs executed on unexpected address.
>>>>>>>
>>>>>>> After digging into OVMF's code, i noticed that BSP CPU waits for APs
>>>>>>> for a fixed timer period, however, KVM recent changes require zap
>>>>>>> all
>>>>>>> mappings if CR0.CD is changed, that means the APs need more time to
>>>>>>> startup.
>>>>>>>
>>>>>>> After following changes to OVMF, the bug is completely gone on my
>>>>>>> side:
>>>>>>>
>>>>>>> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
>>>>>>> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
>>>>>>> @@ -454,7 +454,9 @@ StartApsStackless (
>>>>>>>   //
>>>>>>>   // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
>>>>>>> routine
>>>>>>>   //
>>>>>>> -  MicroSecondDelay (100 * 1000);
>>>>>>> +  MicroSecondDelay (10 * 100 * 1000);
>>>>>>>
>>>>>>>   return EFI_SUCCESS;
>>>>>>> }
>>>>>>>
>>>>>>> Janusz, could you please check this instead? You can switch to your
>>>>>>> previous kernel to do this test.
>>>>>>>
>>>>>>>
>>>>>> Ok, now first time when I started VM I was able to start system
>>>>>> successfully. When I turned it off and started it again, it
>>>>>> restarted my
>>>>>> vm at system boot couple of times. Sometimes I also get very high cpu
>>>>>> usage for no reason. Also, I get less fps in GTA 5 than in kernel
>>>>>> 4.1, I
>>>>>> get something like 30-55, but on 4.1 I get all the time 60 fps.
>>>>>> This is
>>>>>> my new log: https://bpaste.net/show/61a122ad7fe5
>>>>>>
>>>>>
>>>>> Just confirm: the Qemu internal error did not appear any more, right?
>>>> Yes, when I reverted your first patch, switched to -vga std from -vga
>>>> none and didn't passthrough my GPU (case when I got this internal
>>>> error), vm started without problem. I even didn't get any VM restarts
>>>> like with passthrough
>>>>
>>>
>>> Wow, it seems we have fixed the QEMU internal error now. :)
>>>
>>> Recurrently, Paolo has reverted some MTRR patches, was your test
>>> based on these reverted patches?
>>>
>>> The GPU passthrough issue may be related to vfio (not sure), Alex, do
>>> you have any idea?
>>>
>>> Laszlo, could you please check the root case is reasonable and fix it in
>>> OVMF if it's right?
>>
>> The code that you have found is in edk2's EFI_MP_SERVICES_PROTOCOL
>> implementation -- more closely, its initial CPU counter code --, from
>> edk2 git commit 533263ee5a7f. It is not specific to OVMF -- it is
>> generic edk2 code for Intel processors. (I'm CC'ing Jordan and Chen Fan
>> because they authored 

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Laszlo Ersek
CC'ing Jordan and Chen Fan.

On 10/15/15 09:10, Xiao Guangrong wrote:
> 
> 
> On 10/15/2015 02:58 PM, Janusz wrote:
>> W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:
>>>
>>>
>>> On 10/15/2015 02:19 PM, Janusz wrote:
 W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>
>
>
> Well, the bug may be not in KVM. When this bug happened, i saw OVMF
> only checked 1 CPU out, there is the log from OVMF's debug input:
>
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCDs
> Detect CPU count: 1
>
> So that the startup code has been freed however the APs are still
> running,
> i think that why we saw the vCPUs executed on unexpected address.
>
> After digging into OVMF's code, i noticed that BSP CPU waits for APs
> for a fixed timer period, however, KVM recent changes require zap all
> mappings if CR0.CD is changed, that means the APs need more time to
> startup.
>
> After following changes to OVMF, the bug is completely gone on my
> side:
>
> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
> @@ -454,7 +454,9 @@ StartApsStackless (
>  //
>  // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
> routine
>  //
> -  MicroSecondDelay (100 * 1000);
> +  MicroSecondDelay (10 * 100 * 1000);
>
>  return EFI_SUCCESS;
>}
>
> Janusz, could you please check this instead? You can switch to your
> previous kernel to do this test.
>
>
 Ok, now first time when I started VM I was able to start system
 successfully. When I turned it off and started it again, it
 restarted my
 vm at system boot couple of times. Sometimes I also get very high cpu
 usage for no reason. Also, I get less fps in GTA 5 than in kernel
 4.1, I
 get something like 30-55, but on 4.1 I get all the time 60 fps. This is
 my new log: https://bpaste.net/show/61a122ad7fe5

>>>
>>> Just confirm: the Qemu internal error did not appear any more, right?
>> Yes, when I reverted your first patch, switched to -vga std from -vga
>> none and didn't passthrough my GPU (case when I got this internal
>> error), vm started without problem. I even didn't get any VM restarts
>> like with passthrough
>>
> 
> Wow, it seems we have fixed the QEMU internal error now. :)
> 
> Recurrently, Paolo has reverted some MTRR patches, was your test
> based on these reverted patches?
> 
> The GPU passthrough issue may be related to vfio (not sure), Alex, do
> you have any idea?
> 
> Laszlo, could you please check the root case is reasonable and fix it in
> OVMF if it's right?

The code that you have found is in edk2's EFI_MP_SERVICES_PROTOCOL
implementation -- more closely, its initial CPU counter code --, from
edk2 git commit 533263ee5a7f. It is not specific to OVMF -- it is
generic edk2 code for Intel processors. (I'm CC'ing Jordan and Chen Fan
because they authored the patch in question.)

If VCPUs need more time to rendezvous than written in the code, on
recent KVM, then I think we should introduce a new FixedPCD in
UefiCpuPkg (practically: a compile time constant) for the timeout. Which
is not hard to do.

However, we'll need two things:
- an idea about the concrete rendezvous timeout to set, from OvmfPkg

- a *detailed* explanation / elaboration on your words:

  "KVM recent changes require zap all mappings if CR0.CD is changed,
  that means the APs need more time to startup"

  Preferably with references to Linux kernel commits and the Intel SDM,
  so that n00bs like me can get a fleeting idea. Do you mean that with
  caching disabled, the APs execute their rendezvous code (from memory)
  more slowly?

> BTW, OVMF handles #UD with no trace - nothing is killed, and no call trace
> in the debug input...

There *is* a trace (of any unexpected exception -- at least for the
BSP), but unfortunately its location is not intuitive.

The exception handler that is built into OVMF
("UefiCpuPkg/Library/CpuExceptionHandlerLib") is again generic edk2
code, and it prints the trace directly to the serial port, regardless of
the fact that OVMF's DebugLib instance logs explicit DEBUGs to the QEMU
debug port. (The latter can be directed to the serial port as well, if
you build OVMF with -D DEBUG_ON_SERIAL_PORT, but this is not relevant here.)

If you reproduce the issue while looking at the (virtual) serial port of
the guest, I trust you will get a register dump.

Thanks!
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Laszlo Ersek
On 10/15/15 18:53, Kinney, Michael D wrote:
> Laszlo,
> 
> There is already a PCD for this timeout that is used by CpuMpPei.
> 
>   gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds
> 
> I noticed that CpuDxe is using a hard coded AP timeout.  I think we should 
> just use this same PCD for both the PEI and DXE CPU module and then set it 
> for OVMF to the compatible value.

Perfect, thank you!

(I notice the default in the DEC file is 5, which is half of what
the DXE driver hardcodes.)

Now we only need a recommended (or experimental) value for it, and an
explanation why 100*1000 is no longer sufficient on KVM :)

Thanks!
Laszlo


> 
> Mike
> 
>> -Original Message-
>> From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of
>> Laszlo Ersek
>> Sent: Thursday, October 15, 2015 9:19 AM
>> To: Xiao Guangrong
>> Cc: kvm@vger.kernel.org; Justen, Jordan L; edk2-de...@ml01.01.org; Alex
>> Williamson; Chen Fan; Paolo Bonzini; Wanpeng Li
>> Subject: Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is
>> completely disabled
>>
>> CC'ing Jordan and Chen Fan.
>>
>> On 10/15/15 09:10, Xiao Guangrong wrote:
>>>
>>>
>>> On 10/15/2015 02:58 PM, Janusz wrote:
>>>> W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:
>>>>>
>>>>>
>>>>> On 10/15/2015 02:19 PM, Janusz wrote:
>>>>>> W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Well, the bug may be not in KVM. When this bug happened, i saw
>> OVMF
>>>>>>> only checked 1 CPU out, there is the log from OVMF's debug input:
>>>>>>>
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCD
>>>>>>> Flushing GCDs
>>>>>>> Detect CPU count: 1
>>>>>>>
>>>>>>> So that the startup code has been freed however the APs are still
>>>>>>> running,
>>>>>>> i think that why we saw the vCPUs executed on unexpected address.
>>>>>>>
>>>>>>> After digging into OVMF's code, i noticed that BSP CPU waits for APs
>>>>>>> for a fixed timer period, however, KVM recent changes require zap all
>>>>>>> mappings if CR0.CD is changed, that means the APs need more time to
>>>>>>> startup.
>>>>>>>
>>>>>>> After following changes to OVMF, the bug is completely gone on my
>>>>>>> side:
>>>>>>>
>>>>>>> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
>>>>>>> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
>>>>>>> @@ -454,7 +454,9 @@ StartApsStackless (
>>>>>>>  //
>>>>>>>  // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
>>>>>>> routine
>>>>>>>  //
>>>>>>> -  MicroSecondDelay (100 * 1000);
>>>>>>> +  MicroSecondDelay (10 * 100 * 1000);
>>>>>>>
>>>>>>>  return EFI_SUCCESS;
>>>>>>>}
>>>>>>>
>>>>>>> Janusz, could you please check this instead? You can switch to your
>>>>>>> previous kernel to do this test.
>>>>>>>
>>>>>>>
>>>>>> Ok, now first time when I started VM I was able to start system
>>>>>> successfully. When I turned it off and started it again, it
>>>>>> restarted my
>>>>>> vm at system boot couple of times. Sometimes I also get very high cpu
>>>>>> usage for no reason. Also, I get less fps in GTA 5 than in kernel
>>>>>> 4.1, I
>>>>>> get something like 30-55, but on 4.1 I get all the time 60 fps. This is
>>>>>> my new log: https://bpaste.net/show/61a122ad7fe5
>>>>>>
>>>>>
>>>>> Just confirm: the Qemu internal error did not appear any more, right?
>>>> Yes, when I reverted your first patch, switched to -vga std from -vga
>>>> none and didn't passt

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-14 Thread Laszlo Ersek
On 10/14/15 10:32, Xiao Guangrong wrote:
> 
> 
> On 10/14/2015 04:24 PM, Xiao Guangrong wrote:
>>
>>
>> On 10/14/2015 03:37 PM, Janusz wrote:
>>> I was able to run my virtual machine with this, but had very high cpu
>>> usage when something happen in it like booting system. once, my virtual
>>> machine hang and I couln't even get my mouse / keyboard back from qemu.
>>> When I did vga passthrough, I didn't get any video output, and cpu usage
>>> was also high. Tried it on 4.3
>>
>> Which tree are you using? Is it kvm tree?
>> Could you please work on queue brancn on current kvm tree based on
>> top commit 73917739334c6509: KVM: x86: fix SMI to halted VCPU.
>>
>> Hmm... interesting, this diff works on my box...
> 
> Forgot to say that i built my test env following the instructions on
> kvm-wiki:
> http://www.linux-kvm.org/page/OVMF

Wow! Someone actually cares about the whitepaper. Thank you. :)

Laszlo

> 
> My test script is attached, and i will try to build the env like yours
> as much
> as possible...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-09-18 Thread Laszlo Ersek
On 09/18/15 11:37, Janusz wrote:
> Hello,
> 
> I am writting about this patch that was posted by Xiao:
> http://www.spinics.net/lists/kvm/msg119044.html and this:
> http://www.spinics.net/lists/kvm/msg119045.html
> I've tested both kernel 4.2 and 4.3 and problem still exists when I use
> OVMF - 100% cpu usage, VM resetting, while it works properly on kernel 4.1

My last (still current) request remains "please quirk it". See the end
of ,
and other messages in that subthread.

I haven't been following kernel development, so maybe the quirk has not
happened. No clue.

... "VM resetting" looks something different though; I've been under the
impression that the pedantic (=unquirked) MTRR configuration didn't
impact other things than speed. Janusz, maybe you could contribute with
a host kernel bisection for the VM reset symptom.

Thanks
Laszlo

> ___
> edk2-devel mailing list
> edk2-de...@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] apparent SMBASE relocation issue with noexec enabled [was: MdeModulePkg DxeIpl: Add stack NX support]

2015-08-07 Thread Laszlo Ersek
On 08/07/15 12:38, Paolo Bonzini wrote:
 
 
 On 07/08/2015 01:02, Laszlo Ersek wrote:
 The trace covers the full lifetime of the guest (I started tracing
 before launching the guest, and I passed -no-reboot to qemu, so when the
 guest crashed, QEMU exited.)

 This was on 3.10.0-299.el7.x86_64.
 I repeated the test with EPT off. The guest doesn't crash this way, it spins 
 in a busy loop.

  qemu-system-i38-32767 [002] 55142.911133: kvm_emulate_insn: 0:7ffd790b: 
 0f aa
  qemu-system-i38-32767 [002] 55142.911139: kvm_cpuid:func 
 8001 rax 6e8 rbx 0 rcx 0 rdx 10
  qemu-system-i38-32767 [002] 55142.911148: kvm_enter_smm:vcpu 0: 
 leaving SMM, smbase 0x7ffc
  qemu-system-i38-32767 [002] 55142.911150: kvm_mmu_get_page: existing sp 
 gfn 7fe65 1/2 q3 --- !pge !nxe root 0 sync
  qemu-system-i38-32767 [002] 55142.911151: kvm_mmu_get_page: existing sp 
 gfn 7fe66 1/2 q3 --- !pge !nxe root 0 sync
  qemu-system-i38-32767 [002] 55142.911151: kvm_mmu_get_page: existing sp 
 gfn 7fe67 1/2 q3 --- !pge !nxe root 0 sync
  qemu-system-i38-32767 [002] 55142.911151: kvm_mmu_get_page: existing sp 
 gfn 7fe68 1/2 q3 --- !pge !nxe root 0 sync
  qemu-system-i38-32767 [002] 55142.911152: kvm_entry:vcpu 0
  qemu-system-i38-32767 [002] 55142.911152: kvm_exit: reason 
 EXCEPTION_NMI rip 0x7ffdb6b2 info 7fe88760 8b0e
  qemu-system-i38-32767 [002] 55142.911153: kvm_page_fault:   address 
 7fe88760 error_code b

 And then the last triplet is repeated infinitely.

 0x7ffdb6b2 is the address of the same first instruction executed after the 
 RSM.

 The address 0x7fe88760 seems to fall into an EfiBootServicesData allocation, 
 made in PEI (via a suitable HOB):

 Memory Allocation 0x0004 0x7FE69000 - 0x7FE88FFF
 
 You probably should use -cpu model,-lm,-nx.  EFER is not part of the
 32-bit state save map, so the EFER.NXE bit is not restored correctly on
 exit from SMM if you emulate a 32-bit CPU.

That did the trick, thank you! I added

feature policy='disable' name='nx'/

under /domain/cpu in the libvirt XML, and the SMM stuff works fine again.

I'm going to update

  [edk2] [PATCH 58/58] OvmfPkg: README: document SMM status
  http://thread.gmane.org/gmane.comp.bios.edk2.devel/329/focus=382

with the -nx feature flag.

Thanks!
Laszlo

 I have not debugged yet why it works without KVM, nor why the symptoms
 are different between EPT and non-EPT.
 
 Paolo
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] apparent SMBASE relocation issue with noexec enabled [was: MdeModulePkg DxeIpl: Add stack NX support]

2015-08-06 Thread Laszlo Ersek
On 08/07/15 00:38, Laszlo Ersek wrote:
 On 08/06/15 16:55, Paolo Bonzini wrote:


 On 06/08/2015 16:31, Laszlo Ersek wrote:
 kvm_cpuid:func 8001 rax 6e8 rbx 0 rcx 0 rdx 10
 kvm_enter_smm:vcpu 0: leaving SMM, smbase 0x7ffc
 kvm_entry:vcpu 0
 kvm_exit: reason TRIPLE_FAULT rip 0x7ffdb6b2 info 0 0
 kvm_userspace_exit:   reason KVM_EXIT_SHUTDOWN (8)

 Can you provide a trace with both kvm and kvmmmu events enabled?
 
 The trace-cmd report command printed the following to stderr:
 
 trace-cmd: No such file or directory
   function is_writable_pte not defined
 
 Not sure how serious that is; FWIW it did produce a human readable
 report. Please find it at
 http://people.redhat.com/~lersek/smbase_reloc_triple_fault/trace.txt.xz.
 
 The trace covers the full lifetime of the guest (I started tracing
 before launching the guest, and I passed -no-reboot to qemu, so when the
 guest crashed, QEMU exited.)
 
 This was on 3.10.0-299.el7.x86_64.

I repeated the test with EPT off. The guest doesn't crash this way, it spins in 
a busy loop.

 qemu-system-i38-32767 [002] 55142.911133: kvm_emulate_insn: 0:7ffd790b: 0f 
aa
 qemu-system-i38-32767 [002] 55142.911139: kvm_cpuid:func 8001 
rax 6e8 rbx 0 rcx 0 rdx 10
 qemu-system-i38-32767 [002] 55142.911148: kvm_enter_smm:vcpu 0: 
leaving SMM, smbase 0x7ffc
 qemu-system-i38-32767 [002] 55142.911150: kvm_mmu_get_page: existing sp 
gfn 7fe65 1/2 q3 --- !pge !nxe root 0 sync
 qemu-system-i38-32767 [002] 55142.911151: kvm_mmu_get_page: existing sp 
gfn 7fe66 1/2 q3 --- !pge !nxe root 0 sync
 qemu-system-i38-32767 [002] 55142.911151: kvm_mmu_get_page: existing sp 
gfn 7fe67 1/2 q3 --- !pge !nxe root 0 sync
 qemu-system-i38-32767 [002] 55142.911151: kvm_mmu_get_page: existing sp 
gfn 7fe68 1/2 q3 --- !pge !nxe root 0 sync
 qemu-system-i38-32767 [002] 55142.911152: kvm_entry:vcpu 0
 qemu-system-i38-32767 [002] 55142.911152: kvm_exit: reason 
EXCEPTION_NMI rip 0x7ffdb6b2 info 7fe88760 8b0e
 qemu-system-i38-32767 [002] 55142.911153: kvm_page_fault:   address 
7fe88760 error_code b

And then the last triplet is repeated infinitely.

0x7ffdb6b2 is the address of the same first instruction executed after the RSM.

The address 0x7fe88760 seems to fall into an EfiBootServicesData allocation, 
made in PEI (via a suitable HOB):

Memory Allocation 0x0004 0x7FE69000 - 0x7FE88FFF

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] apparent SMBASE relocation issue with noexec enabled [was: MdeModulePkg DxeIpl: Add stack NX support]

2015-08-06 Thread Laszlo Ersek
On 08/06/15 16:55, Paolo Bonzini wrote:
 
 
 On 06/08/2015 16:31, Laszlo Ersek wrote:
 kvm_cpuid:func 8001 rax 6e8 rbx 0 rcx 0 rdx 10
 kvm_enter_smm:vcpu 0: leaving SMM, smbase 0x7ffc
 kvm_entry:vcpu 0
 kvm_exit: reason TRIPLE_FAULT rip 0x7ffdb6b2 info 0 0
 kvm_userspace_exit:   reason KVM_EXIT_SHUTDOWN (8)
 
 Can you provide a trace with both kvm and kvmmmu events enabled?

The trace-cmd report command printed the following to stderr:

trace-cmd: No such file or directory
  function is_writable_pte not defined

Not sure how serious that is; FWIW it did produce a human readable
report. Please find it at
http://people.redhat.com/~lersek/smbase_reloc_triple_fault/trace.txt.xz.

The trace covers the full lifetime of the guest (I started tracing
before launching the guest, and I passed -no-reboot to qemu, so when the
guest crashed, QEMU exited.)

This was on 3.10.0-299.el7.x86_64.

Thank you!
Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


apparent SMBASE relocation issue with noexec enabled [was: MdeModulePkg DxeIpl: Add stack NX support]

2015-08-06 Thread Laszlo Ersek
Separate followup message with the symptoms I managed to gather about
the failure on KVM.

On 08/06/15 15:42, Laszlo Ersek wrote:
 Hi Star,

 On 08/06/15 11:44, Zeng, Star wrote:
 Hi Laszlo,

 Could you help take a try with the attached patch on your VM before I
 send it for formal review?

 I got your patch via your first (public) message as well -- not the
 one reflected by the list software, but on the direct route. So, I can
 test it and even comment on it a little bit below (in Thunderbird
 plaintext attachments are displayed inline, and if you select the full
 contents of the body window before hitting Reply All, then the
 entire selection will be quoted, including the inline displayed text
 attachments).

 ** So, test results:

 - With X64 DXE, your patch solves the issue for me.

 - With Ia32 DXE, your patch also soves the issue.

 - With this patch applied, if I use Ia32 DXE *plus* my SMM series,
   then:

   - Using TCG in QEMU (= software emulation), the patch solves the
 issue.

   - Using KVM with QEMU (= hardware virtualization), I continue seeing
 reboots, but they hit in a different place. Somewhere inside or
 after the SMBASE relocation. I believe this is a KVM bug. I will
 send a separate email about this.

** Bottom-up investigation:

I captured a KVM trace. The interesting part is not very long (fewer
than 300 lines), I just searched for a triple fault, and then searched
backwards for an SMI:

 kvm_apic_ipi: dst 0 vec 0 (SMI|physical|assert|edge|dst)
 kvm_apic_accept_irq:  apicid 0 vec 0 (SMI|edge)
 kvm_enter_smm:vcpu 0: entering SMM, smbase 0x3
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x8000 info 184 0
 kvm_page_fault:   address 38000 error_code 184
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffa78ca info 184 0
 kvm_page_fault:   address 7ffd78ca error_code 184
 kvm_entry:vcpu 0
 kvm_exit: reason CR_ACCESS rip 0x7ffa78d0 info 3 0
 kvm_cr:   cr_write 3 = 0x7fe64000
 kvm_entry:vcpu 0
 kvm_exit: reason CR_ACCESS rip 0x7ffa78e0 info 4 0
 kvm_cr:   cr_write 4 = 0x660
 kvm_entry:vcpu 0
 kvm_exit: reason CR_ACCESS rip 0x7ffa78ec info 0 0
 kvm_cr:   cr_write 0 = 0x8033
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffa78ef info 81 0
 kvm_page_fault:   address 7fe66ff8 error_code 81
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffa78ef info 81 0
 kvm_page_fault:   address 7fe63eb8 error_code 81
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffa78ef info 181 0
 kvm_page_fault:   address 7ffdf710 error_code 181
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffd7906 info 182 0
 kvm_page_fault:   address 7ffc1ff8 error_code 182
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffd2f20 info 184 0
 kvm_page_fault:   address 7ffd2f20 error_code 184
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffd8ec4 info 184 0
 kvm_page_fault:   address 7ffd8ec4 error_code 184
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffd8ec4 info 181 0
 kvm_page_fault:   address 7ffdb8c8 error_code 181
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffda2a8 info 184 0
 kvm_page_fault:   address 7ffda2a8 error_code 184
 kvm_entry:vcpu 0
 kvm_exit: reason EPT_VIOLATION rip 0x7ffd9fe4 info 184 0
 kvm_page_fault:   address 7ffd9fe4 error_code 184
 kvm_entry:vcpu 0
 kvm_exit: reason CPUID rip 0x7ffd8923 info 0 0
 kvm_cpuid:func 1 rax 6e8 rbx 800 rcx 8021 rdx f89fbff
 kvm_entry:vcpu 0
 kvm_exit: reason MSR_READ rip 0x7ffd886f info 0 0
 kvm_msr:  msr_read 1b = 0xfee00900
 kvm_entry:vcpu 0
 kvm_exit: reason CPUID rip 0x7ffd8923 info 0 0
 kvm_cpuid:func 1 rax 6e8 rbx 800 rcx 8021 rdx f89fbff
 kvm_entry:vcpu 0
 kvm_exit: reason MSR_READ rip 0x7ffd886f info 0 0
 kvm_msr:  msr_read 1b = 0xfee00900
 kvm_entry:vcpu 0
 kvm_exit: reason CPUID rip 0x7ffd8923 info 0 0
 kvm_cpuid:func 0 rax a rbx 756e6547 rcx 6c65746e rdx 49656e69
 kvm_entry:vcpu 0
 kvm_exit: reason CPUID rip 0x7ffd8923 info 0 0
 kvm_cpuid:func 1 rax 6e8 rbx 800 rcx 8021 rdx f89fbff
 kvm_entry:vcpu 0
 kvm_exit: reason CPUID rip 0x7ffd8923 info 0 0
 kvm_cpuid:func 1 rax 6e8 rbx 800 rcx 8021 rdx f89fbff
 kvm_entry:vcpu 0
 kvm_exit: reason MSR_READ rip 0x7ffd886f info 0 0
 kvm_msr:  msr_read 1b = 0xfee00900
 kvm_entry

Re: MTRR setup in OVMF [was: PATCH v3 01/10 KVM: MMU: fix decoding cache type from MTRR]

2015-07-15 Thread Laszlo Ersek
On 07/15/15 00:37, Jordan Justen wrote:
 On 2015-07-14 14:29:11, Laszlo Ersek wrote:
 On 07/14/15 23:15, Paolo Bonzini wrote:
 The long delay that Alex reported (for the case when all guest memory
 was set to UC up-front) is due to the fact that the SEC phase of OVMF
 decompresses an approximately 1712 KB sized, LZMA-compressed blob, to
 approx. 896 KB worth of PEI drivers and 8192 KB worth of DXE and UEFI
 drivers -- and this decompression is extremely memory-intensive.

 (When Jordan implemented that reset vector first, we saw similar
 performance degradation on AMD hosts (albeit not due to MTRR but due to
 page attributes). See
 https://github.com/tianocore/edk2/commit/98f378a7. I'm only mentioning
 it here because it makes me appreciate the current problem report.)

 Anyway, the reset vector's page table building is implemented in
 OvmfPkg/ResetVector/Ia32/PageTables64.asm. The decompression in SEC
 can be found in OvmfPkg/Sec/SecMain.c, function DecompressMemFvs().

 Perhaps the OVMF reset vector should initialize the MTRRs for the BSP?

 That's an idea, yes, if someone feels sufficiently drawn to writing
 assembly.
 
 Maybe we can use MtrrLib in the SEC C code?

If the page table building in the reset vector is not too costly with
UC, then yes, it should suffice if MtrrLib was put to use first just
before the decompression in SEC.

Thanks
Laszlo

 -Jordan
 
 Complications:
 - the reset vector is specific to OvmfPkg only in the OvmfPkgX64.dsc
   build
 - it needs to be determined what memory to cover.

 I think SEC doesn't do any MMIO, so it should be enough to enable MTRRs
 and set the default type to writeback.

 Seems safe to me, off the top of my head (and testing could confirm /
 disprove quickly).

 In any case we're going to have to quirk it, because of the broken
 guests in the wild.

 Thanks.
 Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MTRR setup in OVMF [was: PATCH v3 01/10 KVM: MMU: fix decoding cache type from MTRR]

2015-07-15 Thread Laszlo Ersek
On 07/15/15 21:30, Xiao Guangrong wrote:
 
 Hi,
 
 I have posted the pachset to make OVMF happy and have CCed you guys,
 could you please check it if it works for you?

Sorry, I can't check it; I don't have an environment that exposes the
issue in the first place. Perhaps Alex can check it, or refer some of
the users that reported the problem to him to this patchset? (Those
users were using bleeding edge v4.2-rc1 anyway, unlike conservative me.)

Thanks!
Laszlo

 On 07/15/2015 05:15 AM, Paolo Bonzini wrote:
 The long delay that Alex reported (for the case when all guest memory
 was set to UC up-front) is due to the fact that the SEC phase of OVMF
 decompresses an approximately 1712 KB sized, LZMA-compressed blob, to
 approx. 896 KB worth of PEI drivers and 8192 KB worth of DXE and UEFI
 drivers -- and this decompression is extremely memory-intensive.

 (When Jordan implemented that reset vector first, we saw similar
 performance degradation on AMD hosts (albeit not due to MTRR but due to
 page attributes). See
 https://github.com/tianocore/edk2/commit/98f378a7. I'm only mentioning
 it here because it makes me appreciate the current problem report.)

 Anyway, the reset vector's page table building is implemented in
 OvmfPkg/ResetVector/Ia32/PageTables64.asm. The decompression in SEC
 can be found in OvmfPkg/Sec/SecMain.c, function DecompressMemFvs().

 Perhaps the OVMF reset vector should initialize the MTRRs for the BSP?
 I think SEC doesn't do any MMIO, so it should be enough to enable MTRRs
 and set the default type to writeback.

 In any case we're going to have to quirk it, because of the broken
 guests in the wild.

 Paolo


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


MTRR setup in OVMF [was: PATCH v3 01/10 KVM: MMU: fix decoding cache type from MTRR]

2015-07-14 Thread Laszlo Ersek
Cross-posting to edk2-devel.

Original sub-thread starts here:
http://thread.gmane.org/gmane.linux.kernel/1952205/focus=1994315

On 07/13/15 17:15, Xiao Guangrong wrote:
 
 
 On 07/13/2015 11:13 PM, Paolo Bonzini wrote:
 On 13/07/2015 16:45, Xiao Guangrong wrote:
 +/* MTRR is completely disabled, use UC for all of physical
 memory. */
 +if (!(mtrr_state-enabled  0x2))
 +return MTRR_TYPE_UNCACHABLE;

 actually disappears in commit fa61213746a7 (KVM: MTRR: simplify
 kvm_mtrr_get_guest_memory_type, 2015-06-15).

 :(

 Based on the SDM, UC is applied to all memory rather than default-type
 if MTRR is disabled.

 There are two issues I think.  One is that I cannot find in the current
 code that UC is applied to all memory rather than default-type if MTRR
 is disabled.  mtrr_default_type unconditionally looks at
 mtrr_state-deftype.
 
 Yes... Will fix.
 

 However, fast boot came back if return 0xFF here. So fast boot expects
 that the memory type is WB.

 Yes.


 static u8 mtrr_default_type(struct kvm_mtrr *mtrr_state)
 {
   if (mtrr_is_enabled(mtrr_state))
   return mtrr_state-deftype 
 IA32_MTRR_DEF_TYPE_TYPE_MASK;
   else
   return MTRR_TYPE_UNCACHABLE;
 }

 ?  Then it's easy to add a quirk that makes the default WRITEBACK until
 MTRRs are enabled.

 It is the wrong configure in OVMF... shall we need to adjust KVM to
 satisfy
 OVMF?

 That's what quirks are for...  The firmware should still be fixed of
 course.
 
 I see, will do it.

The long delay that Alex reported (for the case when all guest memory
was set to UC up-front) is due to the fact that the SEC phase of OVMF
decompresses an approximately 1712 KB sized, LZMA-compressed blob, to
approx. 896 KB worth of PEI drivers and 8192 KB worth of DXE and UEFI
drivers -- and this decompression is extremely memory-intensive.

(Refer to Firmware image structure in the OVMF whitepaper,
http://www.linux-kvm.org/page/OVMF.)

Perhaps also significant, with this initial MTRR change: the x86_64
reset vector builds some page tables too. (Refer to Select features |
X64-specific reset vector for OVMF in the OVMF whitepaper.)

(When Jordan implemented that reset vector first, we saw similar
performance degradation on AMD hosts (albeit not due to MTRR but due to
page attributes). See
https://github.com/tianocore/edk2/commit/98f378a7. I'm only mentioning
it here because it makes me appreciate the current problem report.)

Anyway, the reset vector's page table building is implemented in
OvmfPkg/ResetVector/Ia32/PageTables64.asm. The decompression in SEC
can be found in OvmfPkg/Sec/SecMain.c, function DecompressMemFvs().
(It is recommended to refer heavily to the OVMF whitepaper, or at least
to drink heavily.)

In the PEI phase, we do set up MTRRs sensibly, see
OvmfPkg/PlatformPei/MemDetect.c, function QemuInitializeRam().
However, that's too late with regard this problem report, because PEI
modules run *from* one of the firmware volumes that SEC expands with LZMA.

Anyway, the logic in QemuInitializeRam() relies on the MtrrLib library
class, which OVMF resolves to the
UefiCpuPkg/Library/MtrrLib/MtrrLib.inf instance. The latter has no
client module type restriction (it's BASE), so it could be used in SEC
too, if someone were to write patches.

I'm sorry but I really can't take this on now. So, respectfully:
- please quirk it in KVM for now (SeaBIOS is also affected),
- patches welcome for OVMF, as always.

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MTRR setup in OVMF [was: PATCH v3 01/10 KVM: MMU: fix decoding cache type from MTRR]

2015-07-14 Thread Laszlo Ersek
On 07/14/15 23:15, Paolo Bonzini wrote:
 The long delay that Alex reported (for the case when all guest memory
 was set to UC up-front) is due to the fact that the SEC phase of OVMF
 decompresses an approximately 1712 KB sized, LZMA-compressed blob, to
 approx. 896 KB worth of PEI drivers and 8192 KB worth of DXE and UEFI
 drivers -- and this decompression is extremely memory-intensive.

 (When Jordan implemented that reset vector first, we saw similar
 performance degradation on AMD hosts (albeit not due to MTRR but due to
 page attributes). See
 https://github.com/tianocore/edk2/commit/98f378a7. I'm only mentioning
 it here because it makes me appreciate the current problem report.)

 Anyway, the reset vector's page table building is implemented in
 OvmfPkg/ResetVector/Ia32/PageTables64.asm. The decompression in SEC
 can be found in OvmfPkg/Sec/SecMain.c, function DecompressMemFvs().
 
 Perhaps the OVMF reset vector should initialize the MTRRs for the BSP?

That's an idea, yes, if someone feels sufficiently drawn to writing
assembly. Complications:
- the reset vector is specific to OvmfPkg only in the OvmfPkgX64.dsc
  build
- it needs to be determined what memory to cover.

 I think SEC doesn't do any MMIO, so it should be enough to enable MTRRs
 and set the default type to writeback.

Seems safe to me, off the top of my head (and testing could confirm /
disprove quickly).

 In any case we're going to have to quirk it, because of the broken
 guests in the wild.

Thanks.
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-10 Thread Laszlo Ersek
On 07/10/15 16:13, Paolo Bonzini wrote:
 
 
 On 09/07/2015 20:57, Laszlo Ersek wrote:
 Without EPT, you don't
 hit the processor limitation with your setup, but the user should 
 nevertheless
 still be notified.

 I disagree.
 
 FWIW, I also disagree (and it looks like Bandan disagrees with himself
 now :)).
 
 In fact, I think shadow paging code should also emulate
 this behavior if the gpa is out of range.

 I disagree.
 
 Same here.
 
 There is no out of range gpa. QEMU allocates enough memory, and it
 should be completely transparent to the guest. The fact that it silently
 breaks with nested paging if the host processor doesn't have enough
 address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
 sure, but I suspect it's a hardware bug).
 
 It's a hardware bug, possibly due to some limitations in the physical
 addresses that the TLB can store?  I guess KVM could detect the
 situation and fall back to slw shadow paging.
 
 ... In any case, please understand that I'm not campaigning for this
 warning :) IIRC the warning was your (very welcome!) idea after I
 reported the problem; I'm just trying to ensure that the warning match
 the exact issue I encountered.
 
 Yup.  I think the right thing to do would be to hide memory above the
 limit.

How so?

- The stack would not be doing what the user asks for. Pass -m a_lot,
and the guest would silently see less memory. If the user found out,
he'd immediately ask (or set out debugging) why. I think if the user's
request cannot be satisfied, the stack should fail hard.

- Assuming the user didn't find out, and the guest just worked (with
less memory than the user asked for), then the hidden portion of the
memory (that QEMU allocated nonetheless) would be just wasted, on the
host system. (Especially with overcommit_memory=2 (which is the most
prudent setting).)

Thanks
Laszlo

  A kernel patch to query the limit is definitely necessary, but
 it needs to return e.g. 48 for shadow paging (otherwise you could just
 use CPUID).  I'm not sure if the rest is possible with just QEMU, or it
 requires help from the firmware.  Probably yes.
 
 Paolo
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-10 Thread Laszlo Ersek
On 07/10/15 16:59, Paolo Bonzini wrote:
 
 
 On 10/07/2015 16:57, Laszlo Ersek wrote:
 ... In any case, please understand that I'm not campaigning for this
 warning :) IIRC the warning was your (very welcome!) idea after I
 reported the problem; I'm just trying to ensure that the warning match
 the exact issue I encountered.

 Yup.  I think the right thing to do would be to hide memory above the
 limit.
 How so?

 - The stack would not be doing what the user asks for. Pass -m a_lot,
 and the guest would silently see less memory. If the user found out,
 he'd immediately ask (or set out debugging) why. I think if the user's
 request cannot be satisfied, the stack should fail hard.
 
 That's another possibility.  I think both of them are wrong depending on
 _why_ you're using -m a lot in the first place.
 
 Considering that this really happens (on Xeons) only for 1TB+ guests,

I reported this issue because I ran into it with a ~64GB guest. From my
/proc/cpuinfo:

model name  : Intel(R) Core(TM) i7 CPU   M 620  @ 2.67GHz
address sizes   : 36 bits physical, 48 bits virtual

I was specifically developing 64GB+ support for OVMF, and this
limitation caused me to think that there was a bug in my OVMF patches.
(There wasn't.) An error message from QEMU, advising me to turn off EPT,
would have saved me many hours.

Thanks
Laszlo

 it's probably just for debugging and then hiding the memory makes some
 sense.
 
 Paolo
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-09 Thread Laszlo Ersek
On 07/09/15 20:32, Bandan Das wrote:
 Paolo Bonzini pbonz...@redhat.com writes:
 
 On 09/07/2015 08:43, Laszlo Ersek wrote:
 On 07/09/15 08:09, Paolo Bonzini wrote:


 On 09/07/2015 00:36, Bandan Das wrote:
 Let userspace inquire the maximum physical address width
 of the host processors; this can be used to identify maximum
 memory that can be assigned to the guest.

 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  arch/x86/kvm/x86.c   | 3 +++
  include/uapi/linux/kvm.h | 1 +
  2 files changed, 4 insertions(+)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index bbaf44e..97d6746 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, 
 long ext)
   case KVM_CAP_NR_MEMSLOTS:
   r = KVM_USER_MEM_SLOTS;
   break;
 + case KVM_CAP_PHY_ADDR_WIDTH:
 + r = boot_cpu_data.x86_phys_bits;
 + break;

 Userspace can just use CPUID, can't it?

 I believe KVM's cooperation is necessary, for the following reason:

 The truncation only occurs when the guest-phys - host-phys translation
 is done in hardware, *and* the phys bits of the host processor are
 insufficient to represent the highest guest-phys address that the guest
 will ever face.

 The first condition (of course) means that the truncation depends on EPT
 being enabled. (I didn't test on AMD so I don't know if RVI has the same
 issue.) If EPT is disabled, either because the host processor lacks it,
 or because the respective kvm_intel module parameter is set so, then the
 issue cannot be experienced.

 Therefore I believe a KVM patch is necessary.

 However, this specific patch doesn't seem sufficient; it should also
 consider whether EPT is enabled. (And the ioctl should be perhaps
 renamed to reflect that -- what QEMU needs to know is not the raw
 physical address width of the host processor, but whether that width
 will cause EPT to silently truncate high guest-phys addresses.)

 Right; if you want to consider whether EPT is enabled (which is the
 right thing to do, albeit it makes for a much bigger patch) a KVM patch
 is necessary.  In that case you also need to patch the API documentation.
 
 Note that this patch really doesn't do anything except for printing a
 message that something might potentially go wrong.

Yes.

 Without EPT, you don't
 hit the processor limitation with your setup, but the user should nevertheless
 still be notified.

I disagree.

 In fact, I think shadow paging code should also emulate
 this behavior if the gpa is out of range.

I disagree.

There is no out of range gpa. QEMU allocates enough memory, and it
should be completely transparent to the guest. The fact that it silently
breaks with nested paging if the host processor doesn't have enough
address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
sure, but I suspect it's a hardware bug). In any case the guest
shouldn't care at all. It is a *virtual* machine, and the VMM should lie
to it plausibly enough. How much RAM, and how many phys address bits the
host has, is a performance question, but it should not be a correctness
question. A 256 GB guest should run (slowly, but correctly) on a laptop
that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
space.

Because otherwise your argument could be extrapolated as TCG should
break too if the gpa is 'out of range'.

So, I disagree. Whatever memory you give to the guest should just work
(unless of course you want to emulate a small address width for the
*VCPU*, but that's absolutely not the use case here). What we have here
is a leaky abstraction: a PCPU limitation giving away a lie that the
guest should never notice. The guest should be able to use all memory
that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
vs. KVM-with-EPT. If the last case cannot work (due to hardware
limitations), that's fine, but then (and only then) a warning should be
printed.

... In any case, please understand that I'm not campaigning for this
warning :) IIRC the warning was your (very welcome!) idea after I
reported the problem; I'm just trying to ensure that the warning match
the exact issue I encountered.

Thanks!
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-09 Thread Laszlo Ersek
On 07/09/15 08:09, Paolo Bonzini wrote:
 
 
 On 09/07/2015 00:36, Bandan Das wrote:
 Let userspace inquire the maximum physical address width
 of the host processors; this can be used to identify maximum
 memory that can be assigned to the guest.

 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  arch/x86/kvm/x86.c   | 3 +++
  include/uapi/linux/kvm.h | 1 +
  2 files changed, 4 insertions(+)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index bbaf44e..97d6746 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2683,6 +2683,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
 ext)
  case KVM_CAP_NR_MEMSLOTS:
  r = KVM_USER_MEM_SLOTS;
  break;
 +case KVM_CAP_PHY_ADDR_WIDTH:
 +r = boot_cpu_data.x86_phys_bits;
 +break;
 
 Userspace can just use CPUID, can't it?

I believe KVM's cooperation is necessary, for the following reason:

The truncation only occurs when the guest-phys - host-phys translation
is done in hardware, *and* the phys bits of the host processor are
insufficient to represent the highest guest-phys address that the guest
will ever face.

The first condition (of course) means that the truncation depends on EPT
being enabled. (I didn't test on AMD so I don't know if RVI has the same
issue.) If EPT is disabled, either because the host processor lacks it,
or because the respective kvm_intel module parameter is set so, then the
issue cannot be experienced.

Therefore I believe a KVM patch is necessary.

However, this specific patch doesn't seem sufficient; it should also
consider whether EPT is enabled. (And the ioctl should be perhaps
renamed to reflect that -- what QEMU needs to know is not the raw
physical address width of the host processor, but whether that width
will cause EPT to silently truncate high guest-phys addresses.)

Thanks
Laszlo

 
 Paolo
 
  case KVM_CAP_PV_MMU:/* obsolete */
  r = 0;
  break;
 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 716ad4a..e7949a1 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_DISABLE_QUIRKS 116
  #define KVM_CAP_X86_SMM 117
  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 +#define KVM_CAP_PHY_ADDR_WIDTH 119
  
  #ifdef KVM_CAP_IRQ_ROUTING
  


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Laszlo Ersek
On 07/09/15 00:42, Bandan Das wrote:
 
 If a Linux guest is assigned more memory than is supported
 by the host processor, the guest is unable to boot. That
 is expected, however, there's no message indicating the user
 what went wrong. This change prints a message to stderr if
 KVM has the corresponding capability.
 
 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  linux-headers/linux/kvm.h | 1 +
  target-i386/kvm.c | 6 ++
  2 files changed, 7 insertions(+)
 
 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
 index 3bac873..6afad49 100644
 --- a/linux-headers/linux/kvm.h
 +++ b/linux-headers/linux/kvm.h
 @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_DISABLE_QUIRKS 116
  #define KVM_CAP_X86_SMM 117
  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 +#define KVM_CAP_PHY_ADDR_WIDTH 119
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 066d03d..66e3448 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  uint64_t shadow_mem;
  int ret;
  struct utsname utsname;
 +int max_phys_bits;
  
  ret = kvm_get_supported_msrs(s);
  if (ret  0) {
 @@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  }
  }
  
 +max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
 +if (max_phys_bits  (1ULL  max_phys_bits) = ram_size)
 +fprintf(stderr, Warning: The amount of memory assigned to the guest 
 
 +is more than that supported by the host CPU(s). Guest may be 
 unstable.\n);
 +
  if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
  smram_machine_done.notify = register_smram_listener;
  qemu_add_machine_init_done_notifier(smram_machine_done);
 

First, see my comments on the KVM patch.

Second, ram_size is not the right thing to compare. What should be
checked is whether the highest guest-physical address that maps to RAM
can be represented in the address width of the host processor (and only
if EPT is enabled, but that sub-condition belongs to the KVM patch).

Note that this is not the same as the check written in the patch. For
example, if you assume a 32-bit PCI hole with size 1 GB, then a total
guest RAM of size 63 GB will result in the highest guest-phys memory
address being 0xF__, which just fits into 36 bits.

Correspondingly, the above code would not print the warning for

  -m $((63 * 1024 + 1))

on my laptop (which has address sizes   : 36 bits physical, ...), even
though such a guest would not boot for me (with EPT enabled).

Please see

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15418/focus=15447

So, ram_size in the controlling expression should be replaced with
maximum_guest_ram_address (which should be inclusive, and the = relop
should be preserved).

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Laszlo Ersek
On 07/09/15 09:59, Paolo Bonzini wrote:
 
 
 On 09/07/2015 00:42, Bandan Das wrote:

 If a Linux guest is assigned more memory than is supported
 by the host processor, the guest is unable to boot. That
 is expected, however, there's no message indicating the user
 what went wrong. This change prints a message to stderr if
 KVM has the corresponding capability.

 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 
 This is not entirely correct, because it doesn't take the PCI hole into
 account.
 
 Perhaps KVM could simply hide memory above the limit (i.e. treat it as
 MMIO), and the BIOS could remove RAM above the limit from the e820
 memory map?

I'd prefer to leave the guest firmware*s* out of this... :)

E820 is a legacy BIOS concept. In OVMF we'd have to hack the memory
resource descriptor HOBs (which in turn control the DXE memory space
map, which in turn controls the UEFI memory map). Those HOBs are
currently based on what the CMOS reports about the RAM available under
and above 4GB.

It's pretty complex already (will get more complex with SMM support),
and TBH, for working around such an obscure issue, I wouldn't like to
complicate it even further...

After all, this is a host platform limitation. The solution should be to
either move to a more capable host, or do it in software (disable EPT).

Thanks!
Laszlo

 
 Paolo
 
 ---
  linux-headers/linux/kvm.h | 1 +
  target-i386/kvm.c | 6 ++
  2 files changed, 7 insertions(+)

 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
 index 3bac873..6afad49 100644
 --- a/linux-headers/linux/kvm.h
 +++ b/linux-headers/linux/kvm.h
 @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_DISABLE_QUIRKS 116
  #define KVM_CAP_X86_SMM 117
  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 +#define KVM_CAP_PHY_ADDR_WIDTH 119
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 066d03d..66e3448 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  uint64_t shadow_mem;
  int ret;
  struct utsname utsname;
 +int max_phys_bits;
  
  ret = kvm_get_supported_msrs(s);
  if (ret  0) {
 @@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  }
  }
  
 +max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
 +if (max_phys_bits  (1ULL  max_phys_bits) = ram_size)
 +fprintf(stderr, Warning: The amount of memory assigned to the 
 guest 
 +is more than that supported by the host CPU(s). Guest may be 
 unstable.\n);
 +
  if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
  smram_machine_done.notify = register_smram_listener;
  qemu_add_machine_init_done_notifier(smram_machine_done);


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Laszlo Ersek
On 07/09/15 11:27, Igor Mammedov wrote:
 On Thu, 09 Jul 2015 09:02:38 +0200
 Laszlo Ersek ler...@redhat.com wrote:
 
 On 07/09/15 00:42, Bandan Das wrote:

 If a Linux guest is assigned more memory than is supported
 by the host processor, the guest is unable to boot. That
 is expected, however, there's no message indicating the user
 what went wrong. This change prints a message to stderr if
 KVM has the corresponding capability.

 Reported-by: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Bandan Das b...@redhat.com
 ---
  linux-headers/linux/kvm.h | 1 +
  target-i386/kvm.c | 6 ++
  2 files changed, 7 insertions(+)

 diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
 index 3bac873..6afad49 100644
 --- a/linux-headers/linux/kvm.h
 +++ b/linux-headers/linux/kvm.h
 @@ -817,6 +817,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_DISABLE_QUIRKS 116
  #define KVM_CAP_X86_SMM 117
  #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 +#define KVM_CAP_PHY_ADDR_WIDTH 119
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 066d03d..66e3448 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -892,6 +892,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  uint64_t shadow_mem;
  int ret;
  struct utsname utsname;
 +int max_phys_bits;
  
  ret = kvm_get_supported_msrs(s);
  if (ret  0) {
 @@ -945,6 +946,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
  }
  }
  
 +max_phys_bits = kvm_check_extension(s, KVM_CAP_PHY_ADDR_WIDTH);
 +if (max_phys_bits  (1ULL  max_phys_bits) = ram_size)
 +fprintf(stderr, Warning: The amount of memory assigned to the 
 guest 
 +is more than that supported by the host CPU(s). Guest may be 
 unstable.\n);
 +
  if (kvm_check_extension(s, KVM_CAP_X86_SMM)) {
  smram_machine_done.notify = register_smram_listener;
  qemu_add_machine_init_done_notifier(smram_machine_done);


 First, see my comments on the KVM patch.

 Second, ram_size is not the right thing to compare. What should be
 checked is whether the highest guest-physical address that maps to RAM
 can be represented in the address width of the host processor (and only
 if EPT is enabled, but that sub-condition belongs to the KVM patch).

 Note that this is not the same as the check written in the patch. For
 example, if you assume a 32-bit PCI hole with size 1 GB, then a total
 guest RAM of size 63 GB will result in the highest guest-phys memory
 address being 0xF__, which just fits into 36 bits.

 Correspondingly, the above code would not print the warning for

   -m $((63 * 1024 + 1))

 on my laptop (which has address sizes   : 36 bits physical, ...), even
 though such a guest would not boot for me (with EPT enabled).

 Please see

 http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15418/focus=15447

 So, ram_size in the controlling expression should be replaced with
 maximum_guest_ram_address (which should be inclusive, and the = relop
 should be preserved).
 also with memory hotplug tuned on we should check if the end of
 hotplug memory area is less then limit, i.e.:
 
   pcms-hotplug_memory.base + hotplug_mem_size  1ULL  max_phys_bits

Seems reasonable, thanks for the hint!

(The LHS in this instance is exclusive though, so equality should *not*
trigger the warning. maximum_guest_ram_address is inclusive, and
equality should trigger the warning. (Although equality seems quite
impossible in practice.))

Thanks!
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] target-i386: Sanity check host processor physical address width

2015-07-09 Thread Laszlo Ersek
On 07/09/15 21:11, Bandan Das wrote:
 Laszlo Ersek ler...@redhat.com writes:
 ...

 First, see my comments on the KVM patch.

 Second, ram_size is not the right thing to compare. What should be
 checked is whether the highest guest-physical address that maps to RAM
 can be represented in the address width of the host processor (and only
 if EPT is enabled, but that sub-condition belongs to the KVM patch).

 Note that this is not the same as the check written in the patch. For
 example, if you assume a 32-bit PCI hole with size 1 GB, then a total
 guest RAM of size 63 GB will result in the highest guest-phys memory
 address being 0xF__, which just fits into 36 bits.

 Correspondingly, the above code would not print the warning for

   -m $((63 * 1024 + 1))

 on my laptop (which has address sizes   : 36 bits physical, ...), even
 though such a guest would not boot for me (with EPT enabled).

 Please see

 http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15418/focus=15447

 So, ram_size in the controlling expression should be replaced with
 maximum_guest_ram_address (which should be inclusive, and the = relop
 should be preserved).
 also with memory hotplug tuned on we should check if the end of
 hotplug memory area is less then limit, i.e.:

   pcms-hotplug_memory.base + hotplug_mem_size  1ULL  max_phys_bits

 Seems reasonable, thanks for the hint!
 
 Thanks Igor and Laszlo, makes sense. I am wondering if this 1GB PCI
 hole is always fixed so that I can simply include that in calculating the 
 maximum
 guest ram address ? Or do we have to figure that out every time ?

Please grep the tree for above_4g_mem_size. The size of the 32-bit PCI
hole is not constant, but all the necessary computation goes into
above_4g_mem_size already.

So I think you should derive the max possible gpa from
above_4g_mem_size and the top of the hotpluggable memory area, and
compare that against the PCPU address width, *if* EPT is enabled.

(BTW pcms-hotplug_memory.base depends on above_4g_mem_size too.)

Thanks
Laszlo

 
 (The LHS in this instance is exclusive though, so equality should *not*
 trigger the warning. maximum_guest_ram_address is inclusive, and
 equality should trigger the warning. (Although equality seems quite
 impossible in practice.))

 Thanks!
 Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: Add host physical address width capability

2015-07-09 Thread Laszlo Ersek
On 07/09/15 22:02, Bandan Das wrote:
 Laszlo Ersek ler...@redhat.com writes:
 ...
 Yes.

 Without EPT, you don't
 hit the processor limitation with your setup, but the user should 
 nevertheless
 still be notified.

 I disagree.

 In fact, I think shadow paging code should also emulate
 this behavior if the gpa is out of range.

 I disagree.

 There is no out of range gpa. QEMU allocates enough memory, and it
 should be completely transparent to the guest. The fact that it silently
 breaks with nested paging if the host processor doesn't have enough
 address bits is a bug (maybe a hardware bug, maybe a KVM bug; I'm not
 sure, but I suspect it's a hardware bug). In any case the guest
 shouldn't care at all. It is a *virtual* machine, and the VMM should lie
 to it plausibly enough. How much RAM, and how many phys address bits the
 host has, is a performance question, but it should not be a correctness
 question. A 256 GB guest should run (slowly, but correctly) on a laptop
 that has only 4 GB of RAM and only 36 phys addr bits, but plenty of swap
 space.

 Because otherwise your argument could be extrapolated as TCG should
 break too if the gpa is 'out of range'.

 So, I disagree. Whatever memory you give to the guest should just work
 (unless of course you want to emulate a small address width for the
 *VCPU*, but that's absolutely not the use case here). What we have here
 is a leaky abstraction: a PCPU limitation giving away a lie that the
 guest should never notice. The guest should be able to use all memory
 that was specified with QEMU's -m, regardless of TCG vs. KVM-without-EPT
 vs. KVM-with-EPT. If the last case cannot work (due to hardware
 limitations), that's fine, but then (and only then) a warning should be
 printed.
 
 Hmm... Ok, I understand your point. So, this is more like a EPT
 limitation/bug in that Qemu isn't complaining about the memory assigned
 to the guest but EPT code is breaking owing to the processor physical
 address width.

Exactly.

 And honestly, I now think that this patch just makes the whole
 situation more confusing :) I am wondering if it's just possible for kvm to
 simply throw an error like a EPT misconfiguration or something ..
 
 Or in other words, if using a hardware assisted mechanism is just not
 possible, KVM will simply not let it run instead of letting a guest
 stuck in boot.

That would be the best solution.

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: vgic: add virt-capable compatible strings

2015-03-05 Thread Laszlo Ersek
On 03/05/15 15:47, Mark Rutland wrote:
 Several dts only list arm,cortex-a7-gic or arm,gic-400 in their GIC
 compatible list, and while this is correct (and supported by the GIC
 driver), KVM will fail to detect that it can support these cases.
 
 This patch adds the missing strings to the VGIC code. The of_device_id
 entries are padded to keep the probe fucntion data aligned.

Side note: there's a typo in function.

Laszlo


 
 Signed-off-by: Mark Rutland mark.rutl...@arm.com
 Cc: Andre Przywara andre.przyw...@arm.com
 Cc: Christoffer Dall christoffer.d...@linaro.org
 Cc: Marc Zyngier marc.zyng...@arm.com
 Cc: Michal Simek mon...@monstr.eu
 ---
  virt/kvm/arm/vgic.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)
 
 diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
 index 0cc6ab6..86cec79 100644
 --- a/virt/kvm/arm/vgic.c
 +++ b/virt/kvm/arm/vgic.c
 @@ -1865,8 +1865,10 @@ static struct notifier_block vgic_cpu_nb = {
  };
  
  static const struct of_device_id vgic_ids[] = {
 - { .compatible = arm,cortex-a15-gic, .data = vgic_v2_probe, },
 - { .compatible = arm,gic-v3, .data = vgic_v3_probe, },
 + { .compatible = arm,cortex-a15-gic,   .data = vgic_v2_probe, },
 + { .compatible = arm,cortex-a7-gic,.data = vgic_v2_probe, },
 + { .compatible = arm,gic-400,  .data = vgic_v2_probe, },
 + { .compatible = arm,gic-v3,   .data = vgic_v3_probe, },
   {},
  };
  
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings

2015-03-03 Thread Laszlo Ersek
On 03/03/15 18:34, Alexander Graf wrote:
 On 02/19/2015 11:54 AM, Ard Biesheuvel wrote:
 This is a 0th order approximation of how we could potentially force
 the guest
 to avoid uncached mappings, at least from the moment the MMU is on.
 (Before
 that, all of memory is implicitly classified as Device-nGnRnE)

 The idea (patch #2) is to trap writes to MAIR_EL1, and replace
 uncached mappings
 with cached ones. This way, there is no need to mangle any guest page
 tables.

 The downside is that, to do this correctly, we need to always trap
 writes to
 the VM sysreg group, which includes registers that the guest may write
 to very
 often. To reduce the associated performance hit, patch #1 introduces a
 fast path
 for EL2 to perform trivial sysreg writes on behalf of the guest,
 without the
 need for a full world switch to the host and back.

 The main purpose of these patches is to quantify the performance hit, and
 verify whether the MAIR_EL1 handling works correctly.
 
 I gave this a quick spin on a VM running with QEMU.
 
   * VGA output is still distorted, I get random junk black lines in the
 output in between
   * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even
 boot up

Do you also have the dirty page tracking patches in your host kernel? I
needed both (and got them via Drew's backport, thanks) and then both VGA
and USB started working fine.

Without the MAIR patches, I got cache-line size random corruptions in
the VGA display (16 pixel wide small segments). Without dirty page
tracking, big chunks (sometimes even almost the entire screen) was blank.

Regarding USB, unless you have both of the patchsets in the host kernel,
the guest will indeed crash early during boot. Gerd confirmed for me
that usb controller (all uhci/ehci/xhci) pci regions see both read
(status bits) and write (control bits) access. So if there's any
corruption in there, on read, that looks like a malfunctioning piece of
hw for the guest kernel, and in this case it happens to crash.

 With TCG, both bits work fine.

Yep.

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings

2015-03-02 Thread Laszlo Ersek
On 03/02/15 17:47, Paolo Bonzini wrote:
 
 Also, we may want to invalidate the cache for dirty pages before
 returning the dirty bitmap, and probably should do that directly in
 KVM_GET_DIRTY_LOG.

I agree.

If KVM_GET_DIRTY_LOG is supposed to be atomic fetch and clear (from
userspace's aspect), then the cache invalidation should be an atomic
part of it too (from the same aspect).

(Sorry if I just said something incredibly stupid.)

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] arm, arm64: KVM: handle potential incoherency of readonly memslots

2014-11-22 Thread Laszlo Ersek
On 11/22/14 11:18, Christoffer Dall wrote:
 On Fri, Nov 21, 2014 at 05:50:43PM -0800, Mario Smarduch wrote:

 But virtio writes to guest memory directly and that appears to
 work just fine. I read that code sometime back, and will need to revisit.

 In any case, that's a QEMU implementation issue and nothing the kernel
 needs to be concerned with.

(Let's call it qemu implementation detail -- there's no issue. The
reason virtio works is not by incident, it has a known cause. As
confirmed by Peter, there's no problem with the virtio buffers because
the guest doesn't mark them as uncacheable, so *all* accesses go through
the dcache.)

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] arm, arm64: KVM: handle potential incoherency of readonly memslots

2014-11-20 Thread Laszlo Ersek
On 11/20/14 00:32, Mario Smarduch wrote:
 Hi Laszlo,
 
 couple observations.
 
  I'm wondering if access from qemu and guest won't
 result in mixed memory attributes and if that's acceptable
 to the CPU.

Normally this would be a problem I think (Jon raised the topic of live
migration). However, for flash programming specifically, I think the
guest's access pattern ensures that we'll see things OK.

When the guest issues the first write access, the memslot is deleted,
and everything is forwarded to qemu, both reads and writes. In response
qemu modifies the array that *otherwise* backs the flash. These
modifications by qemu end up in the dcache mostly. When the guest is
done programming, it writes a special command (read array mode) at
which point the memslot is recreated (as read-only) and flushed / set up
for flushing during demand paging.

So from the emulated flash POV, the memslot either doesn't exist at all
(and then qemu serves all accesses just fine), or it exists r/o, at
which point qemu (host userspace) will have stopped writing to it, and
will have set it up for flushing before and during guest read accesses.

 Also is if you update memory from qemu you may break
 dirty page logging/migration.

Very probably. Jon said the same thing.

 Unless there is some other way
 you keep track. Of course it may not be applicable in your
 case (i.e. flash unused after boot).

The flash *is* used after boot, because the UEFI runtime variable
services *are* exercised by the guest kernel. However those use the same
access pattern (it's the same set of UEFI services just called by a
different client).

*Uncoordinated* access from guest and host in parallel will be a big
problem; but we're not that far yet, and we need to get the flash
problem sorted, so that we can at least boot and work on the basic
stuff. The flash programming dance happens to provide coordination; the
flash mode changes (which are equivalent to the teardown and the
recreation of the memslot) can be considered barriers.

I hope this is acceptable for the time being...

Thanks
Laszlo

 
 - Mario
 
 On 11/17/2014 07:49 AM, Laszlo Ersek wrote:
 On 11/17/14 16:29, Paolo Bonzini wrote:


 On 17/11/2014 15:58, Ard Biesheuvel wrote:
 Readonly memslots are often used to implement emulation of ROMs and
 NOR flashes, in which case the guest may legally map these regions as
 uncached.
 To deal with the incoherency associated with uncached guest mappings,
 treat all readonly memslots as incoherent, and ensure that pages that
 belong to regions tagged as such are flushed to DRAM before being passed
 to the guest.

 On x86, the processor combines the cacheability values from the two
 levels of page tables.  Is there no way to do the same on ARM?

 Combining occurs on ARMv8 too. The Stage1 (guest) mapping is very strict
 (Device non-Gathering, non-Reordering, no Early Write Acknowledgement --
 for EFI_MEMORY_UC), which basically overrides the Stage2 (very lax
 host) memory attributes.

 When qemu writes, as part of emulating the flash programming commands,
 to the RAMBlock that *otherwise* backs the flash range (as a r/o
 memslot), those writes (from host userspace) tend to end up in dcache.

 But, when the guest flips back the flash to romd mode, and tries to read
 back the values from the flash as plain ROM, the dcache is completely
 bypassed due to the strict stage1 mapping, and the guest goes directly
 to DRAM.

 Where qemu's earlier writes are not yet / necessarily visible.

 Please see my original patch (which was incomplete) in the attachment,
 it has a very verbose commit message.

 Anyway, I'll let others explain; they can word it better than I can :)

 FWIW,

 Series
 Reviewed-by: Laszlo Ersek ler...@redhat.com

 I ported this series to a 3.17.0+ based kernel, and tested it. It works
 fine. The ROM-like view of the NOR flash now reflects the previously
 programmed contents.

 Series
 Tested-by: Laszlo Ersek ler...@redhat.com

 Thanks!
 Laszlo



 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] arm, arm64: KVM: handle potential incoherency of readonly memslots

2014-11-20 Thread Laszlo Ersek
On 11/20/14 21:10, Peter Maydell wrote:
 On 20 November 2014 19:49, Jon Masters j...@redhat.com wrote:
 On 11/20/2014 01:40 PM, Peter Maydell wrote:
 The situation is not so bleak as this. See section B2.9 Mismatched
 memory attributes in the ARMv8 ARM ARM (DDI0487A.d), which lays
 out in some detail what the results of mismatched attributes are
 (generally, you lose ordering or coherency guarantees you might
 have hoped to have). They're not pretty, but it's not as bad
 as completely UNPREDICTABLE behaviour.

 Quick side note that I did raise exactly this issue with the ARM
 Architecture team several years ago (that of missmatched memory
 attributes between a guest and hypervisor) and it is a known concern.
 
 I think in practice for a well-behaved guest we can arrange
 that everything is fine (roughly, the guest has to treat
 DMA-capable devices as doing coherent-dma, which we can tell
 them to do via DT bindings or ACPI[*], plus the special
 case handling we already have for bootup), and naughty guests
 will only confuse themselves. But I need to think a bit more
 about it (and we should probably write down how it works
 somewhere :-)).
 
 [*] We should be able to emulate non-coherent-DMA devices but
 would need an extra API from KVM so userspace can do clean
 dcache to point of coherency. And in practice I'm not sure
 we need to emulate those devices...

This basically means that virtio transfers should just use normal memory
in the guest (treating virtio transfers as coherent DMA), right?

We're actually doing that in the ArmVirtualizationQemu.dsc build of
edk2, and Things Work Great (TM) in guests on the Mustang.

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] arm, arm64: KVM: handle potential incoherency of readonly memslots

2014-11-17 Thread Laszlo Ersek
On 11/17/14 16:29, Paolo Bonzini wrote:
 
 
 On 17/11/2014 15:58, Ard Biesheuvel wrote:
 Readonly memslots are often used to implement emulation of ROMs and
 NOR flashes, in which case the guest may legally map these regions as
 uncached.
 To deal with the incoherency associated with uncached guest mappings,
 treat all readonly memslots as incoherent, and ensure that pages that
 belong to regions tagged as such are flushed to DRAM before being passed
 to the guest.
 
 On x86, the processor combines the cacheability values from the two
 levels of page tables.  Is there no way to do the same on ARM?

Combining occurs on ARMv8 too. The Stage1 (guest) mapping is very strict
(Device non-Gathering, non-Reordering, no Early Write Acknowledgement --
for EFI_MEMORY_UC), which basically overrides the Stage2 (very lax
host) memory attributes.

When qemu writes, as part of emulating the flash programming commands,
to the RAMBlock that *otherwise* backs the flash range (as a r/o
memslot), those writes (from host userspace) tend to end up in dcache.

But, when the guest flips back the flash to romd mode, and tries to read
back the values from the flash as plain ROM, the dcache is completely
bypassed due to the strict stage1 mapping, and the guest goes directly
to DRAM.

Where qemu's earlier writes are not yet / necessarily visible.

Please see my original patch (which was incomplete) in the attachment,
it has a very verbose commit message.

Anyway, I'll let others explain; they can word it better than I can :)

FWIW,

Series
Reviewed-by: Laszlo Ersek ler...@redhat.com

I ported this series to a 3.17.0+ based kernel, and tested it. It works
fine. The ROM-like view of the NOR flash now reflects the previously
programmed contents.

Series
Tested-by: Laszlo Ersek ler...@redhat.com

Thanks!
Laszlo
From a2b4da9b03f03ccdb8b0988a5cc64d1967f00398 Mon Sep 17 00:00:00 2001
From: Laszlo Ersek ler...@redhat.com
Date: Sun, 16 Nov 2014 01:43:11 +0100
Subject: [PATCH] arm, arm64: KVM: clean cache on page fault also when IPA is
 uncached (WIP)

This patch builds on Marc Zyngier's commit
2d58b733c87689d3d5144e4ac94ea861cc729145.

(1) The guest bypasses the cache *not only* when the VCPU's dcache is
disabled (see bit 0 and bit 2 in SCTLR_EL1, MMU enable and Cache
enable, respectively -- vcpu_has_cache_enabled()).

The guest bypasses the cache *also* when the Stage 1 memory attributes
say device memory about the Intermediate Page Address in question,
independently of the Stage 2 memory attributes. Refer to:

  Table D5-38
  Combining the stage 1 and stage 2 memory type assignments

in the ARM ARM. (This is likely similar to MTRRs on x86.)

(2) In edk2 (EFI Development Kit II), the ARM NOR flash driver,

  ArmPlatformPkg/Drivers/NorFlashDxe/NorFlashFvbDxe.c

uses the AddMemorySpace() and SetMemorySpaceAttributes() Global
Coherency Domain Services of DXE (Driver eXecution Environment) to
*justifiedly* set the attributes of the guest memory covering the
flash chip to EFI_MEMORY_UC (uncached).

According to the AArch64 bindings for UEFI (see 2.3.6.1 Memory types
in the UEFI-2.4A specification), EFI_MEMORY_UC is mapped to:

   ARM Memory Type:
   MAIR attribute encodingARM Memory Type:
EFI Memory TypeAttrn [7:4] [3:0]Meaning
------
EFI_MEMORY_UC     Device-nGnRnE (Device
(Not cacheable)   non-Gathering,
  non-Reordering, no Early
  Write Acknowledgement)

This is correctly implemented in edk2, in the ArmConfigureMmu()
function, via the ArmSetMAIR() call and the MAIR_ATTR() macro:

The TT_ATTR_INDX_DEVICE_MEMORY (== 0) memory attribute index, which is
used for EFI_MEMORY_UC memory, is associated with the
MAIR_ATTR_DEVICE_MEMORY (== 0x00, see above) memory attribute value,
in the MAIR_ELx register.

As a consequence of (1) and (2), when edk2 code running in the guest
accesses an IPA falling in the flash range, it will completely bypass
the cache.

Therefore, when such a page is faulted in in user_mem_abort(), we must
flush the data cache; otherwise the guest will see stale data in the flash
chip.

This patch is not complete because I have no clue how to calculate the
memory attribute for fault_ipa in user_mem_abort(). Right now I set
fault_ipa_uncached to constant true, which might incur some performance
penalty for data faults, but it certainly improves correctness -- the
ArmVirtualizationPkg platform build of edk2 actually boots as a KVM guest
on APM Mustang.

Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 arch/arm/include/asm/kvm_mmu.h   |  5 +++--
 arch/arm64/include/asm/kvm_mmu.h |  5 +++--
 arch/arm/kvm/mmu.c   | 10 --
 3 files changed, 14 insertions

[INVITE] OVMF BoF session at the KVM Forum 2014

2014-09-18 Thread Laszlo Ersek
Hello All,

I've been made an offer that I couldn't refuse :) to organize a Birds
of a Feather session concerning OVMF at the KVM Forum 2014.

Interested people, please sign up:

  http://www.linux-kvm.org/page/KVM_Forum_2014_BOF#OVMF

Everyone else: apologies about the noise.

Thanks,
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [INVITE] OVMF BoF session at the KVM Forum 2014

2014-09-18 Thread Laszlo Ersek
On 09/18/14 13:44, Andreas Färber wrote:
 Hello Laszlo,
 
 Am 18.09.2014 um 10:23 schrieb Laszlo Ersek:
 I've been made an offer that I couldn't refuse :) to organize a Birds
 of a Feather session concerning OVMF at the KVM Forum 2014.

 Interested people, please sign up:

   http://www.linux-kvm.org/page/KVM_Forum_2014_BOF#OVMF
 
 Nice idea. Your summary mentions only ia32 and x86_64 - I would be
 interested in an update on OVMF for AArch64 - there seemed to already be
 support for ARM's Foundation Model but not yet for QEMU.

We've successfully UEFI-booted
- GNU/Linux guest(s) on
- upstream edk2 (*) and
- upstream qemu-system-aarch64 with
  - TCG on x86_64 host,
  - KVM on aarch64 host (**)

(*) Ard's patches for upstream edk2 are in the process of being tested /
merged.

(**) Ard's patches for the upstream host kernel (== KVM) have been...
ugh, not sure... applied to a maintainer tree? Ard? :)

So, it works (as far as I tested it myself on TCG, and heard reports
about it on KVM), but right now you need to apply a handful of pending
patches manually.

We can certainly talk about Aarch64 at the BoF, but then Ard should
co-organize. No good deed goes unpunished, as ever! :)

Cheers,
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ARM/arm64: KVM: fix use of WnR bit in kvm_is_write_fault()

2014-09-08 Thread Laszlo Ersek
On 09/08/14 22:29, Ard Biesheuvel wrote:
 The ISS encoding for an exception from a Data Abort has a WnR
 bit[6] that indicates whether the Data Abort was caused by a
 read or a write instruction. While there are several fields
 in the encoding that are only valid if the ISV bit[24] is set,
 WnR is not one of them, so we can read it unconditionally.
 
 Signed-off-by: Ard Biesheuvel ard.biesheu...@linaro.org
 ---
 
 This fixes an issue I observed with UEFI running under QEMU/KVM using
 NOR flash emulation and the upcoming KVM_CAP_READONLY_MEM support, where
 NOR flash reads were mistaken for NOR flash writes, resulting in all read
 accesses to go through the MMIO emulation layer.
 
  arch/arm/include/asm/kvm_mmu.h   | 5 +
  arch/arm64/include/asm/kvm_mmu.h | 5 +
  2 files changed, 2 insertions(+), 8 deletions(-)
 
 diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
 index 5cc0b0f5f72f..fad5648980ad 100644
 --- a/arch/arm/include/asm/kvm_mmu.h
 +++ b/arch/arm/include/asm/kvm_mmu.h
 @@ -83,10 +83,7 @@ static inline bool kvm_is_write_fault(unsigned long hsr)
   unsigned long hsr_ec = hsr  HSR_EC_SHIFT;
   if (hsr_ec == HSR_EC_IABT)
   return false;
 - else if ((hsr  HSR_ISV)  !(hsr  HSR_WNR))
 - return false;
 - else
 - return true;
 + return hsr  HSR_WNR;
  }
  
  static inline void kvm_clean_pgd(pgd_t *pgd)
 diff --git a/arch/arm64/include/asm/kvm_mmu.h 
 b/arch/arm64/include/asm/kvm_mmu.h
 index 8e138c7c53ac..09fd9e4c13d8 100644
 --- a/arch/arm64/include/asm/kvm_mmu.h
 +++ b/arch/arm64/include/asm/kvm_mmu.h
 @@ -100,10 +100,7 @@ static inline bool kvm_is_write_fault(unsigned long esr)
   if (esr_ec == ESR_EL2_EC_IABT)
   return false;
  
 - if ((esr  ESR_EL2_ISV)  !(esr  ESR_EL2_WNR))
 - return false;
 -
 - return true;
 + return esr  ESR_EL2_WNR;
  }
  
  static inline void kvm_clean_pgd(pgd_t *pgd) {}
 

No clue about the 32-bit case, but the 64-bit change checks out against
the ARM ARM. The ISV (ISS[24]) is documented to determine the validity
of bits ISS[23:14], but WnR is ISS[6], hence it appears independent indeed.

The pre-patch code only considered a clear WnR meaningful /
consequential only if the ISV was set -- more precisely, it only even
looked at the WnR then, due to the short-circuit nature of  --, and it
defaulted to write fault. Synchronous data aborts due to funky
register writeback instructions don't set the ISV, hence the code used
to turn its back on the clear WnR. (Apologies for explaining it to
myself publicly.) We now ignore the ISV and key off the WnR only.

You're awesome, Ard. (And now you can drop a few patches from your
linaro-topic-virt-post-v7-roundup branch! :))

Acked-by: Laszlo Ersek ler...@redhat.com

Cheers
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/3] x86: Use common variable range MTRR counts

2014-08-14 Thread Laszlo Ersek
On 08/14/14 21:24, Alex Williamson wrote:
 We currently define the number of variable range MTRR registers as 8
 in the CPUX86State structure and vmstate, but use MSR_MTRRcap_VCNT
 (also 8) to report to guests the number available.  Change this to
 use MSR_MTRRcap_VCNT consistently.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Laszlo Ersek ler...@redhat.com
 Cc: qemu-sta...@nongnu.org
 ---
 
  target-i386/cpu.h |2 +-
  target-i386/machine.c |2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index e634d83..d37d857 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -930,7 +930,7 @@ typedef struct CPUX86State {
  /* MTRRs */
  uint64_t mtrr_fixed[11];
  uint64_t mtrr_deftype;
 -MTRRVar mtrr_var[8];
 +MTRRVar mtrr_var[MSR_MTRRcap_VCNT];
  
  /* For KVM */
  uint32_t mp_state;
 diff --git a/target-i386/machine.c b/target-i386/machine.c
 index 16d2f6a..fb89065 100644
 --- a/target-i386/machine.c
 +++ b/target-i386/machine.c
 @@ -677,7 +677,7 @@ VMStateDescription vmstate_x86_cpu = {
  /* MTRRs */
  VMSTATE_UINT64_ARRAY_V(env.mtrr_fixed, X86CPU, 11, 8),
  VMSTATE_UINT64_V(env.mtrr_deftype, X86CPU, 8),
 -VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, 8, 8),
 +VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, MSR_MTRRcap_VCNT, 8),
  /* KVM-related states */
  VMSTATE_INT32_V(env.interrupt_injected, X86CPU, 9),
  VMSTATE_UINT32_V(env.mp_state, X86CPU, 9),
 

Reviewed-by: Laszlo Ersek ler...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] x86: kvm: Add MTRR support for kvm_get|put_msrs()

2014-08-14 Thread Laszlo Ersek
You're going to use my name in contexts that I won't wish to be privy
to. :) I like everything about this patch except:

On 08/14/14 21:24, Alex Williamson wrote:
 The MTRR state in KVM currently runs completely independent of the
 QEMU state in CPUX86State.mtrr_*.  This means that on migration, the
 target loses MTRR state from the source.  Generally that's ok though
 because KVM ignores it and maps everything as write-back anyway.  The
 exception to this rule is when we have an assigned device and an IOMMU
 that doesn't promote NoSnoop transactions from that device to be cache
 coherent.  In that case KVM trusts the guest mapping of memory as
 configured in the MTRR.
 
 This patch updates kvm_get|put_msrs() so that we retrieve the actual
 vCPU MTRR settings and therefore keep CPUX86State synchronized for
 migration.  kvm_put_msrs() is also used on vCPU reset and therefore
 allows future modificaitons of MTRR state at reset to be realized.
 
 Note that the entries array used by both functions was already
 slightly undersized for holding every possible MSR, so this patch
 increases it beyond the 28 new entries necessary for MTRR state.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Laszlo Ersek ler...@redhat.com
 Cc: qemu-sta...@nongnu.org
 ---
 
  target-i386/cpu.h |2 +
  target-i386/kvm.c |  101 
 -
  2 files changed, 101 insertions(+), 2 deletions(-)
 
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index d37d857..3460b12 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -337,6 +337,8 @@
  #define MSR_MTRRphysBase(reg)   (0x200 + 2 * (reg))
  #define MSR_MTRRphysMask(reg)   (0x200 + 2 * (reg) + 1)
  
 +#define MSR_MTRRphysIndex(addr) addr)  ~1u) - 0x200) / 2)
 +
  #define MSR_MTRRfix64K_00x250
  #define MSR_MTRRfix16K_80x258
  #define MSR_MTRRfix16K_A0x259
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 097fe11..3c46d4a 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -79,6 +79,7 @@ static int lm_capable_kernel;
  static bool has_msr_hv_hypercall;
  static bool has_msr_hv_vapic;
  static bool has_msr_hv_tsc;
 +static bool has_msr_mtrr;
  
  static bool has_msr_architectural_pmu;
  static uint32_t num_architectural_pmu_counters;
 @@ -739,6 +740,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
  env-kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
  }
  
 +if (env-features[FEAT_1_EDX]  CPUID_MTRR) {
 +has_msr_mtrr = true;
 +}
 +
  return 0;
  }
  
 @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  CPUX86State *env = cpu-env;
  struct {
  struct kvm_msrs info;
 -struct kvm_msr_entry entries[100];
 +struct kvm_msr_entry entries[150];
  } msr_data;
  struct kvm_msr_entry *msrs = msr_data.entries;
  int n = 0, i;
 @@ -1278,6 +1283,37 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  kvm_msr_entry_set(msrs[n++], HV_X64_MSR_REFERENCE_TSC,
env-msr_hv_tsc);
  }
 +if (has_msr_mtrr) {
 +kvm_msr_entry_set(msrs[n++], MSR_MTRRdefType, 
 env-mtrr_deftype);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix64K_0, env-mtrr_fixed[0]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix16K_8, env-mtrr_fixed[1]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix16K_A, env-mtrr_fixed[2]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_C, env-mtrr_fixed[3]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_C8000, env-mtrr_fixed[4]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_D, env-mtrr_fixed[5]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_D8000, env-mtrr_fixed[6]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_E, env-mtrr_fixed[7]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_E8000, env-mtrr_fixed[8]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_F, env-mtrr_fixed[9]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_F8000, env-mtrr_fixed[10]);
 +for (i = 0; i  MSR_MTRRcap_VCNT; i++) {
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRphysBase(i), 
 env-mtrr_var[i].base);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRphysMask(i), 
 env-mtrr_var[i].mask);
 +}
 +}
  
  /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see

Re: [PATCH v2 3/3] x86: Clear MTRRs on vCPU reset

2014-08-14 Thread Laszlo Ersek
On 08/14/14 21:24, Alex Williamson wrote:
 The SDM specifies (June 2014 Vol3 11.11.5):
 
 On a hardware reset, the P6 and more recent processors clear the
 valid flags in variable-range MTRRs and clear the E flag in the
 IA32_MTRR_DEF_TYPE MSR to disable all MTRRs. All other bits in the
 MTRRs are undefined.
 
 We currently do none of that, so whatever MTRR settings you had prior
 to reset is what you have after reset.  Usually this doesn't matter
 because KVM often ignores the guest mappings and uses write-back
 anyway.  However, if you have an assigned device and an IOMMU that
 allows NoSnoop for that device, KVM defers to the guest memory
 mappings which are now stale after reset.  The result is that OVMF
 rebooting on such a configuration takes a full minute to LZMA
 decompress the firmware volume, a process that is nearly instant on
 the initial boot.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Laszlo Ersek ler...@redhat.com
 Cc: qemu-sta...@nongnu.org
 ---
 
  target-i386/cpu.c |   10 ++
  1 file changed, 10 insertions(+)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 6d008ab..9768be1 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -2588,6 +2588,16 @@ static void x86_cpu_reset(CPUState *s)
  
  env-xcr0 = 1;
  
 +/*
 + * SDM 11.11.5 requires:
 + *  - IA32_MTRR_DEF_TYPE MSR.E = 0
 + *  - IA32_MTRR_PHYSMASKn.V = 0
 + * All other bits are undefined.  For simplification, zero it all.
 + */
 +env-mtrr_deftype = 0;
 +memset(env-mtrr_var, 0, sizeof(env-mtrr_var));
 +memset(env-mtrr_fixed, 0, sizeof(env-mtrr_fixed));
 +
  #if !defined(CONFIG_USER_ONLY)
  /* We hard-wire the BSP to the first CPU. */
  if (s-cpu_index == 0) {
 

I like this heavy-handed approach.

Reviewed-by: Laszlo Ersek ler...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] x86: kvm: Add MTRR support for kvm_get|put_msrs()

2014-08-14 Thread Laszlo Ersek
On 08/14/14 21:24, Alex Williamson wrote:
 The MTRR state in KVM currently runs completely independent of the
 QEMU state in CPUX86State.mtrr_*.  This means that on migration, the
 target loses MTRR state from the source.  Generally that's ok though
 because KVM ignores it and maps everything as write-back anyway.  The
 exception to this rule is when we have an assigned device and an IOMMU
 that doesn't promote NoSnoop transactions from that device to be cache
 coherent.  In that case KVM trusts the guest mapping of memory as
 configured in the MTRR.
 
 This patch updates kvm_get|put_msrs() so that we retrieve the actual
 vCPU MTRR settings and therefore keep CPUX86State synchronized for
 migration.  kvm_put_msrs() is also used on vCPU reset and therefore
 allows future modificaitons of MTRR state at reset to be realized.
 
 Note that the entries array used by both functions was already
 slightly undersized for holding every possible MSR, so this patch
 increases it beyond the 28 new entries necessary for MTRR state.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Laszlo Ersek ler...@redhat.com
 Cc: qemu-sta...@nongnu.org
 ---
 
  target-i386/cpu.h |2 +
  target-i386/kvm.c |  101 
 -
  2 files changed, 101 insertions(+), 2 deletions(-)

Another (positive) remark I wanted to add: if we migrate from an
MTRR-capable KVM host that lacks these patches, to an MTRR-capable KVM
host that has these patches, then the migration stream will simply
contain zeros (because the patch-less source never fetched those from
the source-side KVM), so when we send those zeros to the target KVM, we
won't regress (because those zeroes should match the initial KVM MTRR
state that the target comes up in anyway).

If we migrate from patchful to patchless (ie. reverse direction), then
we lose MTRR state, which is the current status quo; not bad.

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] x86: kvm: Add MTRR support for kvm_get|put_msrs()

2014-08-14 Thread Laszlo Ersek
On 08/14/14 23:39, Alex Williamson wrote:
 The MTRR state in KVM currently runs completely independent of the
 QEMU state in CPUX86State.mtrr_*.  This means that on migration, the
 target loses MTRR state from the source.  Generally that's ok though
 because KVM ignores it and maps everything as write-back anyway.  The
 exception to this rule is when we have an assigned device and an IOMMU
 that doesn't promote NoSnoop transactions from that device to be cache
 coherent.  In that case KVM trusts the guest mapping of memory as
 configured in the MTRR.
 
 This patch updates kvm_get|put_msrs() so that we retrieve the actual
 vCPU MTRR settings and therefore keep CPUX86State synchronized for
 migration.  kvm_put_msrs() is also used on vCPU reset and therefore
 allows future modificaitons of MTRR state at reset to be realized.
 
 Note that the entries array used by both functions was already
 slightly undersized for holding every possible MSR, so this patch
 increases it beyond the 28 new entries necessary for MTRR state.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Laszlo Ersek ler...@redhat.com
 Cc: qemu-sta...@nongnu.org
 ---
 
  target-i386/cpu.h |2 +
  target-i386/kvm.c |  101 
 -
  2 files changed, 101 insertions(+), 2 deletions(-)
 
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index d37d857..3460b12 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -337,6 +337,8 @@
  #define MSR_MTRRphysBase(reg)   (0x200 + 2 * (reg))
  #define MSR_MTRRphysMask(reg)   (0x200 + 2 * (reg) + 1)
  
 +#define MSR_MTRRphysIndex(addr) addr)  ~1u) - 0x200) / 2)
 +
  #define MSR_MTRRfix64K_00x250
  #define MSR_MTRRfix16K_80x258
  #define MSR_MTRRfix16K_A0x259
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 097fe11..ddedc73 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -79,6 +79,7 @@ static int lm_capable_kernel;
  static bool has_msr_hv_hypercall;
  static bool has_msr_hv_vapic;
  static bool has_msr_hv_tsc;
 +static bool has_msr_mtrr;
  
  static bool has_msr_architectural_pmu;
  static uint32_t num_architectural_pmu_counters;
 @@ -739,6 +740,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
  env-kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
  }
  
 +if (env-features[FEAT_1_EDX]  CPUID_MTRR) {
 +has_msr_mtrr = true;
 +}
 +
  return 0;
  }
  
 @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  CPUX86State *env = cpu-env;
  struct {
  struct kvm_msrs info;
 -struct kvm_msr_entry entries[100];
 +struct kvm_msr_entry entries[150];
  } msr_data;
  struct kvm_msr_entry *msrs = msr_data.entries;
  int n = 0, i;
 @@ -1278,6 +1283,37 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  kvm_msr_entry_set(msrs[n++], HV_X64_MSR_REFERENCE_TSC,
env-msr_hv_tsc);
  }
 +if (has_msr_mtrr) {
 +kvm_msr_entry_set(msrs[n++], MSR_MTRRdefType, 
 env-mtrr_deftype);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix64K_0, env-mtrr_fixed[0]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix16K_8, env-mtrr_fixed[1]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix16K_A, env-mtrr_fixed[2]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_C, env-mtrr_fixed[3]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_C8000, env-mtrr_fixed[4]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_D, env-mtrr_fixed[5]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_D8000, env-mtrr_fixed[6]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_E, env-mtrr_fixed[7]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_E8000, env-mtrr_fixed[8]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_F, env-mtrr_fixed[9]);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRfix4K_F8000, env-mtrr_fixed[10]);
 +for (i = 0; i  MSR_MTRRcap_VCNT; i++) {
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRphysBase(i), 
 env-mtrr_var[i].base);
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRphysMask(i), 
 env-mtrr_var[i].mask);
 +}
 +}
  
  /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
   *   kvm_put_msr_feature_control. */
 @@ -1484,7 +1520,7 @@ static int kvm_get_msrs(X86CPU *cpu)
  CPUX86State *env = cpu-env;
  struct

Re: [PATCH] x86: Reset MTRR on vCPU reset

2014-08-13 Thread Laszlo Ersek
a number of comments -- feel free to address or ignore each as you see fit:

On 08/13/14 21:09, Alex Williamson wrote:
 The SDM specifies (June 2014 Vol3 11.11.5):
 
 On a hardware reset, the P6 and more recent processors clear the
 valid flags in variable-range MTRRs and clear the E flag in the
 IA32_MTRR_DEF_TYPE MSR to disable all MTRRs. All other bits in the
 MTRRs are undefined.
 
 We currently do none of that, so whatever MTRR settings you had prior
 to reset is what you have after reset.  Usually this doesn't matter
 because KVM often ignores the guest mappings and uses write-back
 anyway.  However, if you have an assigned device and an IOMMU that
 allows NoSnoop for that device, KVM defers to the guest memory
 mappings which are now stale after reset.  The result is that OVMF
 rebooting on such a configuration takes a full minute to LZMA
 decompress the EFI volume, a process that is nearly instant on the

For pedantry, instead of EFI volume we could say LZMA-compressed
Firmware File System file in the FVMAIN_COMPACT firmware volume.

 initial boot.
 
 Add support for reseting the SDM defined bits on vCPU reset.
 
 Also, by my count we're already in danger of overflowing the entries
 array that we pass to KVM, so I've topped it up for a bit of headroom.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: qemu-sta...@nongnu.org
 ---
 
  target-i386/cpu.c |6 ++
  target-i386/cpu.h |4 
  target-i386/kvm.c |   14 +-
  3 files changed, 23 insertions(+), 1 deletion(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 6d008ab..b5ae654 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -2588,6 +2588,12 @@ static void x86_cpu_reset(CPUState *s)
  
  env-xcr0 = 1;
  
 +/* MTRR init - Clear global enable bit and valid bit in each variable 
 reg */
 +env-mtrr_deftype = ~MSR_MTRRdefType_Enable;
 +for (i = 0; i  MSR_MTRRcap_VCNT; i++) {
 +env-mtrr_var[i].mask = ~MSR_MTRRphysMask_Valid;
 +}
 +

I can see that the limit, MSR_MTRRcap_VCNT, is #defined as 8. Would you
be willing to update the definition of the CPUX86State.mtrr_var array
too, in target-i386/cpu.h? Currently it says:

MTRRVar mtrr_var[8];

  #if !defined(CONFIG_USER_ONLY)
  /* We hard-wire the BSP to the first CPU. */
  if (s-cpu_index == 0) {
 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index e634d83..139890f 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -337,6 +337,8 @@
  #define MSR_MTRRphysBase(reg)   (0x200 + 2 * (reg))
  #define MSR_MTRRphysMask(reg)   (0x200 + 2 * (reg) + 1)
  
 +#define MSR_MTRRphysMask_Valid (1  11)
 +

Note: a signed integer (int32_t).

  #define MSR_MTRRfix64K_00x250
  #define MSR_MTRRfix16K_80x258
  #define MSR_MTRRfix16K_A0x259
 @@ -353,6 +355,8 @@
  
  #define MSR_MTRRdefType 0x2ff
  
 +#define MSR_MTRRdefType_Enable (1  11)
 +

Note: a signed integer (int32_t).

Now, if you scroll back to the bit-clearing in x86_cpu_reset(), you see

  ~MSR_MTRRdefType_Enable

and

 ~MSR_MTRRphysMask_Valid

These expressions evaluate to negative int (int32_t) values (because the
bit-neg sets their sign bits).

Due to two's complement (which we are allowed to assume in qemu, see
HACKING), the negative int32_t values will be just correct for the next
step, when they are converted to uint64_t for the bit-ands, as part of
the usual arithmetic conversions. (env-mtrr_deftype and
env-mtrr_var[i].mask are uint64_t.) Mathematically this means an
addition of UINT64_MAX+1. (Sign extended.)

In general, even though they are correct due to two's complement, I
dislike such detours into negative-valued signed integers by way of
bit-neg, because people are mostly unaware of them and assume they just
work. My preferred solution would be

#define MSR_MTRRphysMask_Valid (1ull  11)
#define MSR_MTRRdefType_Enable (1ull  11)

Feel free to ignore this of course.

  #define MSR_CORE_PERF_FIXED_CTR00x309
  #define MSR_CORE_PERF_FIXED_CTR10x30a
  #define MSR_CORE_PERF_FIXED_CTR20x30b
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 097fe11..cb31338 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -79,6 +79,7 @@ static int lm_capable_kernel;
  static bool has_msr_hv_hypercall;
  static bool has_msr_hv_vapic;
  static bool has_msr_hv_tsc;
 +static bool has_msr_mtrr;
  
  static bool has_msr_architectural_pmu;
  static uint32_t num_architectural_pmu_counters;
 @@ -739,6 +740,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
  env-kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
  }
  
 +if (env-features[FEAT_1_EDX]  CPUID_MTRR) {
 +has_msr_mtrr = true;
 +}
 +

Seems to match MTRR Feature Identification in my (older) copy of the SDM.

  return 0;
  }
  
 @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  CPUX86State *env = cpu-env;
  

Re: [PATCH] x86: Reset MTRR on vCPU reset

2014-08-13 Thread Laszlo Ersek
On 08/14/14 00:06, Alex Williamson wrote:
 On Wed, 2014-08-13 at 22:33 +0200, Laszlo Ersek wrote:
 a number of comments -- feel free to address or ignore each as you see fit:

 On 08/13/14 21:09, Alex Williamson wrote:

 mappings which are now stale after reset.  The result is that OVMF
 rebooting on such a configuration takes a full minute to LZMA
 decompress the EFI volume, a process that is nearly instant on the

 For pedantry, instead of EFI volume we could say LZMA-compressed
 Firmware File System file in the FVMAIN_COMPACT firmware volume.
 
 Can you come up with something with maybe half that many words?

Firmware volume then. Firmware volume is not a generic term, it's a
specific term in the Platform Initialization (PI) spec.

 And
 also, does it matter?

No. :)

 I want someone using OVMF and experiencing a long
 reboot delay to know that this might fix their problem.  Noting that the
 major time consuming stall is in the LZMA decompression code helps to
 rationalize why the mapping change is important.  The specific blob of
 data that's being decompressed seems mostly irrelevant, which is why I
 only gave it 2 words.

Fair enough, it's just that EFI volume doesn't mean anything specific
(to me), while firmware volume does.

 @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  CPUX86State *env = cpu-env;
  struct {
  struct kvm_msrs info;
 -struct kvm_msr_entry entries[100];
 +struct kvm_msr_entry entries[128];
  } msr_data;
  struct kvm_msr_entry *msrs = msr_data.entries;
  int n = 0, i;
 @@ -1278,6 +1283,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
  kvm_msr_entry_set(msrs[n++], HV_X64_MSR_REFERENCE_TSC,
env-msr_hv_tsc);
  }
 +if (has_msr_mtrr) {
 +kvm_msr_entry_set(msrs[n++], MSR_MTRRdefType, 
 env-mtrr_deftype);
 +for (i = 0; i  MSR_MTRRcap_VCNT; i++) {
 +kvm_msr_entry_set(msrs[n++],
 +  MSR_MTRRphysMask(i), 
 env-mtrr_var[i].mask);
 +}
 +}
  
  /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
   *   kvm_put_msr_feature_control. */


 I think that this code is correct (and sufficient for the reset
 problem), but I'm uncertain if it's complete:

 (a) Shouldn't you put the matching PhysBase registers as well (for the
 variable range ones)?

 Plus, shouldn't you put mtrr_fixed[11] too (MSR_MTRRfix64K_0, ...)?
 
 If my change wasn't isolated to the reset portion of kvm_put_msrs() then
 I would agree with you.  But since it is, all of those registers are
 undefined by the SDM.

That's a good way to express your point indeed, and a good way to
formulate my concern: I'm not sure your change is isolated to the reset
portion. The check that gates the new hunk says

  level = KVM_PUT_RESET_STATE

and a higher level than that does exist: KVM_PUT_FULL_STATE, which is
used in incoming migration.

 (b) You only modify kvm_put_msrs(). What about kvm_get_msrs()? I can see
 that you make the msr putting dependent on:

 /*
  * The following MSRs have side effects on the guest or are too
  * heavy for normal writeback. Limit them to reset or full state
  * updates.
  */
 if (level = KVM_PUT_RESET_STATE) {

 But that's probably not your reason for omitting matching new code from
 kvm_get_msrs(): HV_X64_MSR_REFERENCE_TSC is also heavy-weight (visible
 in your patch's context), but that one is nevertheless handled in
 kvm_get_msrs().

 My only reason for (b) is simply symmetry. For example, commit 48a5f3bc
 added HV_X64_MSR_REFERENCE_TSC at once to both put() and get().

 According to target-i386/machine.c, mtrr_deftype and co. are even
 migrated (part of vmstate), so this asymmetry could become a problem in
 migration. Eg. source host doesn't fetch MTRR state from KVM, hence wire
 format carries garbage, but on the target you put (part of) that garbage
 (right now, just the mask) back into KVM:

 do_savevm()
   qemu_savevm_state()
 qemu_savevm_state_complete()
   cpu_synchronize_all_states()
 cpu_synchronize_state()
   kvm_cpu_synchronize_state()
 do_kvm_cpu_synchronize_state()
   kvm_arch_get_registers()
 kvm_get_msrs()

 do_loadvm()
   load_vmstate()
 qemu_loadvm_state()
   cpu_synchronize_all_post_init()
 cpu_synchronize_post_init()
   kvm_cpu_synchronize_post_init()
 kvm_arch_put_registers(..., KVM_PUT_FULL_STATE)
   kvm_put_msrs(..., KVM_PUT_FULL_STATE)

 /* state subset modified during VCPU reset */
 #define KVM_PUT_RESET_STATE 2

 /* full state set, modified during initialization or on vmload */
 #define KVM_PUT_FULL_STATE  3

 Hence I suspect (a) and (b) should be handled.

 ... And then we arrive at cross-version migration, where both source and
 target hosts support MTRR, but the source qemu sends

Re: [PATCH] x86: Reset MTRR on vCPU reset

2014-08-13 Thread Laszlo Ersek
On 08/14/14 01:17, Laszlo Ersek wrote:

 - With KVM, the lack of loading MTRR state from KVM, combined with the
   (partial) storing of MTRR state to KVM, has two consequences:
   - migration invalidates (loses) MTRR state,

I'll concede that migration *already* loses MTRR state (on KVM), even
before your patch. On the incoming host, the difference is that
pre-patch, the guest continues running (after migration) with MTRRs in
the initial KVM state, while post-patch, the guest continues running
after an explicit zeroing of the variable MTRR masks and the deftype.

I admit that it wouldn't be right to say that the patch causes MTRR
state loss.

With that, I think I've actually convinced myself that your patch is
correct:

The x86_cpu_reset() hunk is correct in any case, independently of KVM
vs. TCG. (On TCG it even improves MTRR conformance.) Splitting that hunk
into a separate patch might be worthwhile, but not overly important.

The kvm_put_msrs() hunk forces a zero write to the variable MTRR
PhysMasks and the DefType, on both reset and on incoming migration. For
reset, this is correct behavior. For incoming migration, it is not, but
it certainly shouldn't qualify as a regression, relative to the current
status (where MTRR state is simply lost and replaced with initial MTRR
state on the incoming host).

I think the above end results could be expressed more clearly in the
code, but I'm already wondering if you'll ever talk to me again, so I'm
willing to give my R-b if you think that's useful... :)

(Again, I might be wrong, easily.)

Thanks
Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm smm mode support?

2014-04-28 Thread Laszlo Ersek
On 04/26/14 11:40, Paolo Bonzini wrote:
 Il 25/04/2014 09:39, Gerd Hoffmann ha scritto:
 Anyone has plans to add smm support to kvm?
 
 No plans, but it should be a Simple Matter Of Programming...
 
 A good start would be to write unit tests for SMM that work with QEMU.

Well I don't know what behavior to expect from SMM... :)

Plus, Kevin recently posted some remarks about the SMM implementation in
qemu-tcg -- apparently it's not faithful enough to physical hardware:

http://thread.gmane.org/gmane.comp.bios.coreboot.seabios/7959
http://thread.gmane.org/gmane.comp.emulators.qemu/268909

(But I can see that you've been already discussing the second thread; I
guess I should read up on it first...)

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ARM VM System Sepcification

2014-03-22 Thread Laszlo Ersek
On 03/23/14 00:38, Michael Casadevall wrote:
 On 03/22/2014 03:57 PM, Paolo Bonzini wrote:
 Il 22/03/2014 13:23, Grant Likely ha scritto:
 VM implementations SHOULD to implement persistent variables for 
 their UEFI implementation - with whatever constraints that may be
 put on higher level tools. Variable storage SHALL be a property
 of a VM instance, but SHALL NOT be stored as part of a portable
 disk image. Portable disk images SHALL conform to the UEFI
 removable disk requirements from the UEFI spec.

 I fully agree with this wording.

 Paolo
 
 +1 here

I cannot resist getting productive here, so I'll mention that the first
to in the language should probably be dropped.

I would also replace the but with and.

Happy to contribute, always.
Laszlo
:)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ARM VM System Sepcification

2014-03-08 Thread Laszlo Ersek
On 03/08/14 12:41, Michael Casadevall wrote:
 
 On 03/06/2014 05:46 PM, Paolo Bonzini wrote:
 Il 06/03/2014 09:52, Robie Basak ha scritto:
 On Sat, Mar 01, 2014 at 03:27:56PM +, Grant Likely wrote:
 I would also reference section 3.3 (Boot Option Variables Default Boot
 Behavior) and 3.4.1.1 (Removable Media Boot Behavior) here. It's
 fine to
 restate the meaning of the requirement in this spec, but the UEFI spec
 is the authoritative source. Distributed VM disk images fall under the
 same scenario as the firmware not having any valid boot variables.

 What happens when the VM is first booted without boot variables, but
 then the OS expects to be able to set boot variables and see them on
 next boot?

 UEFI scans the devices; looks for an EFI system partition on the
 disks; and builds a default boot order.

 If possible, I would prefer to mandate that the host implementation is
 permitted to no-op (or otherwise disable) boot variable write operations
 altogether to avoid having to deal with this. In the common case, I
 don't see why an OS installation shipped via a VM disk image would need
 to write boot variables anyway.

 Would there be any adverse consequences to doing this?

 Given the experience on x86 UEFI, no.

 Unlike bare metal, it is common to run UEFI VMs without persistent
 flash storage.  In this case the boot variables and boot order are
 rebuilt on the fly on every boot, and it just works for both Windows
 and Linux; there's no reason why it should be any different for ARM.

 While I realize in the real world, we can live with non-persistent boot
 variables, this is a *direct* violation of the UEFI spec; we can't call
 our VMs UEFI-compatible if we do this.
 
 However, I've been looking at the spec, and I think we're within spec if
 we save the variables on the HDD itself. There's some support for this
 already (Firmware Block Volume Device), but its possible we could
 implement boot variables as a file on system partition (UEFI's default
 search order can be used to figure out which variable file to use, or
 some sorta fingerprinting system). The biggest trick though is that
 UEFI's Runtime Services will need to be able to write this file, which
 may require us move a large chunk of UEFI to runtime services so the FAT
 filesystem stuff can stick around. If we give it a proper partition,
 then we can just do raw block read/writes. This however would require us
 mandating that said partition exists, and making sure there aren't any
 hidden gotchas in invoking this.

This is how OVMF fakes non-volatile variables when pflash is not enabled
or supported by the host. It stores variables in a file called \NvVars
in the EFI system partition. Before ExitBootServices(), changes are
synced out immediately. After ExitBootServices(), changes are kept only
in memory. If you then do an in-VM reboot, then you end up again
before ExitBootServices(), and then the in-memory changes are written
out. This was usually good enough to keep installers happy (because the
last step they do is install grub, run efibootmgr, and reboot).

Keeping around the UEFI FAT driver into runtime would be just one
problem. Another problem is that the OS could itself mount the partition
that holds the file, and then the two drivers (UEFI and Linux) would
corrupt the filesystem. A raw partition and a custom FVB driver would
probably be safer in this regard.

 Obviously this isn't ideal, but this might be the middle road solution
 we need here. I can dig through Tiano to get a realistic idea of how
 hard this will be in reality if we want to seriously look at this option.

You could check out Jordan's implementation for qemu pflash --
OvmfPkg/QemuFlashFvbServicesRuntimeDxe.

The protocol to implement is EFI_FIRMWARE_VOLUME_BLOCK[2]_PROTOCOL,
everything else that should be seated on top automatically is already
present in edk2. The protocol is documented in Vol3 of the Platform Init
spec, chapter 3.4.2.

(...BLOCK2... is just a typedef to ...BLOCK..., see
MdePkg/Include/Protocol/FirmwareVolumeBlock.h)

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ARM VM System Sepcification

2014-03-06 Thread Laszlo Ersek
On 03/06/14 10:46, Paolo Bonzini wrote:
 Il 06/03/2014 09:52, Robie Basak ha scritto:
 On Sat, Mar 01, 2014 at 03:27:56PM +, Grant Likely wrote:
 I would also reference section 3.3 (Boot Option Variables Default Boot
 Behavior) and 3.4.1.1 (Removable Media Boot Behavior) here. It's fine to
 restate the meaning of the requirement in this spec, but the UEFI spec
 is the authoritative source. Distributed VM disk images fall under the
 same scenario as the firmware not having any valid boot variables.

+1


 What happens when the VM is first booted without boot variables, but
 then the OS expects to be able to set boot variables and see them on
 next boot?
 
 UEFI scans the devices; looks for an EFI system partition on the disks;
 and builds a default boot order.
 
 If possible, I would prefer to mandate that the host implementation is
 permitted to no-op (or otherwise disable) boot variable write operations
 altogether to avoid having to deal with this. In the common case, I
 don't see why an OS installation shipped via a VM disk image would need
 to write boot variables anyway.

 Would there be any adverse consequences to doing this?
 
 Given the experience on x86 UEFI, no.
 
 Unlike bare metal, it is common to run UEFI VMs without persistent flash
 storage.  In this case the boot variables and boot order are rebuilt on
 the fly on every boot, and it just works for both Windows and Linux;
 there's no reason why it should be any different for ARM.
 
 My reason is that this would save us from blocking a general OpenStack
 implementation on ARM by requiring that these pieces are implemented
 further up the stack first, when it would bring actual gain to doing so.

 This would not preclude host implementations from implementing writeable
 variables, or guests from using them. Just that for a _portable VM disk
 image_, the OS on it cannot assume that this functionality is present.
 
 This is already the case for most OSes.  Otherwise you wouldn't be able
 to move a hard disk from a (physical) machine to another.
 
 I strongly suggest that you take a look at the work done in Tiano Core's
 OvmfPkg, which has support for almost every QEMU feature thanks to the
 work of Laszlo Ersek and Jordan Justen.
 
 In particular, OvmfPkg has support for specifying a boot order in the VM
 configuration (which maps to the -boot option in QEMU).  In this case,
 the UEFI boot order is overridden by a variable that is placed in some
 architecture-specific firmware configuration mechanism (on x86 we have
 one called fw_cfg, on ARM you could look at the fdt).  This predates
 UEFI and is not a UEFI variable; in fact is is a list of OpenFirmware
 device paths.  UEFI will match the OF paths to UEFI paths, and use the
 result to build a UEFI boot order.

If I understand correctly, the question is this:

  Given a hypervisor that doesn't support non-volatile UEFI variables
  (including BootOrder and Boot), is it possible to automatically
  boot a carefully prepared VM image, made available as a fixed media
  device?

The answer is yes. See

  3.3 Boot Option Variables Default Boot Behavior

in the UEFI spec (already referenced by Grant Likely in the context above).

  3.3 Boot Option Variables Default Boot Behavior

  The default state of globally-defined variables is firmware vendor
  specific. However the boot options require a standard default behavior
  in the exceptional case that valid boot options are not present on a
  platform. The default behavior must be invoked any time the BootOrder
  variable does not exist or only points to nonexistent boot options.

  If no valid boot options exist, the boot manager will enumerate all
  removable media devices followed by all fixed media devices. The order
  within each group is undefined. These new default boot options are not
  saved to non volatile storage. The boot manger will then attempt to
  boot from each boot option. If the device supports the
  EFI_SIMPLE_FILE_SYSTEM_PROTOCOL then the removable media boot behavior
  (see Section 3.4.1.1) is executed. Otherwise, the firmware will
  attempt to boot the device via the EFI_LOAD_FILE_PROTOCOL.

  It is expected that this default boot will load an operating system or
  a maintenance utility. If this is an operating system setup program it
  is then responsible for setting the requisite environment variables
  for subsequent boots. The platform firmware may also decide to recover
  or set to a known set of boot options.

Basically, the removable media boot behavior applies to fixed media
devices as last resort. You can prepare a disk image where this behavior
will simply boot the OS.

See also Peter Jones' blog post about this:

http://blog.uncooperative.org/blog/2014/02/06/the-efi-system-partition/

(In short, my email is a +1 to what Grant said.)

Thanks
Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org

Re: [Qemu-devel] [PATCH] kvm: print suberror on all internal errors

2014-01-21 Thread Laszlo Ersek
On 01/21/14 18:11, Radim Krčmář wrote:
 KVM introduced internal error exit reason and suberror at the same time,
 and later extended it with internal error data.
 QEMU does not report suberror on hosts between these two events because
 we check for the extension. (half a year in 2009, but it is misleading)
 
 Fix by removing KVM_CAP_INTERNAL_ERROR_DATA condition on printf.
 
 (partially improved by bb44e0d12df70 and ba4047cf848a3 in the past)
 
 Signed-off-by: Radim Krčmář rkrc...@redhat.com
 ---
  kvm-all.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)
 
 diff --git a/kvm-all.c b/kvm-all.c
 index 0bfb060..0a91d8e 100644
 --- a/kvm-all.c
 +++ b/kvm-all.c
 @@ -1539,17 +1539,16 @@ static void kvm_handle_io(uint16_t port, void *data, 
 int direction, int size,
  
  static int kvm_handle_internal_error(CPUState *cpu, struct kvm_run *run)
  {
 -fprintf(stderr, KVM internal error.);
 +fprintf(stderr, KVM internal error. Suberror: %d\n,
 +run-internal.suberror);
 +
  if (kvm_check_extension(kvm_state, KVM_CAP_INTERNAL_ERROR_DATA)) {
  int i;
  
 -fprintf(stderr,  Suberror: %d\n, run-internal.suberror);
  for (i = 0; i  run-internal.ndata; ++i) {
  fprintf(stderr, extra data[%d]: %PRIx64\n,
  i, (uint64_t)run-internal.data[i]);
  }
 -} else {
 -fprintf(stderr, \n);
  }
  if (run-internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
  fprintf(stderr, emulation failure\n);
 

Based on earlier discussion

Reviewed-by: Laszlo Ersek ler...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] apparent KVM problem with LRET in TianoCore S3 resume trampoline

2013-12-08 Thread Laszlo Ersek
On 12/06/13 13:03, Paolo Bonzini wrote:
 Il 05/12/2013 19:29, Laszlo Ersek ha scritto:
 On 12/05/13 18:42, Paolo Bonzini wrote:
 Il 05/12/2013 17:12, Laszlo Ersek ha scritto:
 Hi,

 I'm working on S3 suspend/resume in OVMF. The problem is that I'm getting 
 an
 unexpected guest reboot for code (LRET) that works on physical hardware. I
 tried to trace the problem with ftrace, but I didn't get any mentions of
 em_ret_far(). (Maybe I was looking in the wrong place.)

 What does ftrace say anyway?

 (pls. see in the next msg I sent)
 
 Actually I meant the ftrace without any patches.
 
 Thanks to your binary I now reproduced the issue and it looks like the
 64-bit-16-bit switch works:

Thank you for spending (apparently more than a little) time on this!

 
  qemu-system-x86-4081  [001] 62650.335040: kvm_exit: reason 
 CR_ACCESS rip 0x3cf7ae45 info 0 0
  qemu-system-x86-4081  [001] 62650.335041: kvm_cr:   cr_write 0 = 
 0x32
  qemu-system-x86-4081  [001] 62650.335046: kvm_entry:vcpu 0
 
   This is the mov %rax, %cr0. PE and PG are turned off.

I'm surprised by this result. The instruction you refer to is below
_AsmTransferControl_al_ (in the original, unpatched code).

I had earlier added an infinite loop right below that label (a different
loop than my  debug loop), and it was *never* reached in my test.
That is, from the lret that I reported as problematic, to the
instruction you refer to, the CPU would have had to cross (and finish)
the infinite loop that I had added earlier. And that never happened in
my test.

I had added that loop at _AsmTransferControl_al_ immediately
precisely because I wanted to see if the label is reached and the
problem is with something below that label, or with the first lret. I
sent my email to the KVM list after I had isolated the problem to the
first LRET:

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5297/focus=5325

On 12/04/13 19:05, Laszlo Ersek wrote:
 I tested if the (intended) target location of the LRET is reached, and
 it is not. (It's easy to test by adding a small infinite loop, moving
 it around, and seeing if the VM is spinning with or without producing
 a bunch of output on the debug port.) It's *really* that
 internally-targeted LRET that causes a reboot. [...]

I have absolutely no clue why this code executes for you and doesn't for
me :) What guest RAM size did you test with?

  qemu-system-x86-4081  [001] 62650.335047: kvm_exit: reason 
 MSR_READ rip 0x3cf7ae4e info 0 0
  qemu-system-x86-4081  [001] 62650.335048: kvm_msr:  msr_read 
 c080 = 0x100
  qemu-system-x86-4081  [001] 62650.335048: kvm_entry:vcpu 0
  qemu-system-x86-4081  [001] 62650.335048: kvm_exit: reason 
 MSR_WRITE rip 0x3cf7ae53 info 0 0
  qemu-system-x86-4081  [001] 62650.335049: kvm_msr:  msr_write 
 c080 = 0x0
  qemu-system-x86-4081  [001] 62650.335050: kvm_entry:vcpu 0
 
   LME is turned off.
 
  qemu-system-x86-4081  [001] 62650.335050: kvm_exit: reason 
 CR_ACCESS rip 0x3cf7ae55 info 304 0
  qemu-system-x86-4081  [001] 62650.335050: kvm_cr:   cr_write 4 = 
 0x640
  qemu-system-x86-4081  [001] 62650.335053: kvm_entry:vcpu 0
 
   PAE is turned off.
 
  qemu-system-x86-4081  [001] 62650.335054: kvm_exit: reason 
 CR_ACCESS rip 0x11e6 info 0 0
  qemu-system-x86-4081  [001] 62650.335054: kvm_cr:   cr_write 0 = 
 0x33
  qemu-system-x86-4081  [001] 62650.335054: kvm_entry:vcpu 0
 
   Here we're already in real mode.  The weird RIP is explained by
   the first few bytes after the FACS resume vector:

From this point on you were debugging the Linux wakeup code, in
arch/x86/realmode/rm/wakeup_asm.S. I think.

 
   0x9a1d::  cli
   0x9a1d:0001:  cld
   0x9a1d:0002:  ljmp   $9900,$11d7

ENTRY(wakeup_start)
cli
cld

LJMPW_RM(3f)
3:
/* Apparently some dimwit BIOS programmers don't know how to
   program a PM to RM transition, and we might end up here with
   junk in the data segment descriptor registers.  The only way
   to repair that is to go into PM and fix it ourselves... */
[...]

From Linux kernel commit 4b4f7280.

 The page tables are, ahem, crap:
 
 000c000: 6750 fe01        gP..
 000c010:          
 000c020:          
 000c030:          
 000c040:          
 000c050:          
 000c060:          
 000c070:          
 000c080:          
 000c090:  

Re: [edk2] apparent KVM problem with LRET in TianoCore S3 resume trampoline

2013-12-08 Thread Laszlo Ersek
On 12/08/13 18:43, Laszlo Ersek wrote:
 On 12/06/13 13:03, Paolo Bonzini wrote:

 Thanks to your binary I now reproduced the issue and it looks like the
 64-bit-16-bit switch works:
 
 Thank you for spending (apparently more than a little) time on this!
 

  qemu-system-x86-4081  [001] 62650.335040: kvm_exit: reason 
 CR_ACCESS rip 0x3cf7ae45 info 0 0
  qemu-system-x86-4081  [001] 62650.335041: kvm_cr:   cr_write 0 
 = 0x32
  qemu-system-x86-4081  [001] 62650.335046: kvm_entry:vcpu 0

  This is the mov %rax, %cr0. PE and PG are turned off.
 
 I'm surprised by this result. The instruction you refer to is below
 _AsmTransferControl_al_ (in the original, unpatched code).
 
 I had earlier added an infinite loop right below that label (a different
 loop than my  debug loop), and it was *never* reached in my test.
 That is, from the lret that I reported as problematic, to the
 instruction you refer to, the CPU would have had to cross (and finish)
 the infinite loop that I had added earlier. And that never happened in
 my test.
 
 I had added that loop at _AsmTransferControl_al_ immediately
 precisely because I wanted to see if the label is reached and the
 problem is with something below that label, or with the first lret. I
 sent my email to the KVM list after I had isolated the problem to the
 first LRET:
 
 http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5297/focus=5325
 
 On 12/04/13 19:05, Laszlo Ersek wrote:
 I tested if the (intended) target location of the LRET is reached, and
 it is not. (It's easy to test by adding a small infinite loop, moving
 it around, and seeing if the VM is spinning with or without producing
 a bunch of output on the debug port.) It's *really* that
 internally-targeted LRET that causes a reboot. [...]

First, you're right and I'm wrong. I can see the triple fault in my kvm
trace as well.

Second... This is incredible. Guess what the following patch does.

 ASM_GLOBAL ASM_PFX(AsmTransferControl)
 ASM_PFX(AsmTransferControl):
 # rcx S3WakingVector:DWORD
 # rdx AcpiLowMemoryBase :DWORD
 lea   _AsmTransferControl_al_(%rip), %eax
 movq  $0x28, %r8
 orq   %r8, %rax
 pushq %rax
 shrd  $20, %ecx, %ebx
 andl  $0x0f, %ecx
 movw  %cx, %bx
 movl  %ebx, jmp_addr(%rip)
 lret
 _AsmTransferControl_al_:
 .byte0x0b8, 0x30, 0  # mov ax, 30h as selector
 movl  %eax, %ds
 movl  %eax, %es
 movl  %eax, %fs
 movl  %eax, %gs
 movl  %eax, %ss
+.code16
+movw $0x402, %dx
+movb $0x58, %al
+qqq:
+outb %al, %dx
+jmp qqq
+movb $0x59, %al
+outb %al, %dx
+.code64
 movq  %cr0, %rax
 movq  %cr4, %rbx

The infinite loop that I used to pin down the first instruction that
broke around lret -- it was useless.

The above patch should produce an infinite string of X characters on
the qemu debug port. After the inifnite string, it should produce a
Y character.

In practice, I'm seeing *one* X character, and then a triple fault.

The jmp qqq instruction *itself* causes a triple fault (a guest
reboot), masking the problem that I was looking for.

When I had such an empty infinite loop (at that time, still without the
outb) right before the lret, it worked. As soon as I moved it below
_AsmTransferControl_al_ (ie. under the target of the lret), the jmp
*itself* caused a triple fault (guest reboot), causing me to think that
the *lret* caused the reboot. Because, the tight jmp loop would always
work, and if I'm seeing a reboot loop instead of tight spinning right
after the lret, that means that the lret caused the reboot, right? Right?

This is the binary:

0037 qqq:
  37:   ee  out%al,(%dx)
  38:   eb fd   jmp37 qqq

Opcode  Instruction Op/ 64-Bit Compat/   Description
En  Mode   Leg Mode
EB cb   JMP rel8A   Valid  Valid Jump short, RIP = RIP + 8-bit
 displacement sign extended
 to 64-bits

My belief in the sanity of x86 has been shattered to its core. I don't
feel stupid. I feel cheated.

Thank you for your help.
Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


apparent KVM problem with LRET in TianoCore S3 resume trampoline

2013-12-05 Thread Laszlo Ersek
Hi,

I'm working on S3 suspend/resume in OVMF. The problem is that I'm getting an
unexpected guest reboot for code (LRET) that works on physical hardware. I
tried to trace the problem with ftrace, but I didn't get any mentions of
em_ret_far(). (Maybe I was looking in the wrong place.)

Please find the the assembly-language trampoline that is invoked (in 64-bit
mode) with the 16-bit real mode resume vector placed in rcx (EFIAPI calling
convention). The excerpt is from the edk2 tree,
MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S.

I'm annotating the source code to the right -- please excuse my audacity as I
know you all eat assembly for breakfast, but maybe it will speed up your
processing. (Or perhaps I'll sneakily confuse you with my errors :))

  ASM_GLOBAL ASM_PFX(AsmTransferControl) #
  ASM_PFX(AsmTransferControl):   #
  # rcx S3WakingVector:DWORD # ecx:  
  
  # rdx AcpiLowMemoryBase :DWORD #
  lea   _AsmTransferControl_al_(%rip), %eax  # pushing $0x28 
for CS
  movq  $0x28, %r8   # and address of
  orq   %r8, %rax# 
_AsmTransferControl_al_
  pushq %rax # for RIP
  shrd  $20, %ecx, %ebx  # ebx:  
  
  andl  $0x0f, %ecx  # ecx:  
  
  movw  %cx, %bx # ebx:  
  
  movl  %ebx, jmp_addr(%rip) # stores vector as 
16-bit segment:offset pair
  :  # -- my own loop
  jmp    # -- for debugging
  lret   # (*) TRIGGERS 
REBOOT
  _AsmTransferControl_al_:   #
  .byte0x0b8, 0x30, 0  # mov ax, 30h as selector #
  movl  %eax, %ds#
  movl  %eax, %es#
  movl  %eax, %fs#
  movl  %eax, %gs#
  movl  %eax, %ss#
  movq  %cr0, %rax   #
  movq  %cr4, %rbx   #
  .byte0x66  # (**)
  andl  $0x7ffe, %eax# preps for 
turning off Paging and Protection Enable
  andb  $0xdf, %bl   # preps for 
turning off PAE
  movq  %rax, %cr0   # Paging and PE off
  .byte0x66  # (**)
  movl  $0x0c080, %ecx   #
  rdmsr  #
  andb  $0xfe, %ah   #
  wrmsr  # IA-32e Mode 
Enable off
  movq  %rbx, %cr4   # PAE off
  .byte0x0ea  # jmp far jmp_addr #
  jmp_addr:  #
  .long0 #  
:

The small loop at  is my debug loop. The lret instruction right after
(marked with (*)) triggers a reboot in KVM.

In the loop, this is the register dump (taken with the info registers qemu
monitor command):

  RAX=00289c75be2b RBX=9a1d RCX= 
RDX=
  RSI= RDI= RBP=9f7bafd0 
RSP=9f7bae30
  R8 =0028 R9 = R10=008454cd 
R11=
  R12= R13= R14=008454c6 
R15=
  RIP=9c75be28 RFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0018   00c09300 DPL=0 DS   [-WA]
  CS =0018   00a09b00 DPL=0 CS64 [-RA]
  SS =0018   00c09300 DPL=0 DS   [-WA]
  DS =0018   00c09300 DPL=0 DS   [-WA]
  FS =0018   00c09300 DPL=0 DS   [-WA]
  GS =0018   00c09300 DPL=0 DS   [-WA]
  LDT=   8200 DPL=0 LDT
  TR =   8b00 DPL=0 TSS64-busy
  GDT= 00844c80 0047
  IDT= 9c01fd60 021f
  CR0=8033 CR2= CR3=0008 CR4=0660
  DR0= DR1= DR2= 
DR3=
  DR6=0ff0 

Re: [edk2] apparent KVM problem with LRET in TianoCore S3 resume trampoline

2013-12-05 Thread Laszlo Ersek
Small addition -- apologies for the self-followup:

On 12/05/13 17:12, Laszlo Ersek wrote:
 I tried to trace the problem with ftrace, but I didn't get any mentions of
 em_ret_far(). (Maybe I was looking in the wrong place.)

I applied the following small patch (to the original code):

diff --git a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S 
b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
index e59fd04..daa4f7e 100644
--- a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
+++ b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
@@ -18,8 +18,8 @@ ASM_GLOBAL ASM_PFX(AsmTransferControl)
 ASM_PFX(AsmTransferControl):
 # rcx S3WakingVector:DWORD
 # rdx AcpiLowMemoryBase :DWORD
-lea   _AsmTransferControl_al_(%rip), %eax
-movq  $0x28, %r8
+lea   AsmTransferControl(%rip), %eax
+movq  $0x38, %r8
 orq   %r8, %rax
 pushq %rax
 shrd  $20, %ecx, %ebx

This turns the code right under AsmTransferControl into a working, 64-bit mode
loop. (Recall that 0x38 selects a descriptor that has the L (64-bitC) bit
set:

   0x0038: 0x00AF9B00: Base=0x Limit=0xF Type=0xB (C ER A  
) S=0x1 (code/data) DPL=0x0 Present=1 Avail=0 64-bitC=1 D/B=0 
 LimitGran=0x1 (4KB)
)

While this was spinning (I checked the RIP several times with the qemu monitor
and it was alternating between a few close values -- ie. not stuck), I ran
trace-cmd. The report seems to confirm that the lret is not emulated, because
the only lines I'm seeing are:

 qemu-system-x86-3901  [001] 38939.599663: kvm_exit: reason 
EXTERNAL_INTERRUPT rip 0x9c75be0a info 0 80ef
 qemu-system-x86-3901  [001] 38939.599684: kvm_entry:vcpu 0

repeated infinitely. The rip varies between a few close values,

458 rip 0x9c75be04
313 rip 0x9c75be0a
  5 rip 0x9c75be17
  4 rip 0x9c75be18
  3 rip 0x9c75be22
  8 rip 0x9c75be28

Thanks again and sorry for the noise.
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] apparent KVM problem with LRET in TianoCore S3 resume trampoline

2013-12-05 Thread Laszlo Ersek
On 12/05/13 18:42, Paolo Bonzini wrote:
 Il 05/12/2013 17:12, Laszlo Ersek ha scritto:
 Hi,

 I'm working on S3 suspend/resume in OVMF. The problem is that I'm getting an
 unexpected guest reboot for code (LRET) that works on physical hardware. I
 tried to trace the problem with ftrace, but I didn't get any mentions of
 em_ret_far(). (Maybe I was looking in the wrong place.)
 
 What does ftrace say anyway?

(pls. see in the next msg I sent)

 
 Please find the the assembly-language trampoline that is invoked (in 64-bit
 mode) with the 16-bit real mode resume vector placed in rcx (EFIAPI calling
 convention). The excerpt is from the edk2 tree,
 MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S.
 
 Can you send me a pointer to a git tree or, even better, an OVMF.fd file + 
 instructions
 on how to trigger the problem?

http://people.redhat.com/~lersek/ovmf_s3_lret/

I use a stock F19 guest, with the default systemd target set to
multi-user. Then I login as root at the console, and issue pm-suspend.
Once the guest is suspended, I type

virsh qemu-monitor-command ovmf.f19 --hmp system_wakeup

on the host.

At this point the guest starts spinning (visible eg. in the virt-manager
CPU usage chart), and the OVMF debug log (written to the qemu debug
console) is continuously growing. It always takes the same path: selects
the S3 boot path due to the 0xFE byte at 0xF in CMOS, progresses to the
trampoline, and resets.


 
   - shared properties:
 Base=0x
 Limit=0xF
 Type=0xB (C ER A )
 S=0x1 (code/data)
 DPL=0x0
 Present=1
 Avail=0
 LimitGran=0x1 (4KB)

   - different properties:
 0x0010: 0x00CF9B00: 64-bitC=0 D/B=1  works
 0x0028: 0x008F9B00: 64-bitC=0 D/B=0  reboots
 0x0038: 0x00AF9B00: 64-bitC=1 D/B=0  works

 That is:
 - if I let 64-bit mode execution enabled (64-bitC=1, desc 0x38), the lret
   works.
 - If I switch to compat mode execution (64-bitC=0, desc 0x10), *and* keep the
   default addr/op size 32 bits, the lret still works.
 
 Perhaps you could try switching to 32-bit mode first, then disable paging,
 then jump to 16-bit mode.  Like this (untested):

I had something like this in mind (I even mentioned it on edk2-devel
http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5297/focus=5331),
but didn't know how to implement it. There are at least 6 factors in
play here:
- the L bit in the segment descriptor,
- the D/B bit in the segment descriptor,
- Paging,
- Protection Enable,
- IA-32e Mode Enable,
- PAE.

That's (almost) 6! orderings to test :)

So thanks a lot for suggesting a patch, I'll try to play with it.

 
 diff --git a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S 
 b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
 index e59fd04..d1cac9d 100644
 --- a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
 +++ b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
 @@ -19,7 +19,7 @@ ASM_PFX(AsmTransferControl):
  # rcx S3WakingVector:DWORD
  # rdx AcpiLowMemoryBase :DWORD
  lea   _AsmTransferControl_al_(%rip), %eax 
 -movq  $0x28, %r8 
 +movq  $0x10, %r8 
  orq   %r8, %rax
  pushq %rax
  shrd  $20, %ecx, %ebx
 @@ -28,24 +28,32 @@ ASM_PFX(AsmTransferControl):
  movl  %ebx, jmp_addr(%rip) 
  lret
  _AsmTransferControl_al_:
 +# Old SS should still be okay?
 +addl  _AsmTransferControl_al_0001-_AsmTransferControl_al_, %eax
 +pushl $0x28
 +pushl %eax
 +movq  %cr0, %rax
 +movq  %cr4, %rbx
 +andl  $0x7fff, %eax
 +andb  $0xdf, %bl
 +movq  %rax, %cr0 # sets EFER.LMA=0 too, so says Intel
 +movl  $0x0c080, %ecx
 +rdmsr
 +andb  $0xfe, %ah # set EFER.LME=0
 +wrmsr
 +movq  %rbx, %cr4 # only now set CR4.PAE=0
 +lret
 +_AsmTransferControl_al_0001:
  .byte0x0b8, 0x30, 0  # mov ax, 30h as selector
  movl  %eax, %ds
  movl  %eax, %es
  movl  %eax, %fs
  movl  %eax, %gs
  movl  %eax, %ss
 -movq  %cr0, %rax
 -movq  %cr4, %rbx
 -.byte0x66
 -andl  $0x7ffe, %eax 
 -andb  $0xdf, %bl 
 -movq  %rax, %cr0
 -.byte0x66
 -movl  $0x0c080, %ecx 
 -rdmsr
 -andb  $0xfe, %ah 
 -wrmsr
 -movq  %rbx, %cr4
 +movl  %cr0, %rax# Get control register 0
 +.byte 0x66
 +.byte 0x83,0xe0,0xfe# andeax, 0fffeh  ; Clear PE bit (bit #0)
 +.byte 0xf,0x22,0xc0 # movcr0, eax ; Activate real mode
 
 - If I switch to compat mode execution (64-bitC=0, desc 0x28), but also 
 change
   the default addr/op size to 16-bits, then the lret reboots the guest in KVM
   (but works on physical hardware).
 
 Did you try this on physical hardware, or just assumed that? :)

I was told on edk2-devel O:)

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5297/focus=5333

Thanks!
Laszlo
--
To unsubscribe

Re: [edk2] apparent KVM problem with LRET in TianoCore S3 resume trampoline

2013-12-05 Thread Laszlo Ersek
On 12/05/13 18:42, Paolo Bonzini wrote:

 diff --git a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S 
 b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
 index e59fd04..d1cac9d 100644
 --- a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
 +++ b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
 @@ -19,7 +19,7 @@ ASM_PFX(AsmTransferControl):
  # rcx S3WakingVector:DWORD
  # rdx AcpiLowMemoryBase :DWORD
  lea   _AsmTransferControl_al_(%rip), %eax 
 -movq  $0x28, %r8 
 +movq  $0x10, %r8 
  orq   %r8, %rax
  pushq %rax
  shrd  $20, %ecx, %ebx
 @@ -28,24 +28,32 @@ ASM_PFX(AsmTransferControl):
  movl  %ebx, jmp_addr(%rip) 
  lret
  _AsmTransferControl_al_:
 +# Old SS should still be okay?
 +addl  _AsmTransferControl_al_0001-_AsmTransferControl_al_, %eax
 +pushl $0x28
 +pushl %eax
 +movq  %cr0, %rax
 +movq  %cr4, %rbx
 +andl  $0x7fff, %eax
 +andb  $0xdf, %bl
 +movq  %rax, %cr0 # sets EFER.LMA=0 too, so says Intel
 +movl  $0x0c080, %ecx
 +rdmsr
 +andb  $0xfe, %ah # set EFER.LME=0
 +wrmsr
 +movq  %rbx, %cr4 # only now set CR4.PAE=0
 +lret
 +_AsmTransferControl_al_0001:
  .byte0x0b8, 0x30, 0  # mov ax, 30h as selector
  movl  %eax, %ds
  movl  %eax, %es
  movl  %eax, %fs
  movl  %eax, %gs
  movl  %eax, %ss
 -movq  %cr0, %rax
 -movq  %cr4, %rbx
 -.byte0x66
 -andl  $0x7ffe, %eax 
 -andb  $0xdf, %bl 
 -movq  %rax, %cr0
 -.byte0x66
 -movl  $0x0c080, %ecx 
 -rdmsr
 -andb  $0xfe, %ah 
 -wrmsr
 -movq  %rbx, %cr4
 +movl  %cr0, %rax# Get control register 0
 +.byte 0x66
 +.byte 0x83,0xe0,0xfe# andeax, 0fffeh  ; Clear PE bit (bit #0)
 +.byte 0xf,0x22,0xc0 # movcr0, eax ; Activate real mode

I had to add this incremental patch to get it to compile:

diff --git a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S 
b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
index c28df3f..85d2a36 100644
--- a/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
+++ b/MdeModulePkg/Universal/Acpi/BootScriptExecutorDxe/X64/S3Asm.S
@@ -30,8 +30,8 @@ ASM_PFX(AsmTransferControl):
 _AsmTransferControl_al_:
 # Old SS should still be okay?
 addl  _AsmTransferControl_al_0001-_AsmTransferControl_al_, %eax
-pushl $0x28
-pushl %eax
+.byte 0x6a,0x28  # pushl $0x28 ; opnd sz = 32bits in seg 0x10
+.byte 0x50   # pushl %eax
 movq  %cr0, %rax
 movq  %cr4, %rbx
 andl  $0x7fff, %eax
@@ -50,7 +50,7 @@ _AsmTransferControl_al_0001:
 movl  %eax, %fs
 movl  %eax, %gs
 movl  %eax, %ss
-movl  %cr0, %rax# Get control register 0
+.byte 0x0f,0x20,0xc0# movl  %cr0, %eax; Get control register 0
 .byte 0x66
 .byte 0x83,0xe0,0xfe# andeax, 0fffeh  ; Clear PE bit (bit #0)
 .byte 0xf,0x22,0xc0 # movcr0, eax ; Activate real mode

The 2nd lret is reached (just before _AsmTransferControl_al_0001), but then the 
CPU goes off in the woods. For a while it seems to be spinning who knows where, 
and in 15-20 seconds or so the guest reboots.

Does gas support mode switches in one file? I found examples on the net (for 
nasm I think) where people were thunking to real mode and back to protected 
mode in a single assembly file, and they could use native mnemonics for each 
part. (They just switched the assembler's mode in sync with execution modes.)

Thanks
Laszlo

Thanks,
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-06-11

2013-06-11 Thread Laszlo Ersek
On 06/11/13 17:45, Michael S. Tsirkin wrote:

 To summarize, there's a concensus now that generating ACPI
 tables in QEMU is a good idea.
 
 Two issues that need to be addressed:
 - original patches break cross-version migration. Need to fix that.
 
 - Anthony requested that patchset is merged together with
   some new feature. I'm not sure the reasoning is clear:
   current a version intentionally generates tables
   that are bug for bug compatible with seabios,
   to simplify testing.

Sorry about not following the series more closely -- is there now a qemu
interface available that allows any firmware just take the tables, maybe
to fix them up blindly / algorithmically, and to install them?

IOW, is the interface at such a point that in OVMF we could start
looking throwing out specific code, in favor of implementing the generic
fw-side algorithm?

   It seems clear we have users for this such as
   hotplug of devices behind pci bridges, so
   why keep the infrastructure out of tree?
 
   Looking for something additional, smaller as the hotplug patch
   is a bit big, so might delay merging.
 
 
 Going forward - would we want to move
 smbios as well? Everyone seems to think it's a
 good idea.

I think the current fw_cfg interface for SMBIOS tables is already good
enough to save a lot of work in OVMF. Namely, if all required tables
were generated (table template + field-wise patching) in qemu, and then
all exported over fw_cfg as verbatim tables, my SMBIOS series currently
pending for OVMF should be able to install them.

This would save OVMF the coding of templates (and any necessary
patching) for types 3, 4 (especially nasty), 9, 16, 17, 19, and 32.
(Basically all except type 0 and type 1, which are already implemented
(but verbatim tables from qemu would take priority even for type 0 and
type 1). Type 7 can be left out apparently; IIRC dmidecode doesn't
report it even under SeaBIOS.)

I'm not implying anyone should start working on this (myself included
:)), but yeah, moving SMBIOS would save work in OVMF. (Provided there
was any reason to support said SMBIOS tables in OVMF :))

Thanks,
Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-05-28

2013-05-31 Thread Laszlo Ersek
On 05/31/13 09:09, Jordan Justen wrote:

 Why is updating the ACPI tables in seabios viewed as such a burden?
 Either qemu does it, or seabios... (And, OVMF too, but I don't think
 you guys are concerned with that. :)

I am :)

 On the flip side, why is moving the ACPI tables to QEMU such an issue?
 It seems like Xen and virtualbox both already do this. Why is running
 iasl not an issue for them?

I think something was mentioned about iasl having problems on BE
machines? I could be easily wrong but I *guess* qemu's hosts x targets
(emulate what on what) set is a proper superset of xen's and
virtualbox's. Presumably if you want to run an x86 guest on a MIPS host,
and also want to build qemu on the same MIPS (or SPARC) host, you'd have
to run iasl there too.

 Maybe we are doing lots of things horribly wrong in our OVMF ACPI
 tables :)

Impossible. :)

In earnest, I think what we have now is (mostly) correct, just not
extensive / flexible enough. No support for PCI hotplug or CPU hotplug,
none for S3 (although all of these tie into UEFI deeply), no MTRR setup,
no MPTABLE; let alone a non-PIIX chipset. (Well maybe I shouldn't lump
these under the ACPI umbrella.)

 but I haven't seen it as much of a burden. (Of course,
 Laszlo has helped out with many of the ACPI changes in OVMF, so his
 opinion should be taken into consideration too. :)

It hasn't been a burden in the sense of me not liking the activity; I
actually like fiddling with knobs. It has certainly been extra work to
bring OVMF's ACPI tables closer to SeaBIOS's functionality / flexibility
(and we still lag behind it quite.).

Due to licensing differences I can't just port code from SeaBIOS to OVMF
(and I never have without explicit permission), so it's been a lot of
back and forth with acpidump / iasl -d in guests (massage OVMF, boot
guest, check guest dmesg / lspci, dump tables, compare, repeat), brain
picking colleagues, the ACPI and PIIX specs and so on. I have a page on
the RH intranet dedicated to this. When something around these parts is
being changed (or looks like it could be changed) in SeaBIOS, or between
qemu and SeaBIOS, I always must be alert and consider reimplementing it
in, or porting it with permission to, OVMF. (Most recent example:
pvpanic device -- currently only in SeaBIOS.)

It worries me that if I slack off, or am busy with something else, or
simply don't notice, then the gap will widen again. I appreciate
learning a bunch about ACPI, and don't mind the days of work that went
into some of my simple-looking ACPI patches for OVMF, but had the tables
come from a common (programmatic) source, none of this would have been
an issue, and I wouldn't have felt even occasionally that ACPI patches
for OVMF were both duplicate work *and* futile (considering how much
ahead SeaBIOS was).

I don't mind reimplementing stuff, or porting it with permission, going
forward, but the sophisticated parts in SeaBIOS are a hard nut. For
example I'll never be able to auto-extract offsets from generated AML
and patch the AML using those offsets; the edk2 build tools (a project
separate from edk2) don't support this, and it takes several months to
get a thing as simple as gcc-47 build flags into edk2-buildtools.

Instead I have to write template ASL, compile it to AML, hexdump the
result, verify it against the AML grammar in the ACPI spec (offsets
aren't obvious, BytePrefix and friends are a joy), define  initialize a
packed struct or array in OVMF, and patch the template AML using fixed
field names or array subscripts. Workable, but dog slow. If the ACPI
payload came from up above, we might be as well provided with a list of
(canonical name, offset, size) triplets, and could perhaps blindly patch
the contents. (Not unlike Michael's linker code for connecting tables
into a hierarchy.)

AFAIK most recently iasl got built-in support for offset extraction (and
in the process the current SeaBIOS build method was broken...), so that
part might get easier in the future.

Oh well it's Friday, sorry about this rant! :) I'll happily do what I
can in the current status quo, but frequently, it won't amount to much.

Thanks,
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [SeaBIOS] KVM call agenda for 2013-05-28

2013-05-31 Thread Laszlo Ersek
On 05/31/13 10:13, Peter Stuge wrote:

 ACPI bytes are obviously a function of QEMU configuration.

Precisely!

When we evaluate that (mathematical-sense) function in boot firmware, we
need to retrieve the function's arguments. Those arguments are bits of
QEMU configuration, as you say, and fw_cfg is extremely inconvenient for
fetching them. Whenever the domain or arity of said mathematical
function changes (we need more arguments, or different kinds of
arguments), we have to extend fw_cfg with yet another ad-hoc key or file
entry.

The set of arguments going into ACPI tables *is* ad-hoc and arbitrary,
there's nothing to do about it. But at least we shouldn't impede the
retrieval of said arguments with artificial obstacles, such as
half-assedly serializing them over fw_cfg (and not documenting them,
naturally). In qemu the entire pool of arguments, current and future,
would be at just an arm's length, in immediately consumable form.

I've said the same about the acpi_build_madt() prototype that was
proposed for qemu:
http://thread.gmane.org/gmane.comp.emulators.qemu/207171/focus=208719.

Thanks,
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-05-28

2013-05-31 Thread Laszlo Ersek
On 05/31/13 16:08, David Woodhouse wrote:
 On Fri, 2013-05-31 at 08:04 -0500, Anthony Liguori wrote:

 soapbox

 Fork OVMF, drop the fat module, and just add GPL code.  It's an easily
 solvable problem.
 
 Heh. Actually it doesn't need to be a fork. It's modular, and the FAT
 driver is just a single module. Which is actually included in *binary*
 form in the EDK2 repository, I believe, and its source code is
 elsewhere.

Correct.

 We could happily make a GPL¹ or LGPL implementation of a FAT module and
 build our OVMF with that instead, and we wouldn't need to fork OVMF at
 all.

Yes, that's one plan, *if* someone can sort out, or is willing to
shoulder, the perhaps illogical but still worrisome surroundings of
FatPkg / FatBinPkg.

(I don't intend to spread FUD!)

For example, if your employer authorizes you to implement GplFatPkg from
scratch, and distribute it as an external module, I -- as someone
without any education in law though -- will give you a standing ovation
and buy you a case of beer at KVM Forum 2013. Deal? :)

(You proved to have great leverage by getting the efi compat table
extended, so... :))

 ¹ If it's GPL, of course, then we mustn't include any *other* binary
 blobs in our OVMF build. But the whole point in this conversation is
 that we don't *want* to do that. So that's fine.

Right. Eg. Shell1 is embedded as a pre-built binary, but that's just
convenience, you can build the in-tree Shell2 from source afresh and
embed that instead (and ship its source too).

Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-05-28

2013-05-31 Thread Laszlo Ersek
On 05/31/13 16:38, Anthony Liguori wrote:

 It's either Open Source or it's not.  It's currently not.

I disagree with this binary representation of Open Source or Not. If it
weren't (mostly) Open Source, how could we fork (most of) it as you're
suggesting (from the soapbox :))?

 I have a hard
 time sympathesizing with trying to work with a proprietary upstream.

My experience has been positive.

First of all, whether UEFI is a good thing or not is controversial. I
won't try to address that.

However UEFI is here to stay, machines are being shipped with it, Linux
and other OSen try to support it. Developing (or running) an OS in
combination with a specific firmware is sometimes easier / more economic
in a virtual environment, hence there should be support for qemu + UEFI.
It is this mindset that I operate in. (Oh, I also forgot to mention that
this task has been assigned to me by my superiors as well :))

Jordan, the OvmfPkg maintainer is responsive and progressive in the true
FLOSS manner (*), which was a nice surprise for a project whose coding
standards for example are made 100% after Windows source code, and whose
mailing list is mostly subscribed to by proprietary vendors. Really when
it comes to OvmfPkg patches the process follows the normal FLOSS
development model.

(*) Jordan, I hope this will prompt you to merge VirtioNetDxe v4 real
soon now :)

I personally think the 2-clause BSDL for 99% of the project was a very
sane and practical one from Intel et al.

FatPkg is a sad exception. One might even consider it a bad accident:
http://thread.gmane.org/gmane.comp.bios.tianocore.devel/1861/focus=1878

I have no idea how that selection process went down, but I assume if
FLOSS people had been screaming very loud at that time and had offered a
*simple* (which ext2 is not, I gather), wide-spread and unencumbered
filesystem, things would be different today.

 In terms of creating a FAT module, the most likely source would seem to
 be the kernel code and since that's GPL, I don't think it's terribly
 avoidable to end up with a GPL'd uefi implementation.
 
 If that's inevitable, then we're wasting effort by rewriting stuff under
 a BSD license.

Please ask your employer if they'd be willing to put their name on an
original, clean-room, GPL-licensed implementation of FAT32 for UEFI.


Thus far we've been talking copyright rather than patents, but there's
also this:

http://en.wikipedia.org/wiki/FAT_filesystem#Challenge
http://en.wikipedia.org/wiki/FAT_filesystem#Patent_infringement_lawsuits

It almost doesn't matter who prevails in such a lawsuit; the
*possibility* of such a lawsuit gives people cold feet. Blame the USPTO.

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-05-28

2013-05-31 Thread Laszlo Ersek
On 05/31/13 17:43, Anthony Liguori wrote:
 David Woodhouse dw...@infradead.org writes:
 
 On Fri, 2013-05-31 at 08:04 -0500, Anthony Liguori wrote:

 soapbox

 Fork OVMF, drop the fat module, and just add GPL code.  It's an easily
 solvable problem.

 Heh. Actually it doesn't need to be a fork. It's modular, and the FAT
 driver is just a single module. Which is actually included in *binary*
 form in the EDK2 repository, I believe, and its source code is
 elsewhere.

 We could happily make a GPL¹ or LGPL implementation of a FAT module and
 build our OVMF with that instead, and we wouldn't need to fork OVMF at
 all.
 
 So can't we have GPL virtio modules too?  I don't think there's any
 problem there except for the FAT module.

I share your assessment.

 I would propose more of a virtual fork.  It could consist of a git repo with
 the GPL modules + a submodule for edk2.  Ideally, there would be no need
 to actually fork edk2.

Indeed. edk2 is extremely modular. But in order to get a useful firmware
image ultimately, you need a FAT driver.

 My assumption is that edk2 won't take GPL code.

Correct, see eg. OvmfPkg/Contributions.txt.

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-05-28

2013-05-31 Thread Laszlo Ersek
On 05/31/13 18:33, David Woodhouse wrote:
 On Fri, 2013-05-31 at 10:43 -0500, Anthony Liguori wrote:
 It's even more fundamental.  OVMF as a whole (at least in it's usable
 form) is not Open Source. 
 
 The FAT module is required to make EDK2 usable, and yes, that's not Open
 Source. So in a sense you're right.
 
 But we're talking here about *replacing* the FAT module with something
 that *is* open source. And the FAT module isn't a fundamental part of
 EDK2; it's just an optional module that happens to be bundled with the
 repository.

Yes. *Some* FAT module is a hard requirement.

 So I think you're massively overstating the issue. OVMF/EDK2 *is* Open
 Source,

Agreed,

 and replacing the FAT module really isn't that hard.

technically it's not hard; for a seasoned file system developer (which
I'm not, of course), even possibly missing UEFI bits, it should be
children's play actually, considering the high quality of UEFI
documentation and the responsiveness of edk2-devel.

Considering US legal climate however, it appears *extremely* hard to
replace the FAT module, in my unwashed personal opinion.

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for 2013-05-28

2013-05-31 Thread Laszlo Ersek
On 05/31/13 23:03, Jordan Justen wrote:

 Of course, the fact that the current FAT driver is exclusionary for
 free software projects is a point that is not lost on me. I just don't
 agree that the best response to this is a GPL'd FAT driver.

What would you suggest?

Thank you,
Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [SeaBIOS] KVM call agenda for 2013-05-28

2013-05-30 Thread Laszlo Ersek
On 05/30/13 11:23, David Woodhouse wrote:
 On Wed, 2013-05-29 at 11:18 -0500, Anthony Liguori wrote:

 Certainly an option, but that is a long-term project.

 Out of curiousity, are there other benefits to using coreboot as a core
 firmware in QEMU?

 Is there a payload we would ever plausibly use besides OVMF and SeaBIOS?
 
 I like the idea of using Coreboot on the UEFI side — if the most
 actively used TianoCore platform is CorebootPkg instead of OvmfPkg, that
 makes it a lot easier for people using *real* hardware with Coreboot to
 use TianoCore.

Where is CorebootPkg available from?

 And it helps to dispel the stupid misconception in some quarters that
 Coreboot *competes* with UEFI and thus cannot possibly be supported
 because helping something that competes with UEFI would be bad.

I'm not sure who do you mean by some quarters, but for some
distributions Coreboot would be yet another component (package) to
support, for no obvious benefit.

(Gerd said it better than I possibly could:
http://thread.gmane.org/gmane.comp.bios.coreboot.seabios/5685/focus=5705.)

 
 Is there a payload we would ever plausibly use besides OVMF and
 SeaBIOS?
 
 For my part I want to get to the point where the default firmware
 shipped with qemu can be OVMF.

For some distributions this is a licensing question as well. See
FatBinPkg/License.txt. (The same applies if you rebuild it from source
(FatPkgDev), based on FatBinPkg/ReadMe.txt.)  For example Fedora can't
ship OVMF because of it.

If you implement a UEFI FAT driver based on Microsoft's official
specification, you're bound by the same restrictions on use and
redistribution.

If you implement a private UEFI FAT driver from scratch, or port a free
software FAT implementation (eg. the r/o one in grub or the r/w one in
mtools), you could still run into legal problems, I've been told.

If you rip out the FAT driver, then OVMF won't be UEFI compliant and no
installer media will boot on it.

Interestingly, Ubuntu has OVMF in Universe
http://packages.ubuntu.com/raring/ovmf. I think they missed the
FatBinPkg license (I would have missed it too, after all you have to
track down the licenses of every module included in the FDF file -- it
was Paolo who told me about it) and I believe they should actually ship
OVMF in Multiverse or Restricted
https://help.ubuntu.com/community/Repositories/Ubuntu.

 We have SeaBIOS-as-CSM working, which was
 one of the biggest barriers.

Agreed, and I could have never done that. Thanks for implementing it
with Kevin.

We need at least one out-of-tree edk2 patch for now (from you) but
apparently that's no problem.

 There are a few more things (like NV
 variable storage, in particular) that I need to fix before I can
 actually make that suggestion with a straight face though...

As far as I understand:
- Jordan is nearing completion of flash support on KVM,
- he also has WIP NvVar storage on top of qemu flash.

http://thread.gmane.org/gmane.comp.emulators.qemu/213690
http://thread.gmane.org/gmane.comp.bios.tianocore.devel/2781/focus=2798

(Please coordinate I guess :))

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [SeaBIOS] KVM call agenda for 2013-05-28

2013-05-30 Thread Laszlo Ersek
On 05/30/13 14:19, David Woodhouse wrote:

 Yeah, but if we're shoving a lot of hardware-specific ACPI table
 generation into the guest's firmware, instead of just doing it on the
 qemu side where a number of us seem to think it belongs, then there *is*
 a benefit to using Coreboot. When stuff changes on the qemu side and we
 have to update the table generation to match, you end up having to
 update just the Coreboot package, and *not* having to patch both SeaBIOS
 and OVMF.
 
 The extra package in the distro really isn't painful to handle, and I
 suspect it would end up *reducing* the amount of work that we have to do
 to update. You update *just* Coreboot, not *both* of SeaBIOS and OVMF.

I can't deny there's logic in that, but it still feels like tying
ourselves up in some intricate bondage choreography. We'd like to move
ACPI tables out of firmware, but we can't move them to qemu due to
project direction disagreement, so let's adopt a middleman. (I'm not
trying to denigrate coreboot -- I don't know it at all --, but
introducing it in a (granted, distro-specific) stack just for this
purpose seems quite arbitrary.)

 If you implement a private UEFI FAT driver from scratch, or port a free
 software FAT implementation (eg. the r/o one in grub or the r/w one in
 mtools), you could still run into legal problems, I've been told.
 
 That has been said, and it's been said for the FAT implementation in the
 kernel too. If a distribution is happy to ship the kernel without
 ripping out its FAT support, then I see no reason why that distribution
 wouldn't also be happy to ship a version of OVMF with a clean
 implementation of FAT support. It doesn't make sense to be happy with
 one but not the other.

Under my *personal* impression, logic and Common Law don't really mix,
especially not in the US. Still under my *personal* impression, someone
might not feel convenient suing you for redistributing code that already
exists in the upstream Linux kernel, but might happily drag you to court
for an original clean implementation, and then you can explain it's
illogical for them to do so.

The best I can do with your suggestion is to take it to our legal dept.
I would be happy to work on a new FAT driver. (I used to feel
differently earlier but I've changed my mind.)


 We need at least one out-of-tree edk2 patch for now (from you) but
 apparently that's no problem.
 
 That'll get merged soon. We are working on the corresponding spec
 update...

Great news!

Thanks,
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [SeaBIOS] KVM call agenda for 2013-05-28

2013-05-30 Thread Laszlo Ersek
On 05/30/13 18:20, Jordan Justen wrote:

 I think ACPI table generation lives in firmware on real products,
 because on real products the firmware is the point that best
 understands the actual hardware layout for the machine. In qemu, I
 would say that qemu best knows the hardware layout, given that the
 firmware is generally a slightly separate project from qemu.
 
 I don't think adding a coreboot layer into the picture helps, if it
 brings along the coreboot payload boot interface as a requirement.
 
 Then again, I don't really understand how firmware could be swapped
 out in this case. What would -bios do? How would the coreboot ACPI
 shim layer be specified to qemu?

I guess -bios would load coreboot. Coreboot would siphon the data
necessary for ACPI table building through the current (same) fw_cfg
bottleneck, build the tables, load the boot firmware (SeaBIOS or OVMF or
something else -- not sure how to configure that), and pass down the
tables to the firmware (through a now unspecified interface -- perhaps
the tables could even be installed at this point). This could introduce
another interface (fw_cfg+something rather than just fw_cfg), but ACPI
table preparation would be concentrated in one project.

I guess.

Laszlo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [SeaBIOS] KVM call agenda for 2013-05-28

2013-05-30 Thread Laszlo Ersek
On 05/30/13 18:57, Jordan Justen wrote:
 On Thu, May 30, 2013 at 9:41 AM, Laszlo Ersek ler...@redhat.com wrote:
 On 05/30/13 18:20, Jordan Justen wrote:
 I think ACPI table generation lives in firmware on real products,
 because on real products the firmware is the point that best
 understands the actual hardware layout for the machine. In qemu, I
 would say that qemu best knows the hardware layout, given that the
 firmware is generally a slightly separate project from qemu.

 I don't think adding a coreboot layer into the picture helps, if it
 brings along the coreboot payload boot interface as a requirement.

 Then again, I don't really understand how firmware could be swapped
 out in this case. What would -bios do? How would the coreboot ACPI
 shim layer be specified to qemu?

 I guess -bios would load coreboot. Coreboot would siphon the data
 necessary for ACPI table building through the current (same) fw_cfg
 bottleneck, build the tables, load the boot firmware (SeaBIOS or OVMF or
 something else -- not sure how to configure that), and pass down the
 tables to the firmware (through a now unspecified interface -- perhaps
 the tables could even be installed at this point). This could introduce
 another interface (fw_cfg+something rather than just fw_cfg), but ACPI
 table preparation would be concentrated in one project.

 I guess.
 
 For reference, I believe that both Xen and virtualbox build ACPI table
 in the VMM rather than firmware. They both dump the tables into the
 0xe000 segment (yuck) where firmware finds and publishes it to the OS.
 I think fw-cfg would be a reasonable alternative to this for
 communicating the tables.

I think Xen uses a separate utility called hvmloader that runs in the
domU.

- http://thread.gmane.org/gmane.comp.bios.coreboot.seabios/5453/focus=5668
- http://xenbits.xen.org/gitweb/?p=xen.git;a=tree;f=tools/firmware/hvmloader
- http://thread.gmane.org/gmane.comp.bios.coreboot.seabios/6255/focus=110562

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-05-29 Thread Laszlo Ersek
On 05/29/13 09:27, Paolo Bonzini wrote:
 Il 29/05/2013 06:33, Rusty Russell ha scritto:
 Paolo Bonzini pbonz...@redhat.com writes:
 Il 28/05/2013 19:32, Michael S. Tsirkin ha scritto:
 +
 +switch (addr) {
 +case offsetof(struct virtio_pci_common_cfg, 
 device_feature_select):
 +return proxy-device_feature_select;

 Oh dear no...  Please use defines like the rest of QEMU.
 Any good reason not to use offsetof?

 I'm not sure it's portable to use it in case labels.  IIRC, the
 definition ((int)(((T *)0)-field)) is not a valid C integer constant
 expression.  Laszlo?

 It's defined to yield an integer constant expression in the ISO standard
 (and I think ANSI too, though that's not at hand):
 
 It's not in C89.

It is, see 7.1.6 Common definitions stddef.h.


 The oldest compiler QEMU cares about is GCC 4.2.  I
 don't know if it has a builtin offsetof, probably it does.

Talking about nothing else but the ISO C standard(s), it doesn't matter
how the C implementation deals with offsetof() internally. C
implementation in this sense includes both compiler and standard library.

If the standard library header (stddef.h or something internal
included by it) expands the offsetof() macro to replacement text that
does pointer black magic, magic that would be otherwise nonportable or
undefined, it is alright as long as the accompanying compiler guarantees
that the replacement text will work.

That is, offsetof() gives the C implementor leeway to implement it in
wherever distribution between compiler and standard library; the main
thing is that the C program use the one offsetof() provided by the C
implementation.

Of course in the FLOSS world OS distros might mix and match gcc and
glibc statistically randomly; normally some documentation should enable
the user to put his/her system into ISO C or even SUSv[1234] mode.

(

An interesting example for, say, non-unified implementation is
getconf vs. c99. I'll pick SUSv3 for this example.

http://pubs.opengroup.org/onlinepubs/95399/utilities/getconf.html
http://pubs.opengroup.org/onlinepubs/95399/utilities/c99.html

In a nutshell one can interrogate getconf for CFLAGS, LDFLAGS and LIBS
so that c99 builds a program in ILP32_OFF32 / ILP32_OFFBIG / LP64_OFF64
/ LPBIG_OFFBIG mode (programming environment).

On my x86_64 RHEL-6.4 laptop, getconf is part of glibc
(glibc-common-2.12-1.107.el6.x86_64), while c99 is part of gcc
(gcc-4.4.7-3.el6.x86_64).

Supposing I'd like to build a 32-bit program with 64-bit off_t:

  getconf _POSIX_V6_ILP32_OFFBIG
  1

  getconf -v POSIX_V6_ILP32_OFFBIG POSIX_V6_ILP32_OFFBIG_CFLAGS
  -m32 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64

  getconf -v POSIX_V6_ILP32_OFFBIG POSIX_V6_ILP32_OFFBIG_LDFLAGS
  -m32

  getconf -v POSIX_V6_ILP32_OFFBIG POSIX_V6_ILP32_OFFBIG_LIBS
  [empty string]

However if I try to actually build a program using these flags:

  #define _XOPEN_SOURCE 600
  int main(void) { return 0; }

  c99 \
  -m32 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 \
  -o mytest \
  -m32 \
  mytest.c

I get

  /usr/bin/ld: crt1.o: No such file: No such file or directory
  collect2: ld returned 1 exit status

unless I install glibc-devel.i686.

This problem is rooted in getconf (a glibc utility) being unaware of
RHEL packaging choices: if the system can't guarantee the final c99
invocation to succeed, then the very first getconf invocation should
write -1\n or undefined\n to stdout.

(I'm aware that RHEL-6 is not certified UNIX 03; this is just an example
for problems caused by various parts of a standard's
(quasi-)implementation coming from separate projects. In that sense,
caring about the offsetof() internals may have merit.)

)

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-05-28 Thread Laszlo Ersek
On 05/28/13 19:43, Paolo Bonzini wrote:
 Il 28/05/2013 19:32, Michael S. Tsirkin ha scritto:
 +
 +switch (addr) {
 +case offsetof(struct virtio_pci_common_cfg, device_feature_select):
 +return proxy-device_feature_select;

 Oh dear no...  Please use defines like the rest of QEMU.
 Any good reason not to use offsetof?
 
 I'm not sure it's portable to use it in case labels.  IIRC, the
 definition ((int)(((T *)0)-field)) is not a valid C integer constant
 expression.  Laszlo?

As long as we use the offsetof() that comes with the C implementation
(ie. we don't hack it up ourselves), we should be safe; ISO C99
elegantly defines offsetof() in 7.17 Common definitions stddef.h p3
as

  The macros are [...]

offsetof(type, member-designator)

  which expands to an integer constant expression that has type
  *size_t*, the value of which is the offset in bytes, to the structure
  member (designated by /member-designator/), from the beginning of its
  structure (designated by /type/). The type and member designator shall
  be such that given

static type t;

  then the expression (t.member-designator) evaluates to an address
  constant. (If the specified member is a bit-field, the behavior is
  undefined.)

(Address constant is described in 6.6p9 but quoting even that here
doesn't seem necesary.)

From 6.8.4.2p3, The expression of each case label shall be an integer
constant expression [...].

So,
(a) if we use the standard offsetof(),
(b) and you can also write, for example,

  static struct virtio_pci_common_cfg t;
  static void *p = t.device_feature_select;

then the construct seems valid.

(Consistently with the above mention of UB if the specified member is a
bit-field: the address-of operator's constraints also forbid a bit-field
object as operand.)

Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: suggesting wording fixes for virtio-spec 0.9.5

2013-04-23 Thread Laszlo Ersek
On 04/23/13 06:05, Rusty Russell wrote:
 Laszlo Ersek ler...@redhat.com writes:
 Hi,

 (I'm not subscribed to either list,)

 using the word descriptor is misleading in the following sections:
 
 Yes, I like the use of 'descriptor chains'.  This is a definite
 improvement.
 
 Here's the diff I ended up with (massaged to minimize it).
 
 Thanks!
 Rusty.
 
 --- virtio-spec.txt-old   2013-04-23 13:22:21.339158214 +0930
 +++ virtio-spec.txt   2013-04-23 13:34:14.055176464 +0930
 @@ -482,10 +482,10 @@
  
  2.3.4 Available Ring
  
 -The available ring refers to what descriptors we are offering the 
 -device: it refers to the head of a descriptor chain. The “flags” 
 +The available ring refers to what descriptor chains we are offering the
 +device: each entry refers to the head of a descriptor chain. The “flags”
  field is currently 0 or 1: 1 indicating that we do not need an 
 -interrupt when the device consumes a descriptor from the 
 +interrupt when the device consumes a descriptor chain from the
  available ring. Alternatively, the guest can ask the device to 
  delay interrupts until an entry with an index specified by the “
  used_event” field is written in the used ring (equivalently, 
 @@ -671,16 +671,16 @@
  
  avail-ring[avail-idx % qsz] = head;
  
 -However, in general we can add many descriptors before we update 
 -the “idx” field (at which point they become visible to the 
 -device), so we keep a counter of how many we've added:
 +However, in general we can add many separate descriptor chains before we 
 update
 +the “idx” field (at which point they become visible to the device),
 +so we keep a counter of how many we've added:
  
  avail-ring[(avail-idx + added++) % qsz] = head;
  
  2.4.1.3 Updating The Index Field
  
  Once the idx field of the virtqueue is updated, the device will 
 -be able to access the descriptor entries we've created and the 
 +be able to access the descriptor chains we've created and the 
  memory they refer to. This is why a memory barrier is generally 
  used before the idx update, to ensure it sees the most up-to-date 
  copy.
 

Not sure if it's customary here or if you need it / want it, but anyway

Reviewed-by: Laszlo Ersek ler...@redhat.com

(Also I've fixed the OVMF driver; just reposting the patch today with a
better commit message.)

Thanks much!
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


suggesting wording fixes for virtio-spec 0.9.5

2013-04-22 Thread Laszlo Ersek
Hi,

(I'm not subscribed to either list,)

using the word descriptor is misleading in the following sections:

  2.4.1.2 Updating The Available Ring

  [...] However, in general we can add many descriptors before we
  update the idx field (at which point they become visible to the
  device), so we keep a counter of how many we've added: [...]

and

  2.4.1.3 Updating The Index Field

  Once the idx field of the virtqueue is updated, the device will be
  able to access the descriptor entries we've created and the memory
  they refer to. [...]

(The word descriptor in the above language is the reason I
mis-implemented the virtio-blk guest driver in OVMF.)

In fact the available ring tracks *head* descriptors only. I suggest

  s/many descriptors/many separate descriptor chains/
  s/descriptor entries/separate descriptor chains/

for the above.

Similarly, 2.3.4 Available Ring should start with something like

  The available ring describes what descriptor chains we are offering
  the device: each entry of the available ring refers to the head
  descriptor of a separate descriptor chain.

Thanks
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fixup e432cef9 (aio help text): end sentences with periods

2012-01-24 Thread Laszlo Ersek
(Please keep me CC'd on any followup; I'm not subscribed. Thanks.)

Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 qemu-io.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu-io.c b/qemu-io.c
index ffa62fb..938b20c 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -1118,7 +1118,7 @@ static void aio_read_help(void)
  Reads a segment of the currently open file, optionally dumping it to the\n
  standard output stream (with -v option) for subsequent inspection.\n
  The read is performed asynchronously and the aio_flush command must be\n
- used to ensure all outstanding aio requests have been completed\n
+ used to ensure all outstanding aio requests have been completed.\n
  -C, -- report statistics in a machine parsable format\n
  -P, -- use a pattern to verify read data\n
  -v, -- dump buffer to standard output\n
@@ -1214,7 +1214,7 @@ static void aio_write_help(void)
  Writes into a segment of the currently open file, using a buffer\n
  filled with a set pattern (0xcdcdcdcd).\n
  The write is performed asynchronously and the aio_flush command must be\n
- used to ensure all outstanding aio requests have been completed\n
+ used to ensure all outstanding aio requests have been completed.\n
  -P, -- use different pattern to fill file\n
  -C, -- report statistics in a machine parsable format\n
  -q, -- quiet mode, do not show I/O statistics\n
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fixup e432cef9 (aio help text): end sentences with periods

2012-01-24 Thread Laszlo Ersek

On 01/24/12 21:05, Markus Armbruster wrote:

Laszlo Ersekler...@redhat.com  writes:


(Please keep me CC'd on any followup; I'm not subscribed. Thanks.)


Patch is fine, but it needs to go toqemu-de...@nongnu.org.


Ooops... Thanks!
Laszlo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html