Re: Timekeeping on ARM guests/hosts

2018-10-11 Thread Laszlo Ersek
On 10/11/18 09:54, Marc Zyngier wrote:
> Hi Miriam,
> 
> On Wed, 10 Oct 2018 19:38:47 +0100,
> Miriam Zimmerman  wrote:
>>
>> (oops, sorry for lack of plaintext in the first email. must've
>> forgotten to click the button in my email client)
>>
>> Until that happens, what's the best workaround? Just running an ntp
>> daemon in guest?
> 
> Christoffer reminded me yesterday that stolen time accounting only
> affects scheduling, and is not evaluated for
> 
> An NTP daemon may not be the best course of action, as the guest is
> going to see a massive jump anyway, which most NTP implementations are
> not designed to handle (they rightly assume that something else is
> wrong). It would also mean that you'd have to run an NTP server
> somewhere on the host, as you cannot always assume full connectivity.
> 
> A popular way to solve this seems to be using the QEMU guest agent,
> but I must admit I never really investigated that side of the problem.

The guest agent method is documented here, for example:

https://git.qemu.org/?p=qemu.git;a=blob;f=qga/qapi-schema.json;h=dfbc4a5e32bde4070f12497c23973c604accfa7d;hb=v3.0.0#l128

and IIRC it is exposed (for example) via "virsh domtime" to the libvirt
user (or to higher level mgmt tools).

I suspect though that the guest agent method might introduce the same
kind of jump to the guest clock.

> I'm quite curious of how this is done on x86 though. KVM_GUEST mostly
> seems to give the guest a PV clocksource, which is not going to help in
> terms of wall clock. Do you have any idea?

I've seen this question raised, specifically wrt. x86, with people
closing their laptops' lids, and their guests losing correct track of
time. AIUI, there is no easy answer. (I was surprised to see Miriam's
initial statement that CONFIG_KVM_GUEST had solved it.) Some references:

https://bugs.launchpad.net/qemu/+bug/1174654
https://bugzilla.redhat.com/show_bug.cgi?id=1352992
https://bugzilla.redhat.com/show_bug.cgi?id=1380893

I'll spare you the verbatim quoting of the emails that I produced back
then :) ; a summary of workarounds is:

* Before you suspend the host, suspend the guest first. This way the
guest will not be surprised when it sees the physical clock (= whatever
it thinks is a physical clock) jump. Another benefit is that, if the
host fails to resume for some reason, data loss on the VM disks should
be reasonably unlikely, because when the guest suspends, it will flush
its stuff first.

* Use "-rtc clock=vm" on the QEMU command line. (Equivalently, use
 in the libvirt domain XML.) See the
QEMU manual, and the libvirt domain XML manual on these. Those settings
decouple the guest's RTC from the host's time, bringing both benefits
(no jumps in guest time) and drawbacks (the timelines diverge).

* Also, I've heard rumors that libvirtd might put a suspend inhibitor in
place (on the host) while some VMs are running. ("Suspend inhibitor" is
a SystemD term, I think.) Not sure how/if that works in practice; either
way it would solve the issue from a different perspective (namely, you
couldn't suspend the host).


Obviously I'm not trying to speak on this with any kind of "authority",
so take it FWIW. I happen to be a fan of the first option (manual guest
suspend).

Thanks,
Laszlo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: use vnc in arm-kvm

2018-05-16 Thread Laszlo Ersek
On 05/16/18 04:24, lizhuoyao wrote:
> hi everyone:
>   currently, I have a problem with qemu. Using the virt-install command to
> create a domain fails:
>   command: virt-install -n centos-gg -r 1024 --disk
> centos-gg.img,format=qcow2,size=10 --cdrom
> /home/Centos-7-aarch64-Everything.iso --graphics vnc,listen=0.0.0.0
> failure: this QEMU does not support 'cirrus' video device
>   This issue only happens on ARM; it's fine on x86. I read on the
> internet that VGA is a (more or less) x86-specific standard, not
> available on ARM. Right?
> tools: qemu-2.12.0 libvirt-3.2.0

Please select the "virtio" video device model.

(Also, the kvmarm list is not appropriate for discussing this; please
use the QEMU or libvirt mailing lists instead.)

Thanks
Laszlo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [Qemu-devel] [PATCH v3 1/4] ACPI: Add APEI GHES Table Generation support

2017-07-07 Thread Laszlo Ersek
On 07/07/17 10:32, gengdongjiu wrote:
> Hi Laszlo,
>sorry for my late response.
> 
> On 2017/6/3 20:01, Laszlo Ersek wrote:
>> On 05/22/17 16:23, Laszlo Ersek wrote:
>>> Keeping some context:
>>>
>>> On 05/12/17 23:00, Laszlo Ersek wrote:
>>>> On 04/30/17 07:35, Dongjiu Geng wrote:
>>
>>> (68) In the code below, you are not taking an "OVMF header probe
>>> suppressor" into account.
>>>
>>> But, we have already planned to replace that quirk with a separate,
>>> dedicated allocation hint or command, so I'm not going to describe what
>>> an "OVMF header probe suppressor" is; instead, I'll describe the
>>> replacement for it.
>>>
>>> [...]
>>
>> So, the NOACPI allocation hint is a no-go at the moment, based on the
>> discussion in the following threads:
>>
>> http://mid.mail-archive.com/20170601112241.2580-1-ard.biesheuvel@linaro.org
>>
>> http://mid.mail-archive.com/c76b36de-ebf9-c662-d454-0a95b43901e8@redhat.com
>>
>> Therefore the header probe suppression remains necessary.
>>
>> In this case, it is not hard to do, you just have to reorder the
>> following two ADD_POINTER additions a bit:
>  Ok, it is no problem.
> 
>>
>>>>> +
>>>>> +bios_linker_loader_add_pointer(linker, GHES_ERRORS_FW_CFG_FILE,
>>>>> +sizeof(uint64_t) * i, sizeof(uint64_t),
>>>>> +GHES_ERRORS_FW_CFG_FILE,
>>>>> +MAX_ERROR_SOURCE_COUNT_V6 * 
>>>>> sizeof(uint64_t) +
>>>>> +i * MAX_RAW_DATA_LENGTH);
>>
>> This one should be moved out to a separate loop, after the current loop.
>>
>>>>> +bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
>>>>> +address_registers_offset
>>>>> ++ i * sizeof(AcpiGenericHardwareErrorSource),
>>>>> +sizeof(uint32_t), GHES_ERRORS_FW_CFG_FILE,
>>>>> +i * sizeof(uint64_t));
>>
>> This one should be kept in the first (i.e., current) loop.
>>
>> The idea is, when you first point the HEST/GHES_n entries in
>> ACPI_BUILD_TABLE_FILE to the "address registers" in
>> GHES_ERRORS_FW_CFG_FILE, all those address registers will still be
>> zero-filled. This will fail the ACPI table header probe in
>> OvmfPkg/AcpiPlatformDxe, which is what we want.
>>
>> After this is done, the address registers in GHES_ERRORS_FW_CFG_FILE
>> should be pointed to the error status data blocks in the same fw_cfg
>> blob, in a separate loop. (Those error status data blocks will again be
>> zero-filled, so no ACPI table headers will be mistakenly recognized in
>> them.)
> I understand your idea, but I have a question:
> how about we just exchange the two functions' places, as shown below?
> That would still meet our needs, and the change is easy.
> For every loop iteration:
> (1) when patching the address in ACPI_BUILD_TABLE_FILE to point at the
> "address registers", the address register is zero-filled.
> (2) when patching the address in GHES_ERRORS_FW_CFG_FILE to point at the
> error status data blocks, the error status data block is still zero-filled.
> 
> for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++) {
> .
> bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
> address_registers_offset
> + i * sizeof(AcpiGenericHardwareErrorSource),
> sizeof(uint32_t), GHES_ERRORS_FW_CFG_FILE,
> i * sizeof(uint64_t));
> 
> 
> bios_linker_loader_add_pointer(linker, GHES_ERRORS_FW_CFG_FILE,
> sizeof(uint64_t) * i, sizeof(uint64_t),
> GHES_ERRORS_FW_CFG_FILE,
> MAX_ERROR_SOURCE_COUNT_V6 * sizeof(uint64_t) +
> i * MAX_RAW_DATA_LENGTH);
>   .
> 
>  }

Your suggestion seems to do the same, but there is a subtle difference.

When the firmware scans the targets of the ADD_POINTER commands for byte
sequences that "look like" an ACPI table header, in order to suppress
that probe, we should keep 36 bytes (i.e., the size of the ACPI table
header structure) zeroed at the target location.

* In the patch, as posted, this was not the case, because it first
filled in the address register inside the GHES_ERRORS_FW_CFG_FILE blob

Re: [Qemu-devel] [PATCH v3 1/4] ACPI: Add APEI GHES Table Generation support

2017-06-03 Thread Laszlo Ersek
On 05/22/17 16:23, Laszlo Ersek wrote:
> Keeping some context:
> 
> On 05/12/17 23:00, Laszlo Ersek wrote:
>> On 04/30/17 07:35, Dongjiu Geng wrote:

> (68) In the code below, you are not taking an "OVMF header probe
> suppressor" into account.
> 
> But, we have already planned to replace that quirk with a separate,
> dedicated allocation hint or command, so I'm not going to describe what
> an "OVMF header probe suppressor" is; instead, I'll describe the
> replacement for it.
> 
> [...]

So, the NOACPI allocation hint is a no-go at the moment, based on the
discussion in the following threads:

http://mid.mail-archive.com/20170601112241.2580-1-ard.biesheuvel@linaro.org

http://mid.mail-archive.com/c76b36de-ebf9-c662-d454-0a95b43901e8@redhat.com

Therefore the header probe suppression remains necessary.

In this case, it is not hard to do, you just have to reorder the
following two ADD_POINTER additions a bit:

>>> +
>>> +bios_linker_loader_add_pointer(linker, GHES_ERRORS_FW_CFG_FILE,
>>> +sizeof(uint64_t) * i, sizeof(uint64_t),
>>> +GHES_ERRORS_FW_CFG_FILE,
>>> +MAX_ERROR_SOURCE_COUNT_V6 * 
>>> sizeof(uint64_t) +
>>> +i * MAX_RAW_DATA_LENGTH);

This one should be moved out to a separate loop, after the current loop.

>>> +bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
>>> +address_registers_offset
>>> ++ i * sizeof(AcpiGenericHardwareErrorSource),
>>> +sizeof(uint32_t), GHES_ERRORS_FW_CFG_FILE,
>>> +i * sizeof(uint64_t));

This one should be kept in the first (i.e., current) loop.

The idea is, when you first point the HEST/GHES_n entries in
ACPI_BUILD_TABLE_FILE to the "address registers" in
GHES_ERRORS_FW_CFG_FILE, all those address registers will still be
zero-filled. This will fail the ACPI table header probe in
OvmfPkg/AcpiPlatformDxe, which is what we want.

After this is done, the address registers in GHES_ERRORS_FW_CFG_FILE
should be pointed to the error status data blocks in the same fw_cfg
blob, in a separate loop. (Those error status data blocks will again be
zero-filled, so no ACPI table headers will be mistakenly recognized in
them.)
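
Roughly, the two loops would then look like this (just a sketch, reusing
the constants and variable names from the patch under review; untested):

    /* Loop 1: patch the HEST/GHES_n error status address fields in
     * ACPI_BUILD_TABLE_FILE so that they point at the address registers.
     * The registers are still all zeroes at this point, so the OVMF
     * header probe cannot mistake them for an ACPI table header.
     */
    for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++) {
        bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
            address_registers_offset
            + i * sizeof(AcpiGenericHardwareErrorSource),
            sizeof(uint32_t), GHES_ERRORS_FW_CFG_FILE,
            i * sizeof(uint64_t));
    }

    /* Loop 2: only now point the address registers at the error status
     * data blocks; those blocks are zero-filled as well.
     */
    for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++) {
        bios_linker_loader_add_pointer(linker, GHES_ERRORS_FW_CFG_FILE,
            sizeof(uint64_t) * i, sizeof(uint64_t),
            GHES_ERRORS_FW_CFG_FILE,
            MAX_ERROR_SOURCE_COUNT_V6 * sizeof(uint64_t) +
            i * MAX_RAW_DATA_LENGTH);
    }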

Thanks
Laszlo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [Qemu-devel] [PATCH v3 1/4] ACPI: Add APEI GHES Table Generation support

2017-05-29 Thread Laszlo Ersek
Hi,

did you remove me from the To: / Cc: list intentionally, or was that an
oversight? I caught your message in my list folders only by luck.

Some followup below:

On 05/29/17 17:27, gengdongjiu wrote:

>> (46) What is "physical_addr" good for? Below I can only see an 
>> assignment to it, in ghes_update_guest(). Where is the field read?

> this "physical_addr" address is the physical error address in the
> CPER. such as the physical address that happen hwpoison, this address
> is delivered by the KVM and QEMU transfer this address to physical.
I understand that in the ghes_update_guest() function, you accept a
parameter called "physical_address", and you pass it on to
ghes_generate_cper_record(). That makes sense, yes.

However, you also assign the same value to "ges.physical_addr". And that
structure field is never read. So my point is that the
"GhesErrorState.physical_addr" field is superfluous and should be removed.

I checked the other three patches in the series and they don't seem to
read that structure member either. Correct me if I'm wrong.

>> (55) What happens if you run out of the preallocated memory?

> if it runs out of the preallocated memory, it will overwrite another
> error source. Every block's size is fixed, so it is not easy to
> dynamically extend the size on overflow. Anyway, I will add an
> error report if an overwrite happens.
I understand (and agree) that dynamic allocation is not possible here.

But that doesn't justify overwriting the error status data block that
belongs to a different data source. (Worse, if this happens with the
last error status data block, for error source 10, you could overwrite
memory that belongs to the OS.)

If an error status data block becomes full, then we should either wrap
back to the start of the same data block, or else stop forwarding errors
for that error source.
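
Illustratively, the check could look like this (names made up, not code
from the patch):

    /* appending a CPER record of "len" bytes to this source's fixed-size
     * error status data block */
    if (block_offset + len > MAX_RAW_DATA_LENGTH) {
        block_offset = 0;  /* wrap back to the start of this block ... */
        /* ... or simply return here, i.e. stop forwarding errors for this
         * error source, rather than spill into the neighboring block
         */
    }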

Does the ACPI spec say anything about this? I.e., about the case when
the system runs out of the memory that was reserved for recording
hardware errors?

 +
 +mem_err = (struct cper_sec_mem_err *) (gdata + 1);
 +
 +/* In order to simplify simulation, hardcode the CPER section to memory
 + * section.
 + */
 +mem_err->validation_bits |= CPER_MEM_VALID_ERROR_TYPE;
 +mem_err->error_type = 3;
>>
>> (58) Is this supposed to stand for "Multi-bit ECC" (from "N.2.5 Memory 
>> Error Section" in UEFI 2.6)? Should we have a macro for that?

> Yes, it is. What do you mean a macro?

A #define for the integer value 3.
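
For example (the macro name below is only an illustration; the value 3 is
the "Multi-bit ECC" memory error type from UEFI 2.6, "N.2.5 Memory Error
Section"):

    #define CPER_MEM_ERROR_TYPE_MULTI_BIT_ECC   3

    mem_err->error_type = CPER_MEM_ERROR_TYPE_MULTI_BIT_ECC;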

> For all the errors that happen in the guest OS, in order to simplify the
> simulation, I abstract every error section to a memory section, even if
> the actual error section is a processor or other section type.
Why is that a valid thing to do? (I'm not doubting it is valid, I'm just
asking.) Will that not confuse the ACPI subsystem of the guest OS?

> I do not know whether you have some suggestion for that.
Well I would have thought (without any expertise on the subject) that
hardware errors from the host side should be mapped to the guest more or
less "type correct". IOW, it looks strange that, say, a CPU error is
reported as a memory error. But this is just an uneducated guess.

 +mem_err->validation_bits |= CPER_MEM_VALID_CARD | CPER_MEM_VALID_MODULE |
 +CPER_MEM_VALID_BANK | CPER_MEM_VALID_ROW |
 +CPER_MEM_VALID_COLUMN | CPER_MEM_VALID_BIT_POSITION;
 +mem_err->card = 1;
 +mem_err->module = 2;
 +mem_err->bank = 3;
 +mem_err->row = 1;
 +mem_err->column = 2;
 +mem_err->bit_pos = 5;
>>
>> (60) I have no idea where these values come from.

> For all the errors that happen in the guest OS, in order to simplify the
> simulation, I abstract every error section to a memory section, and
> hardcode the memory section error values as above.
Sure, but why is that safe? Will the guest OS not want to do something
about these error details? If we are feeding the guest OS invalid error
details, will that not lead to confusion?

>> (64) What does "reqr" stand for?
> It stands for the request size.
Can you please call it "req_size" or something similar? The English
expression

  request size

contains only one "r" letter, so it's hard to understand where the
second "r" in "reqr" comes from.

Thanks,
Laszlo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [Qemu-devel] [PATCH v3 1/4] ACPI: Add APEI GHES Table Generation support

2017-05-22 Thread Laszlo Ersek
Keeping some context:

On 05/12/17 23:00, Laszlo Ersek wrote:
> On 04/30/17 07:35, Dongjiu Geng wrote:
>> This implements APEI GHES Table by passing the error cper info
>> to the guest via a fw_cfg_blob. After a CPER info is added, an
>> SEA/SEI exception will be injected into the guest OS.
>>
>> Below is the table layout, the max number of error sources is 11,
>> which is classified by notification type.
>>
>> etc/acpi/tables etc/hardware_errors
>>  ==
>>  +---+
>> +--+ | address   | +-> +--+
>> |HEST  + | registers | |   | Error Status |
>> + ++ | +-+ |   | Data Block 1 |
>> | | GHES1  | --> | |address1 | +   | ++
>> | | GHES2  | --> | |address2 | --+ | |  CPER  |
>> | | GHES3  | --> | |address3 | + | | |  CPER  |
>> | |    | --> | | ... | | | | |  CPER  |
>> | | GHES10 | --> | |address10| -+  | | | |  CPER  |
>> +-++ +-+-+  |  | | +-++
>> |  | |
>> |  | +---> +--+
>> |  |   | Error Status |
>> |  |   | Data Block 2 |
>> |  |   | ++
>> |  |   | |  CPER  |
>> |  |   | |  CPER  |
>> |  |   +-++
>> |  |
>> |  +-> +--+
>> |  | Error Status |
>> |  | Data Block 3 |
>> |  | ++
>> |  | |  CPER  |
>> |  +-++
>> |...
>> +> +--+
>>| Error Status |
>>| Data Block 10|
>>| ++
>>| |  CPER  |
>>| |  CPER  |
>>| |  CPER  |
>>+-++
>>
>> Signed-off-by: Dongjiu Geng <gengdong...@huawei.com>
>> ---
>>  default-configs/arm-softmmu.mak |   1 +
>>  hw/acpi/Makefile.objs   |   1 +
>>  hw/acpi/aml-build.c |   2 +
>>  hw/acpi/hest_ghes.c | 203 +++
>>  hw/arm/virt-acpi-build.c|   6 ++
>>  include/hw/acpi/acpi-defs.h | 227 
>> 
>>  include/hw/acpi/aml-build.h |   1 +
>>  include/hw/acpi/hest_ghes.h |  43 
>>  8 files changed, 484 insertions(+)
>>  create mode 100644 hw/acpi/hest_ghes.c
>>  create mode 100644 include/hw/acpi/hest_ghes.h

> Next file:
>
>> diff --git a/include/hw/acpi/hest_ghes.h b/include/hw/acpi/hest_ghes.h
>> new file mode 100644
>> index 000..0cadc2b
>> --- /dev/null
>> +++ b/include/hw/acpi/hest_ghes.h
>> @@ -0,0 +1,43 @@
>> +#ifndef ACPI_GHES_H
>> +#define ACPI_GHES_H
>> +
>> +#include "hw/acpi/bios-linker-loader.h"
>> +
>> +#define GHES_ERRORS_FW_CFG_FILE  "etc/hardware_errors"
>> +#define GHES_DATA_ADDR_FW_CFG_FILE  "etc/hardware_errors_addr"
>> +
>> +#define GAS_ADDRESS_OFFSET  4
>> +#define ERROR_STATUS_ADDRESS_OFFSET 20
>> +#define NOTIFICATION_STRUCTURE  32
>> +
>> +#define BFAPEI_OK   0
>> +#define BFAPEI_FAIL 1
>> +
>> +/* The max number of error source, the error sources
>> + * are classified by notification type, below is the definition
>> + * 0 - Polled
>> + * 1 - External Interrupt
>> + * 2 - Local Interrupt
>> + * 3 - SCI
>> + * 4 - NMI
>> + * 5 - CMCI
>> + * 6 - MCE
>> + * 7 - GPIO-Signal
>> + * 8 - ARMv8 SEA
>> + * 9 - ARMv8 SEI
>> +

Re: [Qemu-devel] [PATCH v3 1/4] ACPI: Add APEI GHES Table Generation support

2017-05-12 Thread Laszlo Ersek
On 04/30/17 07:35, Dongjiu Geng wrote:
> This implements APEI GHES Table by passing the error cper info
> to the guest via a fw_cfg_blob. After a CPER info is added, an
> SEA/SEI exception will be injected into the guest OS.
>
> Below is the table layout, the max number of error sources is 11,
> which is classified by notification type.
>
> etc/acpi/tables etc/hardware_errors
>  ==
>  +---+
> +--+ | address   | +-> +--+
> |HEST  + | registers | |   | Error Status |
> + ++ | +-+ |   | Data Block 1 |
> | | GHES1  | --> | |address1 | +   | ++
> | | GHES2  | --> | |address2 | --+ | |  CPER  |
> | | GHES3  | --> | |address3 | + | | |  CPER  |
> | |    | --> | | ... | | | | |  CPER  |
> | | GHES10 | --> | |address10| -+  | | | |  CPER  |
> +-++ +-+-+  |  | | +-++
> |  | |
> |  | +---> +--+
> |  |   | Error Status |
> |  |   | Data Block 2 |
> |  |   | ++
> |  |   | |  CPER  |
> |  |   | |  CPER  |
> |  |   +-++
> |  |
> |  +-> +--+
> |  | Error Status |
> |  | Data Block 3 |
> |  | ++
> |  | |  CPER  |
> |  +-++
> |...
> +> +--+
>| Error Status |
>| Data Block 10|
>| ++
>| |  CPER  |
>| |  CPER  |
>| |  CPER  |
>+-++
>
> Signed-off-by: Dongjiu Geng 
> ---
>  default-configs/arm-softmmu.mak |   1 +
>  hw/acpi/Makefile.objs   |   1 +
>  hw/acpi/aml-build.c |   2 +
>  hw/acpi/hest_ghes.c | 203 +++
>  hw/arm/virt-acpi-build.c|   6 ++
>  include/hw/acpi/acpi-defs.h | 227 
> 
>  include/hw/acpi/aml-build.h |   1 +
>  include/hw/acpi/hest_ghes.h |  43 
>  8 files changed, 484 insertions(+)
>  create mode 100644 hw/acpi/hest_ghes.c
>  create mode 100644 include/hw/acpi/hest_ghes.h

Disclaimer: I'm not an ACPI (or any kind of) QEMU maintainer, so I can
only share my personal opinion.

(1) This patch is too big. It should be split in two parts at least.

The first patch should contain the new ACPI structures and macros. The
second patch should contain the generation feature.

I'll reorder the diff in my response.

> diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h
> index 4cc3630..27adede 100644
> --- a/include/hw/acpi/acpi-defs.h
> +++ b/include/hw/acpi/acpi-defs.h
> @@ -295,6 +295,58 @@ typedef struct AcpiMultipleApicTable 
> AcpiMultipleApicTable;
>  #define ACPI_APIC_GENERIC_TRANSLATOR15
>  #define ACPI_APIC_RESERVED  16   /* 16 and greater are reserved 
> */
>

(2) Please add a comment above the following macros: they come from the
UEFI Spec 2.6, "N.2.5 Memory Error Section".

> +#define CPER_MEM_VALID_ERROR_STATUS 0x0001
> +#define CPER_MEM_VALID_PA   0x0002
> +#define CPER_MEM_VALID_PA_MASK  0x0004
> +#define CPER_MEM_VALID_NODE 0x0008
> +#define CPER_MEM_VALID_CARD 0x0010
> +#define CPER_MEM_VALID_MODULE   0x0020
> +#define CPER_MEM_VALID_BANK 0x0040
> +#define CPER_MEM_VALID_DEVICE   0x0080
> +#define CPER_MEM_VALID_ROW  0x0100
> +#define CPER_MEM_VALID_COLUMN   0x0200
> +#define CPER_MEM_VALID_BIT_POSITION 0x0400
> +#define CPER_MEM_VALID_REQUESTOR_ID 0x0800
> +#define CPER_MEM_VALID_RESPONDER_ID 0x1000
> +#define CPER_MEM_VALID_TARGET_ID0x2000

(3) _ID should be dropped.

> +#define CPER_MEM_VALID_ERROR_TYPE   0x4000
> +#define CPER_MEM_VALID_RANK_NUMBER  0x8000
> +#define CPER_MEM_VALID_CARD_HANDLE  0x1
> +#define 

Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-04-24 Thread Laszlo Ersek
On 04/21/17 15:27, gengdongjiu wrote:
> Hi all/Laszlo,
> 
>   sorry, I have a question to consult with you.
> 
> 
> On 2017/4/7 2:55, Laszlo Ersek wrote:
>> On 04/06/17 14:35, gengdongjiu wrote:
>>> Dear, Laszlo
>>>Thanks for your detailed explanation.
>>>
>>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>>> (This ought to be one of the longest address lists I've ever seen :)
>>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>>> measure, I'm adding MST and Igor.)
>>>>
>>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>>> Hi gengdongjiu,
>>>>>
>>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>>
>>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>>
>>>>>>Now I encounter an issue and want to consult with you on the ARM64
>>>>>> platform, as described below:
>>>>>>
>>>>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>>>>> to send the error address to Qemu or UEFI through sigbus to
>>>>>> dynamically generate APEI table. from my investigation, there are
>>>>>> two ways:
>>>>>>
>>>>>> (1) Qemu get the error address, and generate the APEI table, then
>>>>>> notify UEFI to know this generation, then inject abort error to
>>>>>> guest OS, guest OS read the APEI table.
>>>>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>>>>> table, then inject abort error to guest OS, guest OS read the APEI
>>>>>> table.
>>>>>
>>>>> Just being pedantic! I don't think we are talking about creating the APEI 
>>>>> table
>>>>> dynamically here. The issue is: Once KVM has received an error that is 
>>>>> destined
>>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject 
>>>>> the error
>>>>> into the guest OS, a CPER (Common Platform Error Record) has to be 
>>>>> generated
>>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>>> error. So who should create the CPER is the question.
>>>>>
>>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error 
>>>>> arrives
>>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>>> responsible for creating the CPER. ARM is experimenting with using a 
>>>>> Standalone
>>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>>> adding the same code in ARM TF in EL3 (better for security). The error 
>>>>> will then
>>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM 
>>>>> Trusted
>>>>> Firmware.
>>>>>
>>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>>> interface (as discussed with Christoffer below). So it should generate 
>>>>> the CPER
>>>>> before injecting the error.
>>>>>
>>>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming 
>>>>> you
>>>>> mean guest UEFI). At this time, the guest OS already knows where to pick 
>>>>> up the
>>>>> CPER from through the HEST. Qemu has to create the CPER and populate its 
>>>>> address
>>>>> at the address exported in the HEST. Guest UEFI should not be involved in 
>>>>> this
>>>>> flow. Its job was to create the HEST at boot and that has been done by 
>>>>> this
>>>>> stage.
>>>>>
>>>>> Qemu folk will be able to add but it looks like support for CPER 
>>>>> generation will
>>>>> need to be added to Qemu. We need to resolve this.
>>>>>
>>>>> Do shout if I am missing anything above.
>>>>
>>>> After reading this email, the use case looks *very* similar to what
>>>> we've just done with VMGENID for QEMU 2.9.
>>>>
>>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>>> linker/loader", with which QEMU instructs the firmware to
>>>>
>

Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-04-07 Thread Laszlo Ersek
On 04/07/17 04:52, gengdongjiu wrote:
> 
> On 2017/4/7 2:55, Laszlo Ersek wrote:

>> I'm unsure if, by "not fixed", you are saying
>>
>>   the number of CPER entries that fits in Error Status Data Block N is
>>   not *uniform* across 0 <= N <= 10 [1]
>>
>> or
>>
>>   the number of CPER entries that fits in Error Status Data Block N is
>>   not *known* in advance, for all of 0 <= N <= 10 [2]
>>
>> Which one is your point?
>>
>> If [1], that's no problem; you can simply sum the individual error
>> status data block sizes in advance, and allocate "etc/hardware_errors"
>> accordingly, using the total size.
>>
>> (Allocating one shared fw_cfg blob for all status data blocks is more
>> memory efficient, as each ALLOCATE command will allocate whole pages
>> (rounded up from the actual blob size).)
>>
>> If your point is [2], then splitting the error status data blocks to
>> separate fw_cfg blobs makes no difference: regardless of whether we try
>> to place all the error status data blocks in a single fw_cfg blob, or in
>> separate fw_cfg blobs, the individual data block cannot be resized at OS
>> runtime, so there's no way to make it work.
>>
> My Point is [2]. The HEST(Hardware Error Source Table) table format is here:
> https://wiki.linaro.org/LEG/Engineering/Kernel/RAS/APEITables#Hardware_Error_Source_Table_.28HEST.29
> 
> Now I understand your thought.

But if you mean [2], then I am confused, with regard to firmware on
physical hardware. Namely, even on physical machines, the firmware has
to estimate, in advance, the area size that will be needed for CPERs,
doesn't it? And once the firmware allocates that memory area, it cannot
be resized at OS runtime. If there are more CPERs at runtime (due to
hardware errors) than the firmware allocated room for, they must surely
wrap around in the preallocated buffer (like in a ring buffer). Isn't
that correct?

On the diagrams that you linked above (great looking diagrams BTW!), I
see CPER in two places (it is helpfully shaded red):

- to the right of BERT; the CPER is part of a box that is captioned
"firmware reserved memory"

- to the right of HEST; again the CPER is part of a box that is
captioned "firmware reserved memory"

So, IMO, when QEMU has to guesstimate the room for CPERs in advance,
that doesn't differ from the physical firmware case. In QEMU maybe you
can let the user specify the area size on the command line, with a
machine type property or similar.
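
Just to illustrate the "sum the individual sizes in advance" idea from
earlier in this thread (the variable names are made up, this is not code
from the patch):

    /* one shared "etc/hardware_errors" blob: the address registers,
     * followed by the per-source error status data blocks, all sized up
     * front */
    size_t i, total = MAX_ERROR_SOURCE_COUNT_V6 * sizeof(uint64_t);

    for (i = 0; i < MAX_ERROR_SOURCE_COUNT_V6; i++) {
        total += block_size[i]; /* per-source block size, fixed in advance */
    }
    hardware_errors = g_array_sized_new(FALSE, TRUE /* zero-fill */, 1,
                                        total);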

Thanks
Laszlo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-04-06 Thread Laszlo Ersek
On 04/06/17 14:35, gengdongjiu wrote:
> Dear, Laszlo
>Thanks for your detailed explanation.
> 
> On 2017/3/29 19:58, Laszlo Ersek wrote:
>> (This ought to be one of the longest address lists I've ever seen :)
>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>> measure, I'm adding MST and Igor.)
>>
>> On 03/29/17 12:36, Achin Gupta wrote:
>>> Hi gengdongjiu,
>>>
>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>
>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>
>>>>Now I encounter an issue and want to consult with you on the ARM64
>>>> platform, as described below:
>>>>
>>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>>> to send the error address to Qemu or UEFI through sigbus to
>>>> dynamically generate APEI table. from my investigation, there are
>>>> two ways:
>>>>
>>>> (1) Qemu get the error address, and generate the APEI table, then
>>>> notify UEFI to know this generation, then inject abort error to
>>>> guest OS, guest OS read the APEI table.
>>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>>> table, then inject abort error to guest OS, guest OS read the APEI
>>>> table.
>>>
>>> Just being pedantic! I don't think we are talking about creating the APEI 
>>> table
>>> dynamically here. The issue is: Once KVM has received an error that is 
>>> destined
>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the 
>>> error
>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>> processor etc) to allow the guest OS to do anything meaningful with the
>>> error. So who should create the CPER is the question.
>>>
>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error 
>>> arrives
>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>> responsible for creating the CPER. ARM is experimenting with using a 
>>> Standalone
>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>> adding the same code in ARM TF in EL3 (better for security). The error will 
>>> then
>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM 
>>> Trusted
>>> Firmware.
>>>
>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>> interface (as discussed with Christoffer below). So it should generate the 
>>> CPER
>>> before injecting the error.
>>>
>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming 
>>> you
>>> mean guest UEFI). At this time, the guest OS already knows where to pick up 
>>> the
>>> CPER from through the HEST. Qemu has to create the CPER and populate its 
>>> address
>>> at the address exported in the HEST. Guest UEFI should not be involved in 
>>> this
>>> flow. Its job was to create the HEST at boot and that has been done by this
>>> stage.
>>>
>>> Qemu folk will be able to add but it looks like support for CPER generation 
>>> will
>>> need to be added to Qemu. We need to resolve this.
>>>
>>> Do shout if I am missing anything above.
>>
>> After reading this email, the use case looks *very* similar to what
>> we've just done with VMGENID for QEMU 2.9.
>>
>> We have a facility between QEMU and the guest firmware, called "ACPI
>> linker/loader", with which QEMU instructs the firmware to
>>
>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>> ALLOCATE command,
>>
>> - relocate pointers in those blobs, to fields in other (or the same)
>> blobs -- ADD_POINTER command,
>>
>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>
>> - and send GPAs of fields within such blobs back to QEMU --
>> WRITE_POINTER command.
>>
>> This is how I imagine we can map the facility to the current use case
>> (note that this is the first time I read about HEST / GHES / CPER):
>>
>> etc/acpi/tables etc/hardware_errors
>>  ==
>>  +-

Re: host stalls when qemu-system-aarch64 with kvm and pflash

2017-03-30 Thread Laszlo Ersek
On 03/29/17 20:56, Christoffer Dall wrote:
> On Tue, Mar 28, 2017 at 01:24:15PM -0700, Radha Mohan wrote:
>> On Tue, Mar 28, 2017 at 1:16 PM, Christoffer Dall  wrote:
>>> Hi Radha,
>>>
>>> On Tue, Mar 28, 2017 at 12:58:24PM -0700, Radha Mohan wrote:
 Hi,
 I am seeing an issue with qemu-system-aarch64 when using pflash
 (booting kernel via UEFI bios).

 Host kernel: 4.11.0-rc3-next-20170323
 Qemu version: v2.9.0-rc1

 Command used:
 ./aarch64-softmmu/qemu-system-aarch64 -cpu host -enable-kvm -M
 virt,gic_version=3 -nographic -smp 1 -m 2048 -drive
 if=none,id=hd0,file=/root/zesty-server-cloudimg-arm64.img,id=0 -device
 virtio-blk-device,drive=hd0 -pflash /root/flash0.img -pflash
 /root/flash1.img


 As soon as the guest kernel boots the host starts to stall and prints
 the below messages. And the system never recovers. I can neither
 poweroff the guest nor the host. So I have resort to external power
 reset of the host.

 ==
 [  116.199077] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s!
 [kworker/25:1:454]
 [  116.206901] Modules linked in: binfmt_misc nls_iso8859_1 aes_ce_blk
 shpchp crypto_simd gpio_keys cryptd aes_ce_cipher ghash_ce sha2_ce
 sha1_ce uio_pdrv_genirq uio autofs4 btrfs raid10 rai
 d456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
 raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm
 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_s
 ys_fops drm nicvf ahci nicpf libahci thunder_bgx thunder_xcv
 mdio_thunder mdio_cavium

 [  116.206995] CPU: 25 PID: 454 Comm: kworker/25:1 Not tainted
 4.11.0-rc3-next-20170323 #1
 [  116.206997] Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Feb 
 23 2017
 [  116.207010] Workqueue: events netstamp_clear
 [  116.207015] task: 801f906b5400 task.stack: 801f901a4000
 [  116.207020] PC is at smp_call_function_many+0x284/0x2e8
 [  116.207023] LR is at smp_call_function_many+0x244/0x2e8
 [  116.207026] pc : [] lr : []
 pstate: 8145
 [  116.207028] sp : 801f901a7be0
 [  116.207030] x29: 801f901a7be0 x28: 09139000
 [  116.207036] x27: 09139434 x26: 0080
 [  116.207041] x25:  x24: 081565d0
 [  116.207047] x23: 0001 x22: 08e11e00
 [  116.207052] x21: 801f6d5cff00 x20: 801f6d5cff08
 [  116.207057] x19: 09138e38 x18: 0a03
 [  116.207063] x17: b77c9028 x16: 082e81d8
 [  116.207068] x15: 3d0d6dd44d08 x14: 0036312196549b4a
 [  116.207073] x13: 58dabe4c x12: 0018
 [  116.207079] x11: 366e2f04 x10: 09f0
 [  116.207084] x9 : 801f901a7d30 x8 : 0002
 [  116.207089] x7 :  x6 : 
 [  116.207095] x5 :  x4 : 0020
 [  116.207100] x3 : 0020 x2 : 
 [  116.207105] x1 : 801f6d682578 x0 : 0003

 [  150.443116] INFO: rcu_sched self-detected stall on CPU
 [  150.448261]  25-...: (14997 ticks this GP)
 idle=47a/141/0 softirq=349/349 fqs=7495
 [  150.451115] INFO: rcu_sched detected stalls on CPUs/tasks:
 [  150.451123]  25-...: (14997 ticks this GP)
 idle=47a/141/0 softirq=349/349 fqs=7495
 [  150.451124]  (detected by 13, t=15002 jiffies, g=805, c=804, q=8384)
 [  150.451136] Task dump for CPU 25:
 [  150.451138] kworker/25:1R  running task0   454  2 
 0x0002
 [  150.451155] Workqueue: events netstamp_clear
 [  150.451158] Call trace:
 [  150.451164] [] __switch_to+0x90/0xa8
 [  150.451172] [] static_key_slow_inc+0x128/0x138
 [  150.451175] [] static_key_enable+0x34/0x60
 [  150.451178] [] netstamp_clear+0x68/0x80
 [  150.451181] [] process_one_work+0x158/0x478
 [  150.451183] [] worker_thread+0x50/0x4a8
 [  150.451187] [] kthread+0x108/0x138
 [  150.451190] [] ret_from_fork+0x10/0x50
 [  150.477451]   (t=15008 jiffies g=805 c=804 q=8384)
 [  150.482242] Task dump for CPU 25:
 [  150.482245] kworker/25:1R  running task0   454  2 
 0x0002
 [  150.482259] Workqueue: events netstamp_clear
 [  150.482264] Call trace:
 [  150.482271] [] dump_backtrace+0x0/0x2b0
 [  150.482277] [] show_stack+0x24/0x30
 [  150.482281] [] sched_show_task+0x128/0x178
 [  150.482285] [] dump_cpu_task+0x48/0x58
 [  150.482288] [] rcu_dump_cpu_stacks+0xa0/0xe8
 [  150.482297] [] rcu_check_callbacks+0x774/0x938
 [  150.482305] [] update_process_times+0x34/0x60
 [  150.482314] [] tick_sched_handle.isra.7+0x38/0x70
 [  150.482319] [] tick_sched_timer+0x4c/0x98
 [  150.482324] [] 

Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-29 Thread Laszlo Ersek
On 03/29/17 16:48, Christoffer Dall wrote:
> On Wed, Mar 29, 2017 at 10:36:51PM +0800, gengdongjiu wrote:
>> 2017-03-29 18:36 GMT+08:00, Achin Gupta :

>>> Qemu is essentially fulfilling the role of secure firmware at the
>>> EL2/EL1 interface (as discussed with Christoffer below). So it
>>> should generate the CPER before injecting the error.
>>>
>>> This is corresponds to (1) above apart from notifying UEFI (I am
>>> assuming you mean guest UEFI). At this time, the guest OS already
>>> knows where to pick up the CPER from through the HEST. Qemu has
>>> to create the CPER and populate its address at the address
>>> exported in the HEST. Guest UEFI should not be involved in this 
>>> flow. Its job was to create the HEST at boot and that has been
>>> done by this stage.
>>
>> Sorry,  As I understand it, after Qemu generate the CPER table, it
>> should pass the CPER table to the guest UEFI, then Guest UEFI  place
>> this CPER table to the guest OS memory. In this flow, the Guest UEFI
>> should be involved, else the Guest OS can not see the CPER table.
>>
> 
> I think you need to explain the "pass the CPER table to the guest UEFI"
> concept in terms of what really happens, step by step, and when you say
> "then Guest UEFI place the CPER table to the guest OS memory", I'm
> curious who is running what code on the hardware when doing that.

I strongly suggest to keep the guest firmware's runtime involvement to
zero. Two reasons:

(1) As you explained above (... which I conveniently snipped), when you
inject an interrupt to the guest, the handler registered for that
interrupt will come from the guest kernel.

The only exception to this is when the platform provides a type of
interrupt whose handler can be registered and then locked down by the
firmware. On x86, this is the SMI.

In practice though,
- in OVMF (x86), we only do synchronous (software-initiated) SMIs (for
privileged UEFI varstore access),
- and in ArmVirtQemu (ARM / aarch64), none of the management mode stuff
exists at all.

I understand that the Platform Init 1.5 (or 1.6?) spec abstracted away
the MM (management mode) protocols from Intel SMM, but at this point
there is zero code in ArmVirtQemu for that. (And I'm unsure how much of
any eligible underlying hw emulation exists in QEMU.)

So you can't get the guest firmware to react to the injected interrupt
without the guest OS coming between first.

(2) Achin's description matches really-really closely what is possible,
and what should be done with QEMU, ArmVirtQemu, and the guest kernel.

In any solution for this feature, the firmware has to reserve some
memory from the OS at boot. The current facilities we have enable this.
As I described previously, the ACPI linker/loader actions can be mapped
more or less 1:1 to Achin's design. From a practical perspective, you
really want to keep the guest firmware as dumb as possible (meaning: as
generic as possible), and keep the ACPI specifics to the QEMU and the
guest kernel sides.

The error serialization actions -- the co-operation between guest kernel
and QEMU on the special memory areas -- that were mentioned earlier by
Michael and Punit look like a complication. But, IMO, they don't differ
from any other device emulation -- DMA actions in particular -- that
QEMU already does. Device models are what QEMU *does*. Read the command
block that the guest driver placed in guest memory, parse it, sanity
check it, verify it, execute it, write back the status code, inject an
interrupt (and/or let any polling guest driver notice it "soon after" --
use barriers as necessary).

Thus, I suggest to rely on the generic ACPI linker/loader interface
(between QEMU and guest firmware) *only* to make the firmware lay out
stuff (= reserve buffers, set up pointers, install QEMU's ACPI tables)
*at boot*. Then, at runtime, let the guest kernel and QEMU (the "device
model") talk to each other directly. Keep runtime firmware involvement
to zero.

You *really* don't want to debug three components at runtime, when you
can solve the thing with two. (Two components whose build systems won't
drive you mad, I should add.)

IMO, Achin's design nailed it. We can do that.

Laszlo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-29 Thread Laszlo Ersek
On 03/29/17 14:51, Michael S. Tsirkin wrote:
> On Wed, Mar 29, 2017 at 01:58:29PM +0200, Laszlo Ersek wrote:
>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>> through a signalfd -- QEMU can format the CPER right into guest memory,
>> and then inject whatever interrupt (or assert whatever GPIO line) is
>> necessary for notifying the guest.
> 
> I think I see a race condition potential - what if guest accesses
> CPER in guest memory while it's being written?

I'm not entirely sure about the data flow here (these parts of the ACPI
spec are particularly hard to read...), but I thought the OS wouldn't
look until it got a notification.

Or, are you concerned about the next CPER write by QEMU, while the OS is
reading the last one (and maybe the CPER area could wrap around?)

> 
> We can probably use another level of indirection to fix this:
> 
> allocate twice the space, add a pointer to where the valid
> table is located and update that after writing CPER completely.
> The pointer can be written atomically but also needs to
> be read atomically, so I suspect it should be a single byte
> as we don't know how are OSPMs implementing this.
> 

A-B-A problem? (Is that usually solved with a cookie or a wider
generation counter? But that would again require wider atomics.)
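
For reference, here's a bare-bones sketch of the double-buffering idea
quoted above (purely illustrative: the structure, the names and the
barrier are made up here, and it does not address the A-B-A concern):

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint8_t valid_slot;             /* 0 or 1; a single byte, so the
                                         * guest can read it atomically */
        uint8_t cper[2][MAX_CPER_SIZE]; /* two slots; MAX_CPER_SIZE is a
                                         * made-up constant */
    } ErrorStatusArea;

    static void publish_cper(ErrorStatusArea *a, const void *cper, size_t len)
    {
        uint8_t next = !a->valid_slot;

        memcpy(a->cper[next], cper, len); /* fill the inactive slot first */
        smp_wmb();                        /* QEMU's write barrier: make the
                                           * record visible before the flip */
        a->valid_slot = next;             /* flip only once the record is
                                           * complete */
    }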

I do wonder though how this is handled on physical hardware. Assuming
the hardware error traps to the firmware first (which, on phys hw, is
responsible for depositing the CPER), in that scenario the phys firmware
would face the same issue (i.e., asynchronously interrupting the OS,
which could be reading the previously stored CPER).

Thanks,
Laszlo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-29 Thread Laszlo Ersek
(This ought to be one of the longest address lists I've ever seen :)
Thanks for the CC. I'm glad Shannon is already on the CC list. For good
measure, I'm adding MST and Igor.)

On 03/29/17 12:36, Achin Gupta wrote:
> Hi gengdongjiu,
> 
> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>
>> Hi Laszlo/Biesheuvel/Qemu developer,
>>
>>Now I encounter an issue and want to consult with you on the ARM64
>> platform, as described below:
>>
>> when guest OS happen synchronous or asynchronous abort, kvm needs
>> to send the error address to Qemu or UEFI through sigbus to
>> dynamically generate APEI table. from my investigation, there are
>> two ways:
>>
>> (1) Qemu get the error address, and generate the APEI table, then
>> notify UEFI to know this generation, then inject abort error to
>> guest OS, guest OS read the APEI table.
>> (2) Qemu get the error address, and let UEFI to generate the APEI
>> table, then inject abort error to guest OS, guest OS read the APEI
>> table.
> 
> Just being pedantic! I don't think we are talking about creating the APEI 
> table
> dynamically here. The issue is: Once KVM has received an error that is 
> destined
> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the 
> error
> into the guest OS, a CPER (Common Platform Error Record) has to be generated
> corresponding to the error source (GHES corresponding to memory subsystem,
> processor etc) to allow the guest OS to do anything meaningful with the
> error. So who should create the CPER is the question.
> 
> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
> responsible for creating the CPER. ARM is experimenting with using a 
> Standalone
> MM EDK2 image in the secure world to do the CPER creation. This will avoid
> adding the same code in ARM TF in EL3 (better for security). The error will 
> then
> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
> Firmware.
> 
> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
> interface (as discussed with Christoffer below). So it should generate the 
> CPER
> before injecting the error.
> 
> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
> mean guest UEFI). At this time, the guest OS already knows where to pick up 
> the
> CPER from through the HEST. Qemu has to create the CPER and populate its 
> address
> at the address exported in the HEST. Guest UEFI should not be involved in this
> flow. Its job was to create the HEST at boot and that has been done by this
> stage.
> 
> Qemu folk will be able to add but it looks like support for CPER generation 
> will
> need to be added to Qemu. We need to resolve this.
> 
> Do shout if I am missing anything above.

After reading this email, the use case looks *very* similar to what
we've just done with VMGENID for QEMU 2.9.

We have a facility between QEMU and the guest firmware, called "ACPI
linker/loader", with which QEMU instructs the firmware to

- allocate and download blobs into guest RAM (AcpiNVS type memory) --
ALLOCATE command,

- relocate pointers in those blobs, to fields in other (or the same)
blobs -- ADD_POINTER command,

- set ACPI table checksums -- ADD_CHECKSUM command,

- and send GPAs of fields within such blobs back to QEMU --
WRITE_POINTER command.
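
For concreteness, here is roughly how such commands are emitted from
QEMU's ACPI build code with the existing helpers (the blob names, offsets
and variables below are made up; this is just a generic sketch, not the
wiring proposed for this feature):

    /* ALLOCATE: have the firmware download "etc/my_blob" into guest RAM */
    bios_linker_loader_alloc(linker, "etc/my_blob", my_blob, 4096,
                             false /* high memory, not FSEG */);

    /* ADD_POINTER: patch the 8-byte field at offset 0 of "etc/my_blob" so
     * that it points at offset 0x100 of the same (or another) blob */
    bios_linker_loader_add_pointer(linker, "etc/my_blob", 0,
                                   sizeof(uint64_t), "etc/my_blob", 0x100);

    /* ADD_CHECKSUM: checksum bytes [0, len) of the blob, storing the
     * result at checksum_offset */
    bios_linker_loader_add_checksum(linker, "etc/my_blob", 0, len,
                                    checksum_offset);

    /* WRITE_POINTER: have the firmware write the final guest address of
     * offset 0x100 in "etc/my_blob" back to QEMU, through the writable
     * fw_cfg file "etc/my_blob_addr" */
    bios_linker_loader_write_pointer(linker, "etc/my_blob_addr", 0,
                                     sizeof(uint64_t), "etc/my_blob", 0x100);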

This is how I imagine we can map the facility to the current use case
(note that this is the first time I read about HEST / GHES / CPER):

etc/acpi/tables etc/hardware_errors
 ==
 +---+
+--+ | address   | +-> +--+
|HEST  + | registers | |   | Error Status |
+ ++ | +-+ |   | Data Block 1 |
| | GHES   | --> | | address | +   | ++
| | GHES   | --> | | address | --+ | |  CPER  |
| | GHES   | --> | | address | + | | |  CPER  |
| | GHES   | --> | | address | -+  | | | |  CPER  |
+-++ +-+-+  |  | | +-++
|  | |
|  | +---> +--+
|  |   | Error Status |
|  |   | Data Block 2 |
|  |   | ++
|  |   | |  CPER  |
|  |   | |  CPER  |
|  |   +-++
|  |
|  +-> +--+
|  | Error Status |
   

Re: issues with emulated PCI MMIO backed by host memory under KVM

2016-06-28 Thread Laszlo Ersek
(adding Gerd)

On 06/28/16 15:14, Ard Biesheuvel wrote:

> In any case, reconciling software that requires a framebuffer with a
> GPU emulation that does not expose one by design is going to be
> problematic even without this issue. How is this supposed to work on
> x86?

AFAIK:

"virtio-gpu-pci" is the device model without the framebuffer. It is good
for secondary displays (i.e. those that you don't boot with, only use
after the guest kernel starts up).

"virtio-vga" is the same, but it also has the legacy VGA framebuffer,
hence it can be used for accommodating boot loaders. (Except it won't
work for aarch64 KVM guests, because of $SUBJECT.)

Thanks
Laszlo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: issues with emulated PCI MMIO backed by host memory under KVM

2016-06-28 Thread Laszlo Ersek
On 06/28/16 12:04, Christoffer Dall wrote:
> On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
>> On 27 June 2016 at 15:35, Christoffer Dall  
>> wrote:
>>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
 On 27 June 2016 at 12:34, Christoffer Dall  
 wrote:
> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>> On 27 June 2016 at 11:16, Christoffer Dall  
>> wrote:
>>> Hi,
>>>
>>> I'm going to ask some stupid questions here...
>>>
>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
 Hi all,

 This old subject came up again in a discussion related to PCIe support
 for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
 regions as cacheable is preventing us from reusing a significant slice
 of the PCIe support infrastructure, and so I'd like to bring this up
 again, perhaps just to reiterate why we're simply out of luck.

 To refresh your memories, the issue is that on ARM, PCI MMIO regions
 for emulated devices may be backed by memory that is mapped cacheable
 by the host. Note that this has nothing to do with the device being
 DMA coherent or not: in this case, we are dealing with regions that
 are not memory from the POV of the guest, and it is reasonable for the
 guest to assume that accesses to such a region are not visible to the
 device before they hit the actual PCI MMIO window and are translated
 into cycles on the PCI bus.
>>>
>>> For the sake of completeness, why is this reasonable?
>>>
>>
>> Because the whole point of accessing these regions is to communicate
>> with the device. It is common to use write combining mappings for
>> things like framebuffers to group writes before they hit the PCI bus,
>> but any caching just makes it more difficult for the driver state and
>> device state to remain synchronized.
>>
>>> Is this how any real ARM system implementing PCI would actually work?
>>>
>>
>> Yes.
>>
 That means that mapping such a region
 cacheable is a strange thing to do, in fact, and it is unlikely that
 patches implementing this against the generic PCI stack in Tianocore
 will be accepted by the maintainers.

 Note that this issue not only affects framebuffers on PCI cards, it
 also affects emulated USB host controllers (perhaps Alex can remind us
 which one exactly?) and likely other emulated generic PCI devices as
 well.

 Since the issue exists only for emulated PCI devices whose MMIO
 regions are backed by host memory, is there any way we can already
 distinguish such memslots from ordinary ones? If we can, is there
 anything we could do to treat these specially? Perhaps something like
 using read-only memslots so we can at least trap guest writes instead
 of having main memory going out of sync with the caches unnoticed? I
 am just brainstorming here ...
>>>
>>> I think the only sensible solution is to make sure that the guest and
>>> emulation mappings use the same memory type, either cached or
>>> non-cached, and we 'simply' have to find the best way to implement this.
>>>
>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>>> one way.
>>>
>>> The other way is to use something like what you once wrote that rewrites
>>> stage-1 mappings to be cacheable, does that apply here ?
>>>
>>> Do we have a clear picture of why we'd prefer one way over the other?
>>>
>>
>> So first of all, let me reiterate that I could only find a single
>> instance in QEMU where a PCI MMIO region is backed by host memory,
>> which is vga-pci.c. I wonder of there are any other occurrences, but
>> if there aren't any, it makes much more sense to prohibit PCI BARs
>> backed by host memory rather than spend a lot of effort working around
>> it.
>
> Right, ok.  So Marc's point during his KVM Forum talk was basically,
> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>

 Yes. But nothing is preventing you currently from using that, and I
 think we should prefer crappy performance but correct operation over
 the current situation. So in general, we should either disallow PCI
 BARs backed by host memory, or emulate them, but never back them by a
 RAM memslot when running under ARM/KVM.
>>>
>>> agreed, I just think that emulating accesses by trapping them is not
>>> just slow, it's not really possible in practice and even if it is, it's
>>> probably *unusably* slow.
>>>
>>
>> Well, it would probably involve a lot of effort to implement 

Re: issues with emulated PCI MMIO backed by host memory under KVM

2016-06-28 Thread Laszlo Ersek
On 06/27/16 16:29, Alexander Graf wrote:
> 
> 
>> Am 27.06.2016 um 15:57 schrieb Ard Biesheuvel :
>>
>>> On 27 June 2016 at 15:35, Christoffer Dall  
>>> wrote:
 On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
> On 27 June 2016 at 12:34, Christoffer Dall  
> wrote:
>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>> On 27 June 2016 at 11:16, Christoffer Dall 
>>>  wrote:
>>> Hi,
>>>
>>> I'm going to ask some stupid questions here...
>>>
 On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
 Hi all,

 This old subject came up again in a discussion related to PCIe support
 for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
 regions as cacheable is preventing us from reusing a significant slice
 of the PCIe support infrastructure, and so I'd like to bring this up
 again, perhaps just to reiterate why we're simply out of luck.

 To refresh your memories, the issue is that on ARM, PCI MMIO regions
 for emulated devices may be backed by memory that is mapped cacheable
 by the host. Note that this has nothing to do with the device being
 DMA coherent or not: in this case, we are dealing with regions that
 are not memory from the POV of the guest, and it is reasonable for the
 guest to assume that accesses to such a region are not visible to the
 device before they hit the actual PCI MMIO window and are translated
 into cycles on the PCI bus.
>>>
>>> For the sake of completeness, why is this reasonable?
>>
>> Because the whole point of accessing these regions is to communicate
>> with the device. It is common to use write combining mappings for
>> things like framebuffers to group writes before they hit the PCI bus,
>> but any caching just makes it more difficult for the driver state and
>> device state to remain synchronized.
>>
>>> Is this how any real ARM system implementing PCI would actually work?
>>
>> Yes.
>>
 That means that mapping such a region
 cacheable is a strange thing to do, in fact, and it is unlikely that
 patches implementing this against the generic PCI stack in Tianocore
 will be accepted by the maintainers.

 Note that this issue not only affects framebuffers on PCI cards, it
 also affects emulated USB host controllers (perhaps Alex can remind us
 which one exactly?) and likely other emulated generic PCI devices as
 well.

 Since the issue exists only for emulated PCI devices whose MMIO
 regions are backed by host memory, is there any way we can already
 distinguish such memslots from ordinary ones? If we can, is there
 anything we could do to treat these specially? Perhaps something like
 using read-only memslots so we can at least trap guest writes instead
 of having main memory going out of sync with the caches unnoticed? I
 am just brainstorming here ...
>>>
>>> I think the only sensible solution is to make sure that the guest and
>>> emulation mappings use the same memory type, either cached or
>>> non-cached, and we 'simply' have to find the best way to implement this.
>>>
>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>>> one way.
>>>
>>> The other way is to use something like what you once wrote that rewrites
>>> stage-1 mappings to be cacheable, does that apply here ?
>>>
>>> Do we have a clear picture of why we'd prefer one way over the other?
>>
>> So first of all, let me reiterate that I could only find a single
>> instance in QEMU where a PCI MMIO region is backed by host memory,
>> which is vga-pci.c. I wonder of there are any other occurrences, but
>> if there aren't any, it makes much more sense to prohibit PCI BARs
>> backed by host memory rather than spend a lot of effort working around
>> it.
>
> Right, ok.  So Marc's point during his KVM Forum talk was basically,
> don't use the legacy VGA adapter on ARM and use virtio graphics, right?

 Yes. But nothing is preventing you currently from using that, and I
 think we should prefer crappy performance but correct operation over
 the current situation. So in general, we should either disallow PCI
 BARs backed by host memory, or emulate them, but never back them by a
 RAM memslot when running under ARM/KVM.
>>>
>>> agreed, I just think that emulating accesses by trapping them is not
>>> just slow, it's not really possible in practice and even if it is, it's
>>> probably *unusably* slow.
>>
>> Well, it would probably involve a lot of effort to implement 

Re: issues with emulated PCI MMIO backed by host memory under KVM

2016-06-28 Thread Laszlo Ersek
On 06/27/16 12:34, Christoffer Dall wrote:
> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:

>> So first of all, let me reiterate that I could only find a single
>> instance in QEMU where a PCI MMIO region is backed by host memory,
>> which is vga-pci.c. I wonder of there are any other occurrences, but
>> if there aren't any, it makes much more sense to prohibit PCI BARs
>> backed by host memory rather than spend a lot of effort working around
>> it.
> 
> Right, ok.  So Marc's point during his KVM Forum talk was basically,
> don't use the legacy VGA adapter on ARM and use virtio graphics, right?

The EFI GOP (Graphics Output Protocol) abstraction provides two ways for
UEFI applications to access the display, and one way for a runtime OS to
inherit the display hardware from the firmware (without OS native drivers).

(a) For UEFI apps:
- direct framebuffer access
- Blt() (block transfer) member function

(b) For runtime OS:
- direct framebuffer access ("efifb" in Linux)

Virtio-gpu lacks a linear framebuffer by design. Therefore the above
methods are reduced to the following:

(c) UEFI apps can access virtio-gpu with:
- GOP.Blt() member function only

(d) The runtime guest OS can access the virtio-gpu device as-inherited
from the firmware (i.e., without native drivers) with:
- n/a.

Given that we expect all aarch64 OSes to include native virtio-gpu
drivers on their install media, (d) is actually not a problem. Whenever
the OS kernel runs, we expect to have no need for "efifb", ever. So
that's good.

The problem is (c). UEFI boot loaders would have to be taught to call
GOP.Blt() manually, whenever they need to display something. I'm not
sure about grub2's current status, but it is free software, so in theory
it should be doable. However, UEFI windows boot loaders are proprietary
*and* they require direct framebuffer access (on x86 at least); they
don't work with Blt()-only. (I found some Microsoft presentations about
this earlier.)

So, virtio-gpu is an almost universal solution for the problem, but not
entirely. For any given GOP, offering Blt() *only* (i.e., not exposing a
linear framebuffer) conforms to the UEFI spec, but some boot loaders are
known to present further requirements (on x86 anyway).
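
To make the distinction concrete, here is a minimal, untested sketch (not from
the thread) of how a UEFI application might paint a rectangle either way;
FillRect is a made-up name, and the direct path assumes a 32-bit pixel layout:

#include <Uefi.h>
#include <Protocol/GraphicsOutput.h>

//
// Paint a rectangle either via the linear framebuffer (when the GOP
// exposes one) or via Blt() (the only method virtio-gpu offers).
//
STATIC
EFI_STATUS
FillRect (
  IN EFI_GRAPHICS_OUTPUT_PROTOCOL  *Gop,
  IN UINTN                         X,
  IN UINTN                         Y,
  IN UINTN                         Width,
  IN UINTN                         Height
  )
{
  EFI_GRAPHICS_OUTPUT_BLT_PIXEL  Blue = { 0xFF, 0x00, 0x00, 0x00 };

  if (Gop->Mode->Info->PixelFormat != PixelBltOnly) {
    //
    // Cases (a)/(b): direct framebuffer access, what "efifb" and the
    // Windows boot loader rely on. Assumes 32 bits per pixel here.
    //
    UINT32  *Fb;
    UINTN   Stride, Row, Col;

    Fb     = (UINT32 *)(UINTN)Gop->Mode->FrameBufferBase;
    Stride = Gop->Mode->Info->PixelsPerScanLine;
    for (Row = Y; Row < Y + Height; Row++) {
      for (Col = X; Col < X + Width; Col++) {
        Fb[Row * Stride + Col] = 0x000000FF;
      }
    }
    return EFI_SUCCESS;
  }

  //
  // Case (c): Blt()-only access, the sole method available on virtio-gpu.
  //
  return Gop->Blt (
                Gop,
                &Blue,
                EfiBltVideoFill,
                0, 0,            // SourceX/Y, ignored for a video fill
                X, Y,
                Width, Height,
                0                // Delta, ignored for a video fill
                );
}

In a real application the GOP instance would come from gBS->HandleProtocol()
on gST->ConsoleOutHandle; the point is only that the second path is all a
Blt()-only GOP, such as virtio-gpu's, provides.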

Thanks
Laszlo


Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED

2015-05-14 Thread Laszlo Ersek
On 05/14/15 15:00, Andrew Jones wrote:
 On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
 On 14 May 2015 at 13:28, Paolo Bonzini pbonz...@redhat.com wrote:
 Well, PCI BARs are generally MMIO resources, and hence should not be cached.

 As an optimization, OS drivers can mark them as cacheable or
 write-combining or something like that, but in general it's a safe
 default to leave them uncached---one would think.

 Isn't this handled by the OS mapping them in the 'prefetchable'
 MMIO window rather than the 'non-prefetchable' one? (QEMU's
 generic-PCIe device doesn't yet support the prefetchable window.)
 
 I was thinking (with my limited PCI knowledge) the same thing, and
 was planning on experimenting with that.

This could be supported in UEFI as well, with the following steps:
- the DTB that QEMU provides UEFI with should advertise such a
  prefetchable window.
- The driver in UEFI that parses the DTB should understand that DTB
  node (well, record type), and store the appropriate base & size into
  some new dynamic PCDs (= basically, firmware-wide global variables;
  PCD = platform configuration database)
- The entry point of the host bridge driver would call
  gDS->AddMemorySpace() twice, separately for the two different windows,
  with their appropriate caching attributes.
- The host bridge driver needs to be extended so that TypePMem32
  requests are not rejected (like now); they should be handled
  similarly to TypeMem32. Except, the gDS->AllocateMemorySpace() call
  should allocate from the prefetchable range (determined by the new
  PCDs above).
- QEMU's emulated devices should then expose their BARs as prefetchable
  (so that the above branch would be taken in the host bridge driver).

(Of course, if QEMU intends to emulate PCI devices somewhat
realistically, then QEMU should claim non-prefetchable for BARs that
would not be prefetchable on physical hardware either, and then the
hypervisor should accommodate the firmware's UC mapping and say "hey, I
know better, we're virtual in fact", and override the attribute (-> use
WB instead of UC). With which we'd be back to square one...)
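
A rough sketch of the gDS->AddMemorySpace() step above (untested; the
PcdPciPrefMmio* names are invented and would first have to be declared as
dynamic PCDs):

#include <PiDxe.h>
#include <Library/DxeServicesTableLib.h>
#include <Library/PcdLib.h>

//
// Register both PCI MMIO apertures in the GCD memory space map:
// UC-only capabilities for the plain window, UC+WC for the prefetchable
// one, so allocations from the latter can later be mapped
// write-combining.
//
EFI_STATUS
RegisterPciMmioWindows (
  IN EFI_PHYSICAL_ADDRESS  NonPrefBase,
  IN UINT64                NonPrefSize
  )
{
  EFI_STATUS  Status;

  Status = gDS->AddMemorySpace (
                  EfiGcdMemoryTypeMemoryMappedIo,
                  NonPrefBase,
                  NonPrefSize,
                  EFI_MEMORY_UC
                  );
  if (EFI_ERROR (Status)) {
    return Status;
  }

  return gDS->AddMemorySpace (
                  EfiGcdMemoryTypeMemoryMappedIo,
                  PcdGet64 (PcdPciPrefMmioBase),   // hypothetical new PCDs,
                  PcdGet64 (PcdPciPrefMmioSize),   // filled in from the DTB
                  EFI_MEMORY_UC | EFI_MEMORY_WC
                  );
}

The host bridge driver would then satisfy TypePMem32 requests with
gDS->AllocateMemorySpace() against the prefetchable range, mirroring what it
already does for TypeMem32.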

Thanks
Laszlo


Re: [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED

2015-05-14 Thread Laszlo Ersek
On 05/14/15 14:34, Christoffer Dall wrote:
 On Thu, May 14, 2015 at 02:28:49PM +0200, Paolo Bonzini wrote:


 On 14/05/2015 14:24, Christoffer Dall wrote:
 On Thu, May 14, 2015 at 02:08:49PM +0200, Paolo Bonzini wrote:


 On 14/05/2015 14:00, Christoffer Dall wrote:
 So, getting back to my original question.  Is the point then that UEFI
 must assume (from ACPI/DT) the cache-coherency properties of the PCI
 controller which exists in hardware on the system you're running on,
 even for the virtual PCI bus because that will be the semantics for
 assigned devices?

 And in that case, we have no way to distinguish between passthrough
 devices and virtual devices plugged into the virtual PCI bus?

 Well, we could use the subsystem id.  But it's a hack, and may cause
 incompatibilities with some drivers.  Michael, any ideas?

 What about the idea of having two virtual PCI buses on your system where
 one is always cache-coherent and used for virtual devices, and the other
 is whatever the hardware is and used for passthrough devices?

 I think that was rejected before.

 Do you remember where?  I just remember Catalin mentioning the idea to
 me verbally.

 In the last centithread on the subject. :)

 At least I and Peter disagreed.  It's not about the heavy added use of
 resources, it's more about it being really easy to misconfigure.

 But I'm still not sure why UEFI/Linux currently sees our PCI bus as
 being non-coherent when in fact it is and we have no passthrough issues
 currently.  Are all PCI controllers always non-coherent for some reason
 and therefore we model it as such too?

 Well, PCI BARs are generally MMIO resources, and hence should not be cached.

 As an optimization, OS drivers can mark them as cacheable or
 write-combining or something like that, but in general it's a safe
 default to leave them uncached---one would think.

 ok, I guess this series makes sense then, assuming it works, and
 assuming we don't kill performance by going to RAM all the time when we
 don't have to...

The idea Paolo and I had discussed in the past is:
- Remove the kludge from UEFI, and map all MMIO BARs as uncached by
  default. This should be a theoretically correct approach, and for
  assigned devices, correct in practice too.

- At an appropriate place in the firmware (specifically, right before
  this line: [1]), when PCI devices have been enumerated, but their
  particular drivers (especially VGA) have not been connected yet,
  check the subsystem id / vendor id / etc for each, and if we can tell
  it's virtual, then set the attributes for all of its MMIO bars to
  writeback.

It doesn't seem hard to implement, I've just been shying away from
actually coding it up because I'd like to see it make a difference for
an actual assigned device. That is, reproducing the current (statically
kludged) behavior wouldn't be hard, but I prefer not to write a new
patch until I can test it both ways. UC is broken and WB works for
virtual devices, fine; now let's see if the exact reverse holds for
assigned devices in reality.

... Testing of which will require someone to send a PCI card (NIC or
GPU) -- with an AARCH64 UEFI driver oprom on it -- to my place, so that
I can plug into my Mustang. ;)
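
For illustration only, the enumeration step described above might look
something like this (an untested sketch, not the eventual patch; the match on
vendor ID 0x1af4 / subsystem vendor ID 0x1b36 is an invented heuristic for
spotting QEMU-emulated devices):

#include <PiDxe.h>
#include <IndustryStandard/Pci.h>
#include <IndustryStandard/Acpi.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Library/DxeServicesTableLib.h>
#include <Protocol/PciIo.h>

//
// Walk all enumerated PCI devices; for anything that looks emulated,
// flip its memory BARs to write-back in the GCD map.
//
STATIC
VOID
RemapVirtualPciBarsToWriteBack (
  VOID
  )
{
  EFI_STATUS  Status;
  UINTN       Count, Idx;
  EFI_HANDLE  *Handles;

  Status = gBS->LocateHandleBuffer (ByProtocol, &gEfiPciIoProtocolGuid,
                  NULL, &Count, &Handles);
  if (EFI_ERROR (Status)) {
    return;
  }

  for (Idx = 0; Idx < Count; Idx++) {
    EFI_PCI_IO_PROTOCOL  *PciIo;
    UINT16               VendorId, SubsysVendorId;
    UINT8                Bar;

    if (EFI_ERROR (gBS->HandleProtocol (Handles[Idx],
                     &gEfiPciIoProtocolGuid, (VOID **)&PciIo))) {
      continue;
    }
    PciIo->Pci.Read (PciIo, EfiPciIoWidthUint16, PCI_VENDOR_ID_OFFSET, 1,
               &VendorId);
    PciIo->Pci.Read (PciIo, EfiPciIoWidthUint16, PCI_SVID_OFFSET, 1,
               &SubsysVendorId);
    if (VendorId != 0x1af4 && SubsysVendorId != 0x1b36) {
      continue;                          // doesn't look emulated, leave UC
    }

    for (Bar = 0; Bar < PCI_MAX_BAR; Bar++) {
      UINT64                             Supports;
      EFI_ACPI_ADDRESS_SPACE_DESCRIPTOR  *Resources, *Desc;

      if (EFI_ERROR (PciIo->GetBarAttributes (PciIo, Bar, &Supports,
                       (VOID **)&Resources))) {
        continue;
      }
      for (Desc = Resources; Desc->Desc == ACPI_ADDRESS_SPACE_DESCRIPTOR;
           Desc++) {
        if (Desc->ResType == ACPI_ADDRESS_SPACE_TYPE_MEM &&
            Desc->AddrLen > 0) {
          //
          // Ignore failures in this sketch; the GCD range must have been
          // added with WB capability for this to succeed.
          //
          gDS->SetMemorySpaceAttributes (Desc->AddrRangeMin, Desc->AddrLen,
                 EFI_MEMORY_WB);
        }
      }
      gBS->FreePool (Resources);
    }
  }
  gBS->FreePool (Handles);
}

Whether leaving the BARs of assigned devices at UC and flipping only the
emulated ones to WB is the right split is exactly what the assigned-device
test mentioned above is meant to confirm.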

Thanks
Laszlo

[1]
https://github.com/tianocore/edk2/blob/99d9ade85aad554a0fa08fff8586b0fd40570ac3/ArmPlatformPkg/ArmVirtualizationPkg/Library/PlatformIntelBdsLib/IntelBdsPlatform.c#L366



Re: [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED

2015-05-14 Thread Laszlo Ersek
On 05/14/15 12:30, Christoffer Dall wrote:
 On Wed, May 13, 2015 at 01:31:51PM +0200, Andrew Jones wrote:
 Introduce a new memory region flag, KVM_MEM_UNCACHED, which is
 needed by ARM. This flag informs KVM that the given memory region
 is typically mapped by the guest as non-cacheable. KVM for ARM
 then ensures that that memory is indeed mapped non-cacheable by
 the guest, and also remaps that region as non-cacheable for
 userspace, allowing them both to maintain a coherent view.

 Changes since v1:
  1) don't pin pages [Paolo]
  2) ensure the guest maps the memory non-cacheable [me]
  3) clean up memslot flag documentation [Christoffer]
 changes 1 and 2 effectively redesigned/rewrote v1. Find v1 here
 http://www.spinics.net/lists/kvm-arm/msg14022.html

 The QEMU series for v1 hasn't really changed. Only the linux
 header hack needed to bump KVM_CAP_UNCACHED_MEM from 107 to
 116.  Find the series here
 http://www.spinics.net/lists/kvm-arm/msg14026.html

 Testing:
 This series still needs lots of testing, but I thought I'd
 kick it to the list early, as there's been recent interest
 in solving this problem, and I'd like to get test results
 and opinions on this approach from others sooner than later.
 I've tested with AAVMF (UEFI for AArch64 mach-virt guests).
 AAVMF has a kludge in it to avoid the coherency problem.
 
 How does the 'kludge' work?

https://github.com/tianocore/edk2/commit/f9a8be42

(It's probably worth looking at the documentation in the first hunk too,
under the commit message.)

Thanks
Laszlo


 
 I've tested both with and without that kludge active. Both
 worked for me (almost). Sometimes with the non-kludged
 version I was still able to see a bit of corruption in
 grub's output after edk2 loaded it - not much, and not always,
 but something.
 
 Remind me, this is a VGA framebuffer corruption with a PCI-plugged VGA
 card?
 
 Thanks,
 -Christoffer
 
 Anyway, it's quite frustrating, as I'm not sure
 what I'm missing...

 This series applies to Linus' 110bc76729d4, but I tested with
 a version backported to the current RHELSA kernel.

 Thanks for reviews and testing!

 drew


 Andrew Jones (3):
   arm/arm64: pageattr: add set_memory_nc
   KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
   arm/arm64: KVM: implement 'uncached' mem coherency

 Documentation/virtual/kvm/api.txt     | 20 --
 arch/arm/include/asm/cacheflush.h     |  1 +
 arch/arm/include/asm/kvm_mmu.h        |  5 -
 arch/arm/include/asm/pgtable-3level.h |  1 +
 arch/arm/include/asm/pgtable.h        |  1 +
 arch/arm/include/uapi/asm/kvm.h       |  1 +
 arch/arm/kvm/arm.c                    |  1 +
 arch/arm/kvm/mmu.c                    | 39 ++-
 arch/arm/mm/pageattr.c                |  7 +++
 arch/arm64/include/asm/cacheflush.h   |  1 +
 arch/arm64/include/asm/kvm_mmu.h      |  5 -
 arch/arm64/include/asm/memory.h       |  1 +
 arch/arm64/include/asm/pgtable.h      |  1 +
 arch/arm64/include/uapi/asm/kvm.h     |  1 +
 arch/arm64/mm/pageattr.c              |  8 +++
 include/linux/kvm_host.h              |  1 -
 include/uapi/linux/kvm.h              |  2 ++
 virt/kvm/kvm_main.c                   |  7 ++-
  18 files changed, 79 insertions(+), 24 deletions(-)

 -- 
 2.1.0

