date:20151124

Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-24 Thread Michael S. Tsirkin

On Tue, Nov 24, 2015 at 09:35:26PM +0800, Lan Tianyu wrote:
> This patch is to add SRIOV VF migration support.
> Create new device type "vfio-sriov" and add faked PCI migration capability
> to the type device.
> 
> The purpose of the new capability
> 1) sync migration status with VF driver in the VM
> 2) Get mailbox irq vector to notify VF driver during migration.
> 3) Provide a way to control injecting irq or not.
> 
> Qemu will migrate PCI configure space regs and MSIX config for VF.
> Inject mailbox irq at last stage of migration to notify VF about
> migration event and wait VF driver ready for migration.

I think this last bit, "wait VF driver ready for migration",
is wrong: not much is gained compared to hot-unplug.

To really benefit from this feature, migration should
succeed even if the guest is stuck; the interrupt should then
tell the guest that it has to reset the driver.


> VF driver
> writes PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
> to tell Qemu.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/pci.c |   6 ++
>  hw/vfio/pci.h |   4 ++
>  hw/vfio/sriov.c   | 178 
> ++
>  4 files changed, 189 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/sriov.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index d540c9d..9cf0178 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o
> +obj-$(CONFIG_PCI) += pci.o sriov.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  endif
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7c43fc1..e7583b5 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>  } else if (was_enabled && !is_enabled) {
>  vfio_disable_msix(vdev);
>  }
> +} else if (vdev->migration_cap &&
> +ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
> +/* Write everything to QEMU to keep emulated bits correct */
> +pci_default_write_config(pdev, addr, val, len);
> +vfio_migration_cap_handle(pdev, addr, val, len);
>  } else {
>  /* Write everything to QEMU to keep emulated bits correct */
>  pci_default_write_config(pdev, addr, val, len);
> @@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
>  vfio_register_err_notifier(vdev);
>  vfio_register_req_notifier(vdev);
>  vfio_setup_resetfn(vdev);
> +vfio_add_migration_capability(vdev);
>  
>  return 0;
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 6c00575..ee6ca5e 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
>  PCIHostDeviceAddress host;
>  EventNotifier err_notifier;
>  EventNotifier req_notifier;
> +uint16_t migration_cap;
>  int (*resetfn)(struct VFIOPCIDevice *);
>  uint32_t features;
>  #define VFIO_FEATURE_ENABLE_VGA_BIT 0
> @@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>  void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
> uint32_t val, int len);
>  void vfio_enable_msix(VFIOPCIDevice *vdev);
> +void vfio_add_migration_capability(VFIOPCIDevice *vdev);
> +void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> +   uint32_t val, int len);
> diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
> new file mode 100644
> index 000..3109538
> --- /dev/null
> +++ b/hw/vfio/sriov.c
> @@ -0,0 +1,178 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "hw/hw.h"
> +#include "hw/vfio/pci.h"
> +#include "hw/vfio/vfio.h"
> +#include "hw/vfio/vfio-common.h"
> +
> +#define TYPE_VFIO_SRIOV "vfio-sriov"
> +
> +#define SRIOV_LM_SETUP 0x01
> +#define SRIOV_LM_COMPLETE 0x02
> +
> +QemuEvent migration_event;
> +
> +static void vfio_dev_post_load(void *opaque)
> +{
> +struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +MSIMessage msg;
> +int vector;
> +
> +if (vfio_pci_read_config(pdev,
> +vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
> +!= PCI_VF_MIGRATION_ENABLE)
> +return;
> +
> +vector = vfio_pci_read_config(pdev,
> +vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
> +
> +msg = msix_get_message(pdev, vector);
> +kvm_irqchip_send_msi(kvm_state, msg);
> +}
> +
> +static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +int ret;
> +
> +

Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Michael S. Tsirkin

On Tue, Nov 24, 2015 at 09:38:18PM +0800, Lan Tianyu wrote:
> This patch is to add migration support for ixgbevf driver. Using
> faked PCI migration capability table communicates with Qemu to
> share migration status and mailbox irq vector index.
> 
> Qemu will notify VF via sending MSIX msg to trigger mailbox
> vector during migration and store migration status in the
> PCI_VF_MIGRATION_VMM_STATUS regs in the new capability table.
> The mailbox irq will be triggered just before stop-and-copy stage
> and after migration on the target machine.
> 
> VF driver will put down net when detect migration and tell
> Qemu it's ready for migration via writing PCI_VF_MIGRATION_VF_STATUS
> reg. After migration, put up net again.
> 
> Qemu will in charge of migrating PCI config space regs and MSIX config.
> 
> The patch is to dedicate on the normal case that net traffic works
> when mailbox irq is enabled. For other cases(such as the driver
> isn't loaded, adapter is suspended or closed), mailbox irq won't be
> triggered and VF driver will disable it via PCI_VF_MIGRATION_CAP
> reg. These case will be resolved later.
> 
> Signed-off-by: Lan Tianyu 

I have to say, I was much more interested in the idea
of tracking dirty memory. I have some thoughts about
that one - did you give up on it then?



> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 
> ++
>  2 files changed, 107 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 775d089..4b8ba2f 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -438,6 +438,11 @@ struct ixgbevf_adapter {
>   u64 bp_tx_missed;
>  #endif
>  
> + u8 migration_cap;
> + u8 last_migration_reg;
> + unsigned long migration_status;
> + struct work_struct migration_task;
> +
>   u8 __iomem *io_addr; /* Mainly for iounmap use */
>   u32 link_speed;
>   bool link_up;
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index a16d267..95860c2 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -96,6 +96,8 @@ static int debug = -1;
>  module_param(debug, int, 0);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>  
> +#define MIGRATION_IN_PROGRESS 0
> +
>  static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter)
>  {
>   if (!test_bit(__IXGBEVF_DOWN, &adapter->state) &&
> @@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector *q_vector)
>   }
>  }
>  
> +static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter) 
> +{
> + struct pci_dev *pdev = adapter->pdev;
> + u8 val;
> +
> + pci_read_config_byte(pdev,
> +  adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
> +  &val);
> +
> + if (val != adapter->last_migration_reg) {
> + schedule_work(&adapter->migration_task);
> + adapter->last_migration_reg = val;
> + }
> +
> +}
> +
>  static irqreturn_t ixgbevf_msix_other(int irq, void *data)
>  {
>   struct ixgbevf_adapter *adapter = data;
> @@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void *data)
>  
>   hw->mac.get_link_status = 1;
>  
> + ixgbevf_migration_check(adapter);
>   ixgbevf_service_event_schedule(adapter);
>  
>   IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, adapter->eims_other);
> @@ -1383,6 +1402,7 @@ out:
>  static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
>  {
>   struct net_device *netdev = adapter->netdev;
> + struct pci_dev *pdev = adapter->pdev;
>   int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
>   int vector, err;
>   int ri = 0, ti = 0;
> @@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
>   goto free_queue_irqs;
>   }
>  
> + if (adapter->migration_cap) {
> + pci_write_config_byte(pdev,
> + adapter->migration_cap + PCI_VF_MIGRATION_IRQ,
> + vector);
> + }
> +
>   return 0;
>  
>  free_queue_irqs:
> @@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct ixgbevf_adapter *adapter)
>   ixgbevf_update_stats(adapter);
>  }
>  
> +static void ixgbevf_migration_task(struct work_struct *work)
> +{
> + struct ixgbevf_adapter *adapter = container_of(work,
> + struct ixgbevf_adapter,
> + migration_task);
> + struct pci_dev *pdev = adapter->pdev;
> + struct net_device *netdev = adapter->netdev;
> + u8 val;
> +
> + if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) {
> + pci_read_config_byte(pdev,
> +

RE: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-24 Thread Wu, Feng



> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Tuesday, November 24, 2015 10:38 PM
> To: Radim Krcmár ; Wu, Feng 
> Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> interrupts
> 
> 
> 
> On 24/11/2015 15:35, Radim Krcmár wrote:
> > > Thanks for your guys' review. Yes, we can introduce a module option
> > > for it. According to Radim's comments above, we need use the
> > > same policy for PI and non-PI lowest-priority interrupts, so here is the
> > > question: for vector hashing, it is easy to apply it for both non-PI and 
> > > PI
> > > case, however, for Round-Robin, in non-PI case, the round robin counter
> > > is used and updated when the interrupt is injected to guest, but for
> > > PI case, the interrupt is injected to guest totally by hardware, software
> > > cannot control it while interrupt delivery, we can only decide the
> > > destination vCPU for the PI interrupt in the initial configuration
> > > time (guest update vMSI -> QEMU -> KVM). Do you guys have any good
> > > suggestion to do round robin for PI lowest-priority? Seems Round robin
> > > is not a good way for PI lowest-priority interrupts. Any comments
> > > are appreciated!
> >
> > It's meaningless to try dynamic algorithms with PI so if we allow both
> > lowest priority algorithms, I'd let PI handle any lowest priority only
> > with vector hashing.  (It's an ugly compromise.)
> 
> For now, I would just keep the 4.4 behavior, i.e. disable PI unless
> there is a single destination || vector hashing is enabled.  We can flip
> the switch later.

Okay, let me try to understand this clearly:
- We will have a new KVM command line parameter to indicate whether
  vector hashing is enabled.
- If it is not enabled, PI supports only single-destination lowest-priority
  interrupts, and non-PI continues to use round-robin (RR).
- If it is enabled, both PI and non-PI use vector hashing.

Is this the case you have in mind? Thanks a lot!

Thanks,
Feng

> 
> Paolo
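
The three cases listed in the message above can be written down as a small
decision function. This is only a sketch of the policy as summarized there:
the type and function names (lowprio_policy, lowprio_decide) and the boolean
inputs are hypothetical and are not taken from the KVM source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not KVM identifiers. */
enum lowprio_policy {
    LOWPRIO_ROUND_ROBIN,  /* software round-robin, updated at injection time */
    LOWPRIO_VECTOR_HASH   /* destination derived from the vector number */
};

struct lowprio_decision {
    bool use_posted_interrupt;   /* deliver through VT-d posted interrupts */
    enum lowprio_policy policy;  /* how the destination vCPU is chosen */
};

/*
 * vector_hashing: value of the proposed on/off parameter
 * pi_capable:     posted interrupts are available for this interrupt
 * ndest:          number of vCPUs in the lowest-priority destination set
 */
static struct lowprio_decision
lowprio_decide(bool vector_hashing, bool pi_capable, unsigned int ndest)
{
    struct lowprio_decision d;

    assert(ndest > 0);
    if (vector_hashing) {
        /* Same static policy for PI and non-PI. */
        d.policy = LOWPRIO_VECTOR_HASH;
        d.use_posted_interrupt = pi_capable;
    } else {
        /* Round-robin is dynamic, so PI is only usable for one destination. */
        d.policy = LOWPRIO_ROUND_ROBIN;
        d.use_posted_interrupt = pi_capable && ndest == 1;
    }
    return d;
}
```

With vector hashing off, a four-destination interrupt falls back to the
non-PI round-robin path; with it on, the same interrupt can stay on PI.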

RE: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-24 Thread Wu, Feng



> -Original Message-
> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> Sent: Tuesday, November 24, 2015 10:32 PM
> To: Wu, Feng 
> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> interrupts
> 
> 2015-11-24 01:26+, Wu, Feng:
> > "I don't think we do any vector hashing on our client parts.  This may be
> > why the customer is not able to detect this on Skylake client silicon.
> > The vector hashing is micro-architectural and something we had done on
> > server parts.
> >
> > If you look at the haswell server CPU spec
> > (https://www-ssl.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf)
> > In section 4.1.2, you will see an IntControl register (this is a register
> > controlled/configured by BIOS) - see below.
> 
> Thank you!
> 
> > If you look at bits 6:4 in that register, you see the option we offer in
> > hardware for what kind of redirection is applied to lowest priority
> > interrupts.
> > There are three options:
> > 1.  Fixed priority
> > 2.  Redirect last
> > 3.  Hash Vector
> >
> > If picking vector hash, then bits 10:8 specifies the APIC-ID bits used for
> > the hashing."
> 
> The hash function just interprets a subset of vector's bits as a number
> and uses that as a starting offset in a search for an enabled APIC
> within the destination set?
> 
> For example:
> The x2APIC destination is 0x00000055 (= first four even APICs in cluster
> 0), the vector is 0b11100000, and bits 10:8 of IntControl are 000.
> 
> 000 means that bits 7:4 of vector are selected, thus the vector hash is
> 0b1110 = 14, so the round-robin effectively does 14 % 4 (because we only
> have 4 destinations) and delivers to the 3rd possible APIC (= ID 6)?

In my current implementation, I don't select a subset of the vector's bits as
the number; instead, I use the whole vector number. From a software emulation
point of view, do we really need to select a subset of the vector's bits as the
base number? What is your opinion? Thanks a lot!

Thanks,
Feng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
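
To make the question in the message above concrete, here is a minimal sketch
of the hashing scheme under discussion: derive a number from the vector,
reduce it modulo the number of destinations, and pick that destination from
the bitmap. The helper names and the shift/field_mask parameters are
assumptions for illustration; this is neither the KVM implementation nor the
hardware algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Count the destinations (set bits) in a logical destination bitmap. */
static unsigned int count_dests(uint32_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Return the bit position of the idx-th (zero-based) set bit, or -1. */
static int nth_set_bit(uint32_t mask, unsigned int idx)
{
    for (int bit = 0; bit < 32; bit++) {
        if (mask & (UINT32_C(1) << bit)) {
            if (idx == 0)
                return bit;
            idx--;
        }
    }
    return -1;
}

/*
 * Pick a destination by hashing the vector.
 * shift/field_mask select which vector bits feed the hash (hypothetical
 * parameters): shift=4, field_mask=0x0F uses bits 7:4, while shift=0,
 * field_mask=0xFF uses the whole vector number.
 */
static int hash_pick_dest(uint32_t dest_mask, uint8_t vector,
                          unsigned int shift, uint8_t field_mask)
{
    unsigned int n = count_dests(dest_mask);
    unsigned int hash;

    if (n == 0)
        return -1;
    hash = ((unsigned int)vector >> shift) & field_mask;
    return nth_set_bit(dest_mask, hash % n);
}
```

Under these assumptions, hash_pick_dest(0x55, 0xE0, 4, 0x0F) hashes on bits
7:4 of the vector (14 % 4 = 2) and, counting destinations from zero, selects
bit 4. hash_pick_dest(0x55, 0xE0, 0, 0xFF) hashes on the whole vector instead
(224 % 4 = 0) and selects bit 0. How hardware numbers the candidates is not
settled in this thread.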

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu

On 2015年11月24日 22:20, Alexander Duyck wrote:
> I'm still not a fan of this approach.  I really feel like this is
> something that should be resolved by extending the existing PCI hot-plug
> rather than trying to instrument this per driver.  Then you will get the
> goodness for multiple drivers and multiple OSes instead of just one.  An
> added advantage to dealing with this in the PCI hot-plug environment
> would be that you could then still do a hot-plug even if the guest
> didn't load a driver for the VF since you would be working with the PCI
> slot instead of the device itself.
> 
> - Alex

Hi Alex:
What you mentioned seems to be the bonding driver solution.
The paper "Live Migration with Pass-through Device for Linux VM" describes
it. It hot-unplugs the VF during migration. To maintain the network
connection while the VF is out, it uses the Linux bonding driver to
switch between the VF NIC and an emulated NIC. The side effects are that
the VM needs additional configuration and that performance while
switching between the two NICs is not good.

-- 
Best regards
Tianyu Lan

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Alexander Duyck

On Tue, Nov 24, 2015 at 7:18 PM, Lan Tianyu  wrote:
> On 2015年11月24日 22:20, Alexander Duyck wrote:
>> I'm still not a fan of this approach.  I really feel like this is
>> something that should be resolved by extending the existing PCI hot-plug
>> rather than trying to instrument this per driver.  Then you will get the
>> goodness for multiple drivers and multiple OSes instead of just one.  An
>> added advantage to dealing with this in the PCI hot-plug environment
>> would be that you could then still do a hot-plug even if the guest
>> didn't load a driver for the VF since you would be working with the PCI
>> slot instead of the device itself.
>>
>> - Alex
>
> Hi Alex:
> What's you mentioned seems the bonding driver solution.
> Paper "Live Migration with Pass-through Device for Linux VM" describes
> it. It does VF hotplug during migration. In order to maintain Network
> connection when VF is out, it takes advantage of Linux bonding driver to
> switch between VF NIC and emulated NIC. But the side affects, that
> requires VM to do additional configure and the performance during
> switching two NIC is not good.

No, what I am getting at is that you can't go around and modify the
configuration space for every possible device out there.  This
solution won't scale.  If you instead moved the logic for notifying
the device into a separate mechanism such as making it a part of the
hot-plug logic then you only have to write the code once per OS in
order to get the hot-plug capability to pause/resume the device.  What
I am talking about is not full hot-plug, but rather to extend the
existing hot-plug in Qemu and the Linux kernel to support a
"pause/resume" functionality.  The PCI hot-plug specification calls
out the option of implementing something like this, but we don't
currently have support for it.

I just feel doing it through PCI hot-plug messages will scale much
better as you could likely make use of the power management
suspend/resume calls to take care of most of the needed implementation
details.

- Alex

Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Alexander Duyck

On Tue, Nov 24, 2015 at 1:20 PM, Michael S. Tsirkin  wrote:
> On Tue, Nov 24, 2015 at 09:38:18PM +0800, Lan Tianyu wrote:
>> This patch is to add migration support for ixgbevf driver. Using
>> faked PCI migration capability table communicates with Qemu to
>> share migration status and mailbox irq vector index.
>>
>> Qemu will notify VF via sending MSIX msg to trigger mailbox
>> vector during migration and store migration status in the
>> PCI_VF_MIGRATION_VMM_STATUS regs in the new capability table.
>> The mailbox irq will be triggered just before stop-and-copy stage
>> and after migration on the target machine.
>>
>> VF driver will put down net when detect migration and tell
>> Qemu it's ready for migration via writing PCI_VF_MIGRATION_VF_STATUS
>> reg. After migration, put up net again.
>>
>> Qemu will in charge of migrating PCI config space regs and MSIX config.
>>
>> The patch is to dedicate on the normal case that net traffic works
>> when mailbox irq is enabled. For other cases(such as the driver
>> isn't loaded, adapter is suspended or closed), mailbox irq won't be
>> triggered and VF driver will disable it via PCI_VF_MIGRATION_CAP
>> reg. These case will be resolved later.
>>
>> Signed-off-by: Lan Tianyu 
>
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

The tracking of dirty pages still needs to be addressed unless the
interface is being downed before migration even starts which based on
other comments I am assuming is not the case.

I still feel that having a means of marking a page as being dirty when
it is unmapped would be the best way to go.  That way you only have to
update the DMA API instead of messing with each and every driver
trying to add code to force the page to be dirtied.

- Alex
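
The idea of dirtying pages at unmap time can be sketched as follows. This is
a user-space model under stated assumptions, not the kernel DMA API: the
dirty_log structure, sketch_dma_unmap() and the direction enum are
hypothetical stand-ins for whatever hook the real unmap path would provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12     /* assume 4 KiB pages */
#define SKETCH_NPAGES     1024   /* size of the modelled guest memory */

/* One bit per page; a set bit means the page must be re-sent. */
struct dirty_log {
    uint64_t bits[SKETCH_NPAGES / 64];
};

/* Mark every page overlapped by [addr, addr + len) as dirty. */
static void dirty_log_mark(struct dirty_log *log, uint64_t addr, size_t len)
{
    uint64_t first, last;

    if (len == 0)
        return;
    first = addr >> SKETCH_PAGE_SHIFT;
    last = (addr + len - 1) >> SKETCH_PAGE_SHIFT;
    for (uint64_t pfn = first; pfn <= last && pfn < SKETCH_NPAGES; pfn++)
        log->bits[pfn / 64] |= UINT64_C(1) << (pfn % 64);
}

static bool dirty_log_test(const struct dirty_log *log, uint64_t pfn)
{
    return (pfn < SKETCH_NPAGES) && ((log->bits[pfn / 64] >> (pfn % 64)) & 1);
}

enum sketch_dma_dir { SKETCH_DMA_TO_DEVICE, SKETCH_DMA_FROM_DEVICE };

/*
 * Hypothetical unmap hook: only buffers the device may have written
 * (device-to-memory direction) need to be marked dirty.
 */
static void sketch_dma_unmap(struct dirty_log *log, uint64_t addr,
                             size_t len, enum sketch_dma_dir dir)
{
    if (dir == SKETCH_DMA_FROM_DEVICE)
        dirty_log_mark(log, addr, len);
}
```

Because the marking lives in one common unmap routine, individual drivers
would not need to change, which is the point made above.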

Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu

On 2015年11月25日 05:20, Michael S. Tsirkin wrote:
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

No, our final target is to keep the VF active during migration, and
tracking dirty memory is essential for that. But this does not seem
easy to get upstream in the short term, so as a first step we stop the
VF before migration.

On further thought, stopping the VF still requires tracking
DMA-written dirty memory, to make sure that data buffers received
before the VF stops are migrated. The easier way to do that is a dummy
write to the data buffer when a packet is received.

-- 
Best regards
Tianyu Lan
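
The "dummy write" mentioned above can be sketched like this: after the device
has filled a receive buffer, the CPU reads one byte per page and writes the
same value back, so that CPU-side dirty logging sees the page as modified.
The function name, the 4 KiB page size and the use of volatile are
assumptions for illustration; this is not code from the ixgbevf patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/*
 * Rewrite one byte in every page-sized stride of the buffer, plus the last
 * byte, so that an unaligned buffer's final page is also touched. The data
 * itself is left unchanged. Returns the number of writes issued.
 */
static size_t touch_rx_buffer(volatile uint8_t *buf, size_t len)
{
    size_t writes = 0;

    if (len == 0)
        return 0;
    for (size_t off = 0; off < len; off += SKETCH_PAGE_SIZE) {
        buf[off] = buf[off];   /* volatile: read, then write back */
        writes++;
    }
    buf[len - 1] = buf[len - 1];
    return writes + 1;
}
```

The cost is one load and one store per page of received data, paid on the
receive path for as long as migration tracking is needed.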

[PATCH net-next 0/3] basic busy polling support for vhost_net

2015-11-24 Thread Jason Wang

Hi all:

This series tries to add basic busy polling for vhost net. The idea is
simple: at the end of tx/rx processing, busy poll for a while for newly
added tx descriptors and for data on the rx receive socket. The maximum
amount of time (in us) that can be spent on busy polling is specified
via ioctl.
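
The idea can be illustrated outside the kernel with a bounded spin loop. The
sketch below is a user-space model, not the vhost code: busy_poll(),
now_us() and the two example callbacks are hypothetical names, and
clock_gettime() stands in for the kernel clock used by the series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in microseconds. */
static uint64_t now_us(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

typedef bool (*work_ready_fn)(void *ctx);

/*
 * Spin until ready() reports new work or timeout_us elapses.
 * Returns true if work arrived during the polling window.
 */
static bool busy_poll(work_ready_fn ready, void *ctx, uint64_t timeout_us)
{
    uint64_t deadline = now_us() + timeout_us;

    do {
        if (ready(ctx))
            return true;
    } while (now_us() < deadline);
    return false;
}

/* Example callback: reports work after *ctx further checks. */
static bool ready_after_n_checks(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}

/* Example callback: never reports work, so the poll always times out. */
static bool never_ready(void *ctx)
{
    (void)ctx;
    return false;
}
```

A caller would fall back to the normal notification path when busy_poll()
returns false, which is what bounds the cpu time spent spinning.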

Test A was done with:

- 50 us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Guest with 1 vcpu and 1 queue

Results:
- For stream workloads, ioexits were reduced dramatically for medium-sized
  (1024-2048) tx (at most -43%) and for almost all rx (at most -84%) as a
  result of polling. This more or less compensates for the cpu cycles
  possibly wasted on polling, which is probably why we still see some
  increase in normalized throughput in some cases.
- Tx throughput increased (at most 50%) except for the huge write
  (16384), and more packets were sent where it increased (+tpkts went up).
- Very minor rx regression in some cases.
- Improvement on TCP_RR (at most 17%).

Guest TX:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/  +18%/  -10%/   +7%/  +11%/0%
   64/ 2/  +14%/  -13%/   +7%/  +10%/0%
   64/ 4/   +8%/  -17%/   +7%/   +9%/0%
   64/ 8/  +11%/  -15%/   +7%/  +10%/0%
  256/ 1/  +35%/   +9%/  +21%/  +12%/  -11%
  256/ 2/  +26%/   +2%/  +20%/   +9%/  -10%
  256/ 4/  +23%/0%/  +21%/  +10%/   -9%
  256/ 8/  +23%/0%/  +21%/   +9%/   -9%
  512/ 1/  +31%/   +9%/  +23%/  +18%/  -12%
  512/ 2/  +30%/   +8%/  +24%/  +15%/  -10%
  512/ 4/  +26%/   +5%/  +24%/  +14%/  -11%
  512/ 8/  +32%/   +9%/  +23%/  +15%/  -11%
 1024/ 1/  +39%/  +16%/  +29%/  +22%/  -26%
 1024/ 2/  +35%/  +14%/  +30%/  +21%/  -22%
 1024/ 4/  +34%/  +13%/  +32%/  +21%/  -25%
 1024/ 8/  +36%/  +14%/  +32%/  +19%/  -26%
 2048/ 1/  +50%/  +27%/  +34%/  +26%/  -42%
 2048/ 2/  +43%/  +21%/  +36%/  +25%/  -43%
 2048/ 4/  +41%/  +20%/  +37%/  +27%/  -43%
 2048/ 8/  +40%/  +18%/  +35%/  +25%/  -42%
16384/ 1/0%/  -12%/   -1%/   +8%/  +15%
16384/ 2/0%/  -10%/   +1%/   +4%/   +5%
16384/ 4/0%/  -11%/   -3%/0%/   +3%
16384/ 8/0%/  -10%/   -4%/0%/   +1%

Guest RX:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/   -2%/  -21%/   +1%/   +2%/  -75%
   64/ 2/   +1%/   -9%/  +12%/0%/  -55%
   64/ 4/0%/   -6%/   +5%/   -1%/  -44%
   64/ 8/   -5%/   -5%/   +7%/  -23%/  -50%
  256/ 1/   -8%/  -18%/  +16%/  +15%/  -63%
  256/ 2/0%/   -8%/   +9%/   -2%/  -26%
  256/ 4/0%/   -7%/   -8%/  +20%/  -41%
  256/ 8/   -8%/  -11%/   -9%/  -24%/  -78%
  512/ 1/   -6%/  -19%/  +20%/  +18%/  -29%
  512/ 2/0%/  -10%/  -14%/   -8%/  -31%
  512/ 4/   -1%/   -5%/  -11%/   -9%/  -38%
  512/ 8/   -7%/   -9%/  -17%/  -22%/  -81%
 1024/ 1/0%/  -16%/  +12%/   +9%/  -11%
 1024/ 2/0%/  -11%/0%/   +3%/  -30%
 1024/ 4/0%/   -4%/   +2%/   +6%/  -15%
 1024/ 8/   -3%/   -4%/   -8%/   -8%/  -70%
 2048/ 1/   -8%/  -23%/  +36%/  +22%/  -11%
 2048/ 2/0%/  -12%/   +1%/   +3%/  -29%
 2048/ 4/0%/   -3%/  -17%/  -15%/  -84%
 2048/ 8/0%/   -3%/   +1%/   -3%/  +10%
16384/ 1/0%/  -11%/   +4%/   +7%/  -22%
16384/ 2/0%/   -7%/   +4%/   +4%/  -33%
16384/ 4/0%/   -2%/   -2%/   -4%/  -23%
16384/ 8/   -1%/   -2%/   +1%/  -22%/  -40%

TCP_RR:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/  +11%/  -26%/  +11%/  +11%/  +10%
1/25/  +11%/  -15%/  +11%/  +11%/0%
1/50/   +9%/  -16%/  +10%/  +10%/0%
1/   100/   +9%/  -15%/   +9%/   +9%/0%
   64/ 1/  +11%/  -31%/  +11%/  +11%/  +11%
   64/25/  +12%/  -14%/  +12%/  +12%/0%
   64/50/  +11%/  -14%/  +12%/  +12%/0%
   64/   100/  +11%/  -15%/  +11%/  +11%/0%
  256/ 1/  +11%/  -27%/  +11%/  +11%/  +10%
  256/25/  +17%/  -11%/  +16%/  +16%/   -1%
  256/50/  +16%/  -11%/  +17%/  +17%/   +1%
  256/   100/  +17%/  -11%/  +18%/  +18%/   +1%

Test B was done with:

- 50 us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Two guests, each with 1 vcpu and 1 queue
- Two vhost threads pinned to the same host cpu to simulate cpu
  contention

Results:
- Even in this extreme case, we still get at most 14% improvement on
  TCP_RR.
- For guest tx stream, minor improvement, with at most 5% regression in
  the one-byte case. For guest rx stream, at most 5% regression was seen.

Guest TX:
size /-+%   /
1/-5.55%/
64   /+1.11%/
256  /+2.33%/
512  /-0.03%/
1024 /+1.14%/
4096 /+0.00%/
16384/+0.00%/

Guest RX:
size /-+%   /
1/-5.11%/
64   /-0.55%/
256  /-2.35%/
512  /-3.39%/
1024 /+6.8% /
4096 /-0.01%/
16384/+0.00%/

TCP_RR:
size /-+%/
1/+9.79% /
64   /+4.51% /
256  /+6.47% /
512  /-3.37% /
1024 /+6.15% /
4096 /+14.88%/
16384/-2.23% /

Changes from RFC V3:
- small tweak on the code to avoid

[PATCH net-next 1/3] vhost: introduce vhost_has_work()

2015-11-24 Thread Jason Wang

This patch introduces a helper that gives a hint as to whether
any work is queued in the work list.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..43284ad 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
2.5.0


[PATCH net-next 3/3] vhost_net: basic polling support

2015-11-24 Thread Jason Wang

This patch polls for newly added tx buffers or data on the socket receive
queue for a while at the end of tx/rx processing. The maximum time
spent on polling is specified through a new kind of vring ioctl.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c| 72 ++
 drivers/vhost/vhost.c  | 15 ++
 drivers/vhost/vhost.h  |  1 +
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..ce6da77 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -287,6 +287,41 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool vhost_can_busy_poll(struct vhost_dev *dev,
+   unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
+   struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   while (vhost_can_busy_poll(vq->dev, endtime) &&
+  !vhost_vq_more_avail(vq->dev, vq))
+   cpu_relax();
+   preempt_enable();
+   }
+
+   return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +366,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -435,6 +469,34 @@ static int peek_head_len(struct sock *sk)
return len;
 }
 
+static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&net->dev, vq);
+
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+
+   while (vhost_can_busy_poll(&net->dev, endtime) &&
+  skb_queue_empty(&sk->sk_receive_queue) &&
+  !vhost_vq_more_avail(&net->dev, vq))
+   cpu_relax();
+
+   preempt_enable();
+
+   if (vhost_enable_notify(&net->dev, vq))
+   vhost_poll_queue(&vq->poll);
+   mutex_unlock(&vq->mutex);
+   }
+
+   return peek_head_len(sk);
+}
+
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
  * @vq - the relevant virtqueue
@@ -553,7 +615,7 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
-   while ((sock_len = peek_head_len(sock->sk))) {
+   while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
headcount = get_rx_bufs(vq, vq->heads, vhost_len,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b86c5aa..857af6c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -285,6 +285,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->memory = NULL;
vq->is_le = virtio_legacy_is_little_endian();
vhost_vq_reset_user_be(vq);
+   vq->busyloop_timeout = 0;
 }
 
 static int vhost_worker(void *data)
@@ -747,6 +748,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
struct

[PATCH net-next 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-24 Thread Jason Wang

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 26 +-
 drivers/vhost/vhost.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 163b365..b86c5aa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
+bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+   __virtio16 avail_idx;
+   int r;
+
+   r = __get_user(avail_idx, &vq->avail->idx);
+   if (r) {
+   vq_err(vq, "Failed to check avail idx at %p: %d\n",
+  &vq->avail->idx, r);
+   return false;
+   }
+
+   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
@@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
 * sure it's written, then check again. */
smp_mb();
-   r = __get_user(avail_idx, &vq->avail->idx);
-   if (r) {
-   vq_err(vq, "Failed to check avail idx at %p: %d\n",
-  &vq->avail->idx, r);
-   return false;
-   }
-
-   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+   return vhost_vq_more_avail(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 43284ad..2f3c57c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct 
vhost_virtqueue *,
   struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/21] arm64: KVM: Implement the core world switch

2015-11-24 Thread Alex Bennée


Marc Zyngier  writes:

> Implement the core of the world switch in C. Not everything is there
> yet, and there is nothing to re-enter the world switch either.
>
> But this already outlines the code structure well enough.
>
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/Makefile |   1 +
>  arch/arm64/kvm/hyp/switch.c | 134 
> 
>  2 files changed, 135 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/switch.c
>
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index 1e1ff06..9c11b0f 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -8,3 +8,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += timer-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
> +obj-$(CONFIG_KVM_ARM_HOST) += switch.o
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> new file mode 100644
> index 000..a3af81a
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -0,0 +1,134 @@
> +/*
> + * Copyright (C) 2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include "hyp.h"
> +
> +static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
> +{
> + u64 val;
> +
> + /*
> +  * We are about to set CPTR_EL2.TFP to trap all floating point
> +  * register accesses to EL2, however, the ARM ARM clearly states that
> +  * traps are only taken to EL2 if the operation would not otherwise
> +  * trap to EL1.  Therefore, always make sure that for 32-bit guests,
> +  * we set FPEXC.EN to prevent traps to EL1, when setting the TFP bit.
> +  */
> + val = vcpu->arch.hcr_el2;
> + if (val & HCR_RW) {
> + write_sysreg(1 << 30, fpexc32_el2);
> + isb();
> + }
> + write_sysreg(val, hcr_el2);
> + write_sysreg(1 << 15, hstr_el2);
> + write_sysreg(CPTR_EL2_TTA | CPTR_EL2_TFP, cptr_el2);
> + write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
> +}
> +
> +static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
> +{
> + write_sysreg(HCR_RW, hcr_el2);
> + write_sysreg(0, hstr_el2);
> + write_sysreg(read_sysreg(mdcr_el2) & MDCR_EL2_HPMN_MASK, mdcr_el2);
> + write_sysreg(0, cptr_el2);
> +}
> +
> +static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = kern_hyp_va(vcpu->kvm);
> + write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +}
> +
> +static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
> +{
> + write_sysreg(0, vttbr_el2);
> +}
> +
> +static hyp_alternate_select(__vgic_call_save_state,
> + __vgic_v2_save_state, __vgic_v3_save_state,
> + ARM64_HAS_SYSREG_GIC_CPUIF);
> +
> +static hyp_alternate_select(__vgic_call_restore_state,
> + __vgic_v2_restore_state, __vgic_v3_restore_state,
> + ARM64_HAS_SYSREG_GIC_CPUIF);
> +
> +static void __hyp_text __vgic_save_state(struct kvm_vcpu *vcpu)
> +{
> + __vgic_call_save_state()(vcpu);
> + write_sysreg(read_sysreg(hcr_el2) & ~HCR_INT_OVERRIDE, hcr_el2);
> +}
> +
> +static void __hyp_text __vgic_restore_state(struct kvm_vcpu *vcpu)
> +{
> + u64 val;
> +
> + val = read_sysreg(hcr_el2);
> + val |=  HCR_INT_OVERRIDE;
> + val |= vcpu->arch.irq_lines;
> + write_sysreg(val, hcr_el2);
> +
> + __vgic_call_restore_state()(vcpu);
> +}
> +
> +int __hyp_text __guest_run(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpu_context *host_ctxt;
> + struct kvm_cpu_context *guest_ctxt;
> + u64 exit_code;
> +
> + vcpu = kern_hyp_va(vcpu);
> + write_sysreg(vcpu, tpidr_el2);
> +
> + host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
> + guest_ctxt = &vcpu->arch.ctxt;
> +
> + __sysreg_save_state(host_ctxt);
> + __debug_cond_save_state(vcpu, &vcpu->arch.host_debug_state, host_ctxt);
> +
> + __activate_traps(vcpu);
> + __activate_vm(vcpu);
> +
> + __vgic_restore_state(vcpu);
> + __timer_restore_state(vcpu);
> +
> + /*
> +  * We must restore the 32-bit state before the sysregs, thanks
> +  * to Cortex-A57 erratum #852523.
> +  */
> + __sysreg32_restore_state(vcpu);
> + __sysreg_restore_state(guest_ctxt);
> +

[PULL 6/8] KVM: arm/arm64: vgic: Trust the LR state for HW IRQs

2015-11-24 Thread Christoffer Dall

We were probing the physical distributor state for the active state of a
HW virtual IRQ, because we had seen evidence that the LR state was not
cleared when the guest deactivated a virtual interrupt.

However, this issue turned out to be a software bug in the GIC, which
was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29
(KVM: arm/arm64: arch_timer: Preserve physical dist. active
state on LR.active, 2015-11-24)

Therefore, get rid of the complexities and just look at the LR.

Reviewed-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 97e2c08..65461f8 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1417,25 +1417,13 @@ static bool vgic_process_maintenance(struct kvm_vcpu 
*vcpu)
 static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
 {
struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
-   struct irq_phys_map *map;
-   bool phys_active;
bool level_pending;
-   int ret;
 
if (!(vlr.state & LR_HW))
return false;
 
-   map = vgic_irq_map_search(vcpu, vlr.irq);
-   BUG_ON(!map);
-
-   ret = irq_get_irqchip_state(map->irq,
-   IRQCHIP_STATE_ACTIVE,
-   &phys_active);
-
-   WARN_ON(ret);
-
-   if (phys_active)
-   return 0;
+   if (vlr.state & LR_STATE_ACTIVE)
+   return false;
 
spin_lock(&dist->lock);
level_pending = process_queued_irq(vcpu, lr, vlr);
-- 
2.1.2.330.g565301e.dirty


[PULL 2/8] arm64: KVM: Fix AArch32 to AArch64 register mapping

2015-11-24 Thread Christoffer Dall

From: Marc Zyngier 

When running a 32bit guest under a 64bit hypervisor, the ARMv8
architecture defines a mapping of the 32bit registers in the 64bit
space. This includes banked registers that are being demultiplexed
over the 64bit ones.

On exceptions caused by an operation involving a 32bit register, the
HW exposes the register number in the ESR_EL2 register. It was so
far understood that SW had to distinguish between AArch32 and AArch64
accesses (based on the current AArch32 mode and register number).

It turns out that I misinterpreted the ARM ARM, and the clue is in
D1.20.1: "For some exceptions, the exception syndrome given in the
ESR_ELx identifies one or more register numbers from the issued
instruction that generated the exception. Where the exception is
taken from an Exception level using AArch32 these register numbers
give the AArch64 view of the register."

Which means that the HW is already giving us the translated version,
and that we shouldn't try to interpret it at all (for example, doing
an MMIO operation from the IRQ mode using the LR register leads to
very unexpected behaviours).

The fix is thus not to perform a call to vcpu_reg32() at all from
vcpu_reg(), and use whatever register number is supplied directly.
The only case we need to find out about the mapping is when we
actively generate a register access, which only occurs when injecting
a fault in a guest.
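
To illustrate why a second translation is wrong, here is a minimal sketch of the AArch32 to AArch64 register mapping as I read it in the ARM ARM. The names a32_mode, a32_to_a64 and a32_reg_to_a64 are made up for this example and are not kernel identifiers, and Hyp mode is left out. An access through LR in IRQ mode is reported by ESR_EL2 as register 16 already, so running 16 through a table like this again cannot give a meaningful answer.

```c
#include <assert.h>

/* Hypothetical enum for this sketch: the AArch32 modes with banked registers. */
enum a32_mode { A32_USR, A32_FIQ, A32_IRQ, A32_SVC, A32_ABT, A32_UND, A32_NR_MODES };

/*
 * AArch64 X-register index that backs AArch32 register Rn (n = 0..14)
 * in each mode. R13 is SP and R14 is LR; FIQ also banks R8-R12.
 */
static const unsigned char a32_to_a64[A32_NR_MODES][15] = {
	[A32_USR] = { 0, 1, 2, 3, 4, 5, 6, 7,  8,  9, 10, 11, 12, 13, 14 },
	[A32_FIQ] = { 0, 1, 2, 3, 4, 5, 6, 7, 24, 25, 26, 27, 28, 29, 30 },
	[A32_IRQ] = { 0, 1, 2, 3, 4, 5, 6, 7,  8,  9, 10, 11, 12, 17, 16 },
	[A32_SVC] = { 0, 1, 2, 3, 4, 5, 6, 7,  8,  9, 10, 11, 12, 19, 18 },
	[A32_ABT] = { 0, 1, 2, 3, 4, 5, 6, 7,  8,  9, 10, 11, 12, 21, 20 },
	[A32_UND] = { 0, 1, 2, 3, 4, 5, 6, 7,  8,  9, 10, 11, 12, 23, 22 },
};

/* Translate an AArch32 (mode, Rn) pair into the AArch64 view of the register. */
static unsigned int a32_reg_to_a64(enum a32_mode mode, unsigned int rn)
{
	assert((unsigned int)mode < A32_NR_MODES && rn < 15);
	return a32_to_a64[mode][rn];
}
```

This translation is only needed when software itself picks an AArch32 register, which is the fault injection case mentioned above.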

Cc: sta...@vger.kernel.org
Reviewed-by: Robin Murphy 
Signed-off-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm64/include/asm/kvm_emulate.h | 8 +---
 arch/arm64/kvm/inject_fault.c| 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index 17e92f0..3ca894e 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -99,11 +99,13 @@ static inline void vcpu_set_thumb(struct kvm_vcpu *vcpu)
*vcpu_cpsr(vcpu) |= COMPAT_PSR_T_BIT;
 }
 
+/*
+ * vcpu_reg should always be passed a register number coming from a
+ * read of ESR_EL2. Otherwise, it may give the wrong result on AArch32
+ * with banked registers.
+ */
 static inline unsigned long *vcpu_reg(const struct kvm_vcpu *vcpu, u8 reg_num)
 {
-   if (vcpu_mode_is_32bit(vcpu))
-   return vcpu_reg32(vcpu, reg_num);
-
return (unsigned long *)&vcpu_gp_regs(vcpu)->regs.regs[reg_num];
 }
 
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 85c5715..648112e 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -48,7 +48,7 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, 
u32 vect_offset)
 
/* Note: These now point to the banked copies */
*vcpu_spsr(vcpu) = new_spsr_value;
-   *vcpu_reg(vcpu, 14) = *vcpu_pc(vcpu) + return_offset;
+   *vcpu_reg32(vcpu, 14) = *vcpu_pc(vcpu) + return_offset;
 
/* Branch to exception vector */
if (sctlr & (1 << 13))
-- 
2.1.2.330.g565301e.dirty


[PULL 5/8] KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active

2015-11-24 Thread Christoffer Dall

We were incorrectly removing the active state from the physical
distributor on the timer interrupt when the timer output level was
deasserted.  We shouldn't be doing this without considering the virtual
interrupt's active state, because the architecture requires that when an
LR has the HW bit set and the pending or active bits set, then the
physical interrupt must also have the corresponding bits set.

This addresses an issue where we have been observing an inconsistency
between the LR state and the physical distributor state where the LR
state was active and the physical distributor was not active, which
shouldn't happen.
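
The rule can be sketched in isolation as follows. This is a simplified model and not the kernel code: struct lr_sketch stands in for struct vgic_lr, the LR_STATE_ACTIVE value is assumed, and the fallback to the distributor's per-cpu active bitmap that the patch also performs is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit value, for illustration only. */
#define LR_STATE_ACTIVE (1u << 1)

/* Simplified stand-in for struct vgic_lr. */
struct lr_sketch {
	int irq;
	unsigned int state;
};

/* True if any list register holds virt_irq in the active state. */
static bool lr_irq_is_active(const struct lr_sketch *lrs, size_t nr_lr,
			     int virt_irq)
{
	size_t i;

	assert(lrs != NULL || nr_lr == 0);
	for (i = 0; i < nr_lr; i++) {
		if (lrs[i].irq == virt_irq && (lrs[i].state & LR_STATE_ACTIVE))
			return true;
	}
	return false;
}

/*
 * The physical interrupt must stay active while either the timer output
 * level is asserted or the virtual interrupt is still active in an LR.
 */
static bool timer_phys_active(bool level, const struct lr_sketch *lrs,
			      size_t nr_lr, int virt_irq)
{
	return level || lr_irq_is_active(lrs, nr_lr, virt_irq);
}
```

With the old behaviour, a deasserted level alone was enough to clear the physical active state, which is the inconsistency described above.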

Reviewed-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 include/kvm/arm_vgic.h|  2 +-
 virt/kvm/arm/arch_timer.c | 28 +---
 virt/kvm/arm/vgic.c   | 34 ++
 3 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 9c747cb..d2f4147 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -342,10 +342,10 @@ int kvm_vgic_inject_mapped_irq(struct kvm *kvm, int cpuid,
   struct irq_phys_map *map, bool level);
 void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg);
 int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);
-int kvm_vgic_vcpu_active_irq(struct kvm_vcpu *vcpu);
 struct irq_phys_map *kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu,
   int virt_irq, int irq);
 int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
+bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
 
 #define irqchip_in_kernel(k)   (!!((k)->arch.vgic.in_kernel))
 #define vgic_initialized(k)    (!!((k)->arch.vgic.nr_cpus))
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 21a0ab2..69bca18 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -221,17 +221,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
kvm_timer_update_state(vcpu);
 
/*
-* If we enter the guest with the virtual input level to the VGIC
-* asserted, then we have already told the VGIC what we need to, and
-* we don't need to exit from the guest until the guest deactivates
-* the already injected interrupt, so therefore we should set the
-* hardware active state to prevent unnecessary exits from the guest.
-*
-* Conversely, if the virtual input level is deasserted, then always
-* clear the hardware active state to ensure that hardware interrupts
-* from the timer triggers a guest exit.
-*/
-   if (timer->irq.level)
+   * If we enter the guest with the virtual input level to the VGIC
+   * asserted, then we have already told the VGIC what we need to, and
+   * we don't need to exit from the guest until the guest deactivates
+   * the already injected interrupt, so therefore we should set the
+   * hardware active state to prevent unnecessary exits from the guest.
+   *
+   * Also, if we enter the guest with the virtual timer interrupt active,
+   * then it must be active on the physical distributor, because we set
+   * the HW bit and the guest must be able to deactivate the virtual and
+   * physical interrupt at the same time.
+   *
+   * Conversely, if the virtual input level is deasserted and the virtual
+   * interrupt is not active, then always clear the hardware active state
+   * to ensure that hardware interrupts from the timer triggers a guest
+   * exit.
+   */
+   if (timer->irq.level || kvm_vgic_map_is_active(vcpu, timer->map))
phys_active = true;
else
phys_active = false;
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 5335383..97e2c08 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1096,6 +1096,27 @@ static void vgic_retire_lr(int lr_nr, struct kvm_vcpu 
*vcpu)
vgic_set_lr(vcpu, lr_nr, vlr);
 }
 
+static bool dist_active_irq(struct kvm_vcpu *vcpu)
+{
+   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+
+   return test_bit(vcpu->vcpu_id, dist->irq_active_on_cpu);
+}
+
+bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, struct irq_phys_map *map)
+{
+   int i;
+
+   for (i = 0; i < vcpu->arch.vgic_cpu.nr_lr; i++) {
+   struct vgic_lr vlr = vgic_get_lr(vcpu, i);
+
+   if (vlr.irq == map->virt_irq && vlr.state & LR_STATE_ACTIVE)
+   return true;
+   }
+
+   return dist_active_irq(vcpu);
+}
+
 /*
  * An interrupt may have been disabled after being made pending on the
  * CPU interface (the classic case is a timer running while we're
@@ -1248,7 +1269,7 @@ static void __kvm_vgic_flush_hwstate(struct kvm_vcpu 
*vcpu)
 * may have been serviced from another vcpu. In all cases,

[PULL 3/8] arm64: KVM: Add workaround for Cortex-A57 erratum 834220

2015-11-24 Thread Christoffer Dall

From: Marc Zyngier 

Cortex-A57 parts up to r1p2 can misreport Stage 2 translation faults
when a Stage 1 permission fault or device alignment fault should
have been reported.

This patch implements the workaround (which is to validate that the
Stage-1 translation actually succeeds) by using code patching.
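
For readers unfamiliar with how "up to r1p2" is matched, here is a rough sketch of the revision check on a raw MIDR_EL1 value. The function name a57_needs_834220 and the helper macros are invented for this example; the kernel uses its own MIDR_RANGE() helper for this, and this sketch only assumes the architectural MIDR_EL1 field layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 fields (architectural layout). */
#define MIDR_IMPLEMENTER(m)	(((m) >> 24) & 0xffu)
#define MIDR_VARIANT(m)		(((m) >> 20) & 0xfu)	/* the N in rNpM */
#define MIDR_PARTNUM(m)		(((m) >> 4) & 0xfffu)
#define MIDR_REVISION(m)	((m) & 0xfu)		/* the M in rNpM */

#define IMPLEMENTER_ARM		0x41u
#define PARTNUM_CORTEX_A57	0xd07u

/* True for Cortex-A57 r0p0 through r1p2, the range named in the erratum. */
static bool a57_needs_834220(uint32_t midr)
{
	uint32_t rev;

	if (MIDR_IMPLEMENTER(midr) != IMPLEMENTER_ARM ||
	    MIDR_PARTNUM(midr) != PARTNUM_CORTEX_A57)
		return false;

	/* Pack rNpM as (N << 4) | M so that revisions compare in order. */
	rev = (MIDR_VARIANT(midr) << 4) | MIDR_REVISION(midr);
	return rev <= 0x12u;
}
```

For example, 0x411fd072 (Cortex-A57 r1p2) is in range, while 0x411fd073 (r1p3) is not.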

Cc: sta...@vger.kernel.org
Reviewed-by: Will Deacon 
Signed-off-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm64/Kconfig  | 21 +
 arch/arm64/include/asm/cpufeature.h |  3 ++-
 arch/arm64/kernel/cpu_errata.c  |  9 +
 arch/arm64/kvm/hyp.S|  6 ++
 4 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9ac16a4..e55848c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -316,6 +316,27 @@ config ARM64_ERRATUM_832075
 
  If unsure, say Y.
 
+config ARM64_ERRATUM_834220
+   bool "Cortex-A57: 834220: Stage 2 translation fault might be 
incorrectly reported in presence of a Stage 1 fault"
+   depends on KVM
+   default y
+   help
+ This option adds an alternative code sequence to work around ARM
+ erratum 834220 on Cortex-A57 parts up to r1p2.
+
+ Affected Cortex-A57 parts might report a Stage 2 translation
+ fault as the result of a Stage 1 fault for load crossing a
+ page boundary when there is a permission or device memory
+ alignment fault at Stage 1 and a translation fault at Stage 2.
+
+ The workaround is to verify that the Stage 1 translation
+ doesn't generate a fault before handling the Stage 2 fault.
+ Please note that this does not necessarily enable the workaround,
+ as it depends on the alternative framework, which will only patch
+ the kernel if an affected CPU is detected.
+
+ If unsure, say Y.
+
 config ARM64_ERRATUM_845719
bool "Cortex-A53: 845719: a load might read incorrect data"
depends on COMPAT
diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index 11d5bb0f..52722ee 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -29,8 +29,9 @@
 #define ARM64_HAS_PAN  4
 #define ARM64_HAS_LSE_ATOMICS  5
 #define ARM64_WORKAROUND_CAVIUM_23154  6
+#define ARM64_WORKAROUND_834220 7
 
-#define ARM64_NCAPS 7
+#define ARM64_NCAPS 8
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 24926f2..feb6b4e 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -75,6 +75,15 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
   (1 << MIDR_VARIANT_SHIFT) | 2),
},
 #endif
+#ifdef CONFIG_ARM64_ERRATUM_834220
+   {
+   /* Cortex-A57 r0p0 - r1p2 */
+   .desc = "ARM erratum 834220",
+   .capability = ARM64_WORKAROUND_834220,
+   MIDR_RANGE(MIDR_CORTEX_A57, 0x00,
+  (1 << MIDR_VARIANT_SHIFT) | 2),
+   },
+#endif
 #ifdef CONFIG_ARM64_ERRATUM_845719
{
/* Cortex-A53 r0p[01234] */
diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index 1599701..ff2e038 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -1015,9 +1015,15 @@ el1_trap:
b.ne 1f  // Not an abort we care about
 
/* This is an abort. Check for permission fault */
+alternative_if_not ARM64_WORKAROUND_834220
and x2, x1, #ESR_ELx_FSC_TYPE
cmp x2, #FSC_PERM
b.ne 1f  // Not a permission fault
+alternative_else
+   nop // Use the permission fault path to
+   nop // check for a valid S1 translation,
+   nop // regardless of the ESR value.
+alternative_endif
 
/*
 * Check for Stage-1 page table walk, which is guaranteed
-- 
2.1.2.330.g565301e.dirty


[PULL 1/8] ARM/arm64: KVM: test properly for a PTE's uncachedness

2015-11-24 Thread Christoffer Dall

From: Ard Biesheuvel 

The open coded tests for checking whether a PTE maps a page as
uncached use a flawed '(pte_val(xxx) & CONST) != CONST' pattern,
which is not guaranteed to work since the type of a mapping is
not a set of mutually exclusive bits.

For HYP mappings, the type is an index into the MAIR table (i.e., the
index itself does not contain any information whatsoever about the
type of the mapping), and for stage-2 mappings it is a bit field where
normal memory and device types are defined as follows:

#define MT_S2_NORMAL        0xf
#define MT_S2_DEVICE_nGnRE  0x1

I.e., masking *and* comparing with the latter matches on the former,
and we have been getting lucky merely because the S2 device mappings
also have the PTE_UXN bit set, or we would misidentify memory mappings
as device mappings.

Since the unmap_range() code path (which contains one instance of the
flawed test) is used both for HYP mappings and stage-2 mappings, and
considering the difference between the two, it is non-trivial to fix
this by rewriting the tests in place, as it would involve passing
down the type of mapping through all the functions.

However, since HYP mappings and stage-2 mappings both deal with host
physical addresses, we can simply check whether the mapping is backed
by memory that is managed by the host kernel, and only perform the
D-cache maintenance if this is the case.
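
The flaw is easy to reproduce on the bare attribute values quoted above. The sketch below is only an illustration: both function names are invented, it ignores the shift that places the attribute inside a real PTE, and the exact-match variant is shown for contrast only. It is not what this patch does, since an exact match would still be meaningless for HYP mappings where the field is a MAIR index.

```c
#include <stdbool.h>

#define MT_S2_NORMAL		0xfu
#define MT_S2_DEVICE_nGnRE	0x1u
#define MT_S2_MASK		0xfu	/* width of the attribute field, assumed */

/* The flawed pattern: mask with the constant, then compare with it. */
static bool looks_like_device_flawed(unsigned int attr)
{
	return (attr & MT_S2_DEVICE_nGnRE) == MT_S2_DEVICE_nGnRE;
}

/* Comparing the whole field tells the two stage-2 types apart. */
static bool is_s2_device_exact(unsigned int attr)
{
	return (attr & MT_S2_MASK) == MT_S2_DEVICE_nGnRE;
}
```

Because 0xf & 0x1 == 0x1, the flawed test reports normal memory as a device mapping, which is the case the PTE_UXN bit had been hiding.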

Cc: sta...@vger.kernel.org
Signed-off-by: Ard Biesheuvel 
Tested-by: Pavel Fedin 
Reviewed-by: Christoffer Dall 
Signed-off-by: Christoffer Dall 
---
 arch/arm/kvm/mmu.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 6984342..7dace90 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -98,6 +98,11 @@ static void kvm_flush_dcache_pud(pud_t pud)
__kvm_flush_dcache_pud(pud);
 }
 
+static bool kvm_is_device_pfn(unsigned long pfn)
+{
+   return !pfn_valid(pfn);
+}
+
 /**
  * stage2_dissolve_pmd() - clear and flush huge PMD entry
  * @kvm:   pointer to kvm structure.
@@ -213,7 +218,7 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
kvm_tlb_flush_vmid_ipa(kvm, addr);
 
/* No need to invalidate the cache for device mappings 
*/
-   if ((pte_val(old_pte) & PAGE_S2_DEVICE) != 
PAGE_S2_DEVICE)
+   if (!kvm_is_device_pfn(__phys_to_pfn(addr)))
kvm_flush_dcache_pte(old_pte);
 
put_page(virt_to_page(pte));
@@ -305,8 +310,7 @@ static void stage2_flush_ptes(struct kvm *kvm, pmd_t *pmd,
 
pte = pte_offset_kernel(pmd, addr);
do {
-   if (!pte_none(*pte) &&
-   (pte_val(*pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
+   if (!pte_none(*pte) && !kvm_is_device_pfn(__phys_to_pfn(addr)))
kvm_flush_dcache_pte(*pte);
} while (pte++, addr += PAGE_SIZE, addr != end);
 }
@@ -1037,11 +1041,6 @@ static bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
return kvm_vcpu_dabt_iswrite(vcpu);
 }
 
-static bool kvm_is_device_pfn(unsigned long pfn)
-{
-   return !pfn_valid(pfn);
-}
-
 /**
  * stage2_wp_ptes - write protect PMD range
  * @pmd:   pointer to pmd entry
-- 
2.1.2.330.g565301e.dirty


[PULL 8/8] arm64: kvm: report original PAR_EL1 upon panic

2015-11-24 Thread Christoffer Dall

From: Mark Rutland 

If we call __kvm_hyp_panic while a guest context is active, we call
__restore_sysregs before acquiring the system register values for the
panic, in the process throwing away the PAR_EL1 value at the point of
the panic.

This patch modifies __kvm_hyp_panic to stash the PAR_EL1 value prior to
restoring host register values, enabling us to report the original
values at the point of the panic.

Acked-by: Marc Zyngier 
Signed-off-by: Mark Rutland 
Signed-off-by: Christoffer Dall 
---
 arch/arm64/kvm/hyp.S | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index ce70817..86c2898 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -864,6 +864,10 @@ ENTRY(__kvm_flush_vm_context)
 ENDPROC(__kvm_flush_vm_context)
 
 __kvm_hyp_panic:
+   // Stash PAR_EL1 before corrupting it in __restore_sysregs
+   mrs x0, par_el1
+   push x0, xzr
+
// Guess the context by looking at VTTBR:
// If zero, then we're already a host.
// Otherwise restore a minimal host context before panicing.
@@ -898,7 +902,7 @@ __kvm_hyp_panic:
mrs x3, esr_el2
mrs x4, far_el2
mrs x5, hpfar_el2
-   mrs x6, par_el1
+   pop x6, xzr // active context PAR_EL1
mrs x7, tpidr_el2
 
mov lr, #(PSR_F_BIT | PSR_I_BIT | PSR_A_BIT | PSR_D_BIT |\
-- 
2.1.2.330.g565301e.dirty


[PULL 0/8] KVM/ARM Fixes for v4.4-rc3

2015-11-24 Thread Christoffer Dall

Hi Paolo,

Here's a set of fixes for KVM/ARM for v4.4-rc3 based on v4.4-rc2, because the
errata fixes don't apply on v4.4-rc1.  Let me know if you can pull this anyhow.

Thanks,
-Christoffer

The following changes since commit 1ec218373b8ebda821aec00bb156a9c94fad9cd4:

  Linux 4.4-rc2 (2015-11-22 16:45:59 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
tags/kvm-arm-for-v4.4-rc3

for you to fetch changes up to fbb4574ce9a37e15a9872860bf202f2be5bdf6c4:

  arm64: kvm: report original PAR_EL1 upon panic (2015-11-24 18:20:58 +0100)


KVM/ARM Fixes for v4.4-rc3.

Includes some timer fixes, properly unmapping PTEs, an errata fix, and two
tweaks to the EL2 panic code.


Ard Biesheuvel (1):
  ARM/arm64: KVM: test properly for a PTE's uncachedness

Christoffer Dall (3):
  KVM: arm/arm64: Fix preemptible timer active state crazyness
  KVM: arm/arm64: arch_timer: Preserve physical dist. active state on 
LR.active
  KVM: arm/arm64: vgic: Trust the LR state for HW IRQs

Marc Zyngier (2):
  arm64: KVM: Fix AArch32 to AArch64 register mapping
  arm64: KVM: Add workaround for Cortex-A57 erratum 834220

Mark Rutland (2):
  arm64: kvm: avoid %p in __kvm_hyp_panic
  arm64: kvm: report original PAR_EL1 upon panic

 arch/arm/kvm/arm.c   |  7 +
 arch/arm/kvm/mmu.c   | 15 +--
 arch/arm64/Kconfig   | 21 +++
 arch/arm64/include/asm/cpufeature.h  |  3 ++-
 arch/arm64/include/asm/kvm_emulate.h |  8 +++---
 arch/arm64/kernel/cpu_errata.c   |  9 +++
 arch/arm64/kvm/hyp.S | 14 --
 arch/arm64/kvm/inject_fault.c|  2 +-
 include/kvm/arm_vgic.h   |  2 +-
 virt/kvm/arm/arch_timer.c| 28 
 virt/kvm/arm/vgic.c  | 50 +---
 11 files changed, 100 insertions(+), 59 deletions(-)

[PULL 7/8] arm64: kvm: avoid %p in __kvm_hyp_panic

2015-11-24 Thread Christoffer Dall

From: Mark Rutland 

Currently __kvm_hyp_panic uses %p for values which are not pointers,
such as the ESR value. This can confusingly lead to "(null)" being
printed for the value.

Use %x instead, and only use %p for host pointers.
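
As a userspace analogy (not the hyp code itself, which builds its format string in assembly), a fixed-width hex conversion always prints the digits of the value, including when it is zero. The helper name format_esr is made up for this sketch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 32-bit syndrome value as eight hex digits.
 * Returns the number of characters written, as snprintf() does.
 */
static int format_esr(char *buf, size_t len, uint32_t esr)
{
	return snprintf(buf, len, "ESR:%08" PRIx32, esr);
}
```

With this, an ESR of zero comes out as "ESR:00000000" rather than as a null-pointer marker.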

Signed-off-by: Mark Rutland 
Acked-by: Marc Zyngier 
Cc: Christoffer Dall 
Signed-off-by: Christoffer Dall 
---
 arch/arm64/kvm/hyp.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index ff2e038..ce70817 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -914,7 +914,7 @@ __kvm_hyp_panic:
 ENDPROC(__kvm_hyp_panic)
 
 __hyp_panic_str:
-   .ascii  "HYP panic:\nPS:%08x PC:%p ESR:%p\nFAR:%p HPFAR:%p 
PAR:%p\nVCPU:%p\n\0"
+   .ascii  "HYP panic:\nPS:%08x PC:%016x ESR:%08x\nFAR:%016x HPFAR:%016x 
PAR:%016x\nVCPU:%p\n\0"
 
.align  2
 
-- 
2.1.2.330.g565301e.dirty


[PULL 4/8] KVM: arm/arm64: Fix preemptible timer active state crazyness

2015-11-24 Thread Christoffer Dall

We were setting the physical active state on the GIC distributor in a
preemptible section, which could cause us to set the active state on a
different physical CPU from the one we were actually going to run on;
havoc ensues.

Since we are no longer descheduling/scheduling soft timers in the
flush/sync timer functions, simply move the timer flush into a
non-preemptible section.

Reviewed-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm/kvm/arm.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index eab83b2..e06fd29 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -564,17 +564,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
vcpu_sleep(vcpu);
 
/*
-* Disarming the background timer must be done in a
-* preemptible context, as this call may sleep.
-*/
-   kvm_timer_flush_hwstate(vcpu);
-
-   /*
 * Preparing the interrupts to be injected also
 * involves poking the GIC, which must be done in a
 * non-preemptible context.
 */
preempt_disable();
+   kvm_timer_flush_hwstate(vcpu);
kvm_vgic_flush_hwstate(vcpu);
 
local_irq_disable();
-- 
2.1.2.330.g565301e.dirty


Re: [PATCH 11/21] arm64: KVM: Implement the core world switch

2015-11-24 Thread Marc Zyngier

On Tue, 24 Nov 2015 17:29:14 +
Alex Bennée  wrote:

> 
> Marc Zyngier  writes:
> 
> > Implement the core of the world switch in C. Not everything is there
> > yet, and there is nothing to re-enter the world switch either.
> >
> > But this already outlines the code structure well enough.
> >
> > Signed-off-by: Marc Zyngier 
> > ---
> >  arch/arm64/kvm/hyp/Makefile |   1 +
> >  arch/arm64/kvm/hyp/switch.c | 134 
> > 
> >  2 files changed, 135 insertions(+)
> >  create mode 100644 arch/arm64/kvm/hyp/switch.c
> >
> > diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> > index 1e1ff06..9c11b0f 100644
> > --- a/arch/arm64/kvm/hyp/Makefile
> > +++ b/arch/arm64/kvm/hyp/Makefile
> > @@ -8,3 +8,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += timer-sr.o
> >  obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
> >  obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
> >  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
> > +obj-$(CONFIG_KVM_ARM_HOST) += switch.o
> > diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> > new file mode 100644
> > index 000..a3af81a
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/switch.c
> > @@ -0,0 +1,134 @@
> > +/*
> > + * Copyright (C) 2015 - ARM Ltd
> > + * Author: Marc Zyngier 
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program.  If not, see .
> > + */
> > +
> > +#include "hyp.h"
> > +
> > +static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
> > +{
> > +   u64 val;
> > +
> > +   /*
> > +* We are about to set CPTR_EL2.TFP to trap all floating point
> > +* register accesses to EL2, however, the ARM ARM clearly states that
> > +* traps are only taken to EL2 if the operation would not otherwise
> > +* trap to EL1.  Therefore, always make sure that for 32-bit guests,
> > +* we set FPEXC.EN to prevent traps to EL1, when setting the TFP bit.
> > +*/
> > +   val = vcpu->arch.hcr_el2;
> > +   if (val & HCR_RW) {
> > +   write_sysreg(1 << 30, fpexc32_el2);
> > +   isb();
> > +   }
> > +   write_sysreg(val, hcr_el2);
> > +   write_sysreg(1 << 15, hstr_el2);
> > +   write_sysreg(CPTR_EL2_TTA | CPTR_EL2_TFP, cptr_el2);
> > +   write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
> > +}
> > +
> > +static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
> > +{
> > +   write_sysreg(HCR_RW, hcr_el2);
> > +   write_sysreg(0, hstr_el2);
> > +   write_sysreg(read_sysreg(mdcr_el2) & MDCR_EL2_HPMN_MASK, mdcr_el2);
> > +   write_sysreg(0, cptr_el2);
> > +}
> > +
> > +static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
> > +{
> > +   struct kvm *kvm = kern_hyp_va(vcpu->kvm);
> > +   write_sysreg(kvm->arch.vttbr, vttbr_el2);
> > +}
> > +
> > +static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
> > +{
> > +   write_sysreg(0, vttbr_el2);
> > +}
> > +
> > +static hyp_alternate_select(__vgic_call_save_state,
> > +   __vgic_v2_save_state, __vgic_v3_save_state,
> > +   ARM64_HAS_SYSREG_GIC_CPUIF);
> > +
> > +static hyp_alternate_select(__vgic_call_restore_state,
> > +   __vgic_v2_restore_state, __vgic_v3_restore_state,
> > +   ARM64_HAS_SYSREG_GIC_CPUIF);
> > +
> > +static void __hyp_text __vgic_save_state(struct kvm_vcpu *vcpu)
> > +{
> > +   __vgic_call_save_state()(vcpu);
> > +   write_sysreg(read_sysreg(hcr_el2) & ~HCR_INT_OVERRIDE, hcr_el2);
> > +}
> > +
> > +static void __hyp_text __vgic_restore_state(struct kvm_vcpu *vcpu)
> > +{
> > +   u64 val;
> > +
> > +   val = read_sysreg(hcr_el2);
> > +   val |=  HCR_INT_OVERRIDE;
> > +   val |= vcpu->arch.irq_lines;
> > +   write_sysreg(val, hcr_el2);
> > +
> > +   __vgic_call_restore_state()(vcpu);
> > +}
> > +
> > +int __hyp_text __guest_run(struct kvm_vcpu *vcpu)
> > +{
> > +   struct kvm_cpu_context *host_ctxt;
> > +   struct kvm_cpu_context *guest_ctxt;
> > +   u64 exit_code;
> > +
> > +   vcpu = kern_hyp_va(vcpu);
> > +   write_sysreg(vcpu, tpidr_el2);
> > +
> > +   host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context);
> > +   guest_ctxt = &vcpu->arch.ctxt;
> > +
> > +   __sysreg_save_state(host_ctxt);
> > +   __debug_cond_save_state(vcpu, &vcpu->arch.host_debug_state, host_ctxt);
> > +
> > +   __activate_traps(vcpu);
> > +   __activate_vm(vcpu);
> > +
> > +   __vgic_restore_state(vcpu);
> > +

Re: [PULL 0/8] KVM/ARM Fixes for v4.4-rc3

2015-11-24 Thread Paolo Bonzini



On 24/11/2015 18:35, Christoffer Dall wrote:
> Hi Paolo,
> 
> Here's a set of fixes for KVM/ARM for v4.4-rc3 based on v4.4-rc2, because the
> errata fixes don't apply on v4.4-rc1.  Let me know if you can pull this 
> anyhow.

Sure, pulled.

Paolo

> Thanks,
> -Christoffer
> 
> The following changes since commit 1ec218373b8ebda821aec00bb156a9c94fad9cd4:
> 
>   Linux 4.4-rc2 (2015-11-22 16:45:59 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
> tags/kvm-arm-for-v4.4-rc3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: MCG_CAP ABI breakage (was Re: [Qemu-devel] [PATCH] target-i386: Do not set MCG_SER_P by default)

2015-11-24 Thread Borislav Petkov

On Tue, Nov 24, 2015 at 02:36:20PM -0200, Eduardo Habkost wrote:
> KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It
> calls kvm_vcpu_ioctl_x86_set_mce(), which stores the
> IA32_MCi_{STATUS,ADDR,MISC} register contents at
> vcpu->arch.mce_banks.

Ah, correct. I've mistakenly followed KVM_X86_SETUP_MCE and not
KVM_X86_SET_MCE, sorry.

Ok, so this makes more sense now - there's kvm_inject_mce_oldstyle() in
qemu and kvm_arch_on_sigbus_vcpu() which is on the SIGBUS handler path
actually does:

if ((env->mcg_cap & MCG_SER_P) && addr
&& (code == BUS_MCEERR_AR || code == BUS_MCEERR_AO)) {
...

I betcha that MCG_SER_P is set on every guest, even !Intel ones. I need
to go stare more at that code.

> I didn't check the QEMU MCE code to confirm that, but I assume it
> is implemented there. In that case, MCG_SER_P in
> KVM_MCE_CAP_SUPPORTED just indicates it can be implemented by
> userspace, as long as it makes the appropriate KVM_X86_SET_MCE
> (or maybe KVM_SET_MSRS?) calls.

I think it is that kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus()
which handles SIGBUS with BUS_MCEERR_AR/BUS_MCEERR_AO si_code. See
mm/memory-failure.c:kill_proc() in the kernel where we do send those
signals to processes.

However, I still think the MCG_SER_P bit being set on
!Intel is wrong even though the recovery action done by
kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus() is correct.

Why, you're asking. :-)

Well, what happens above is that the qemu process gets the signal that
there was an uncorrectable error detected in its memory and it is either
required to do something: BUS_MCEERR_AR == Action Required or its action
is optional: BUS_MCEERR_AO == Action Optional.

The SER_P text in the SDM describes those two:

"SRAO errors indicate that some data in the system is corrupt, but the
data has not been consumed and the processor state is valid. SRAO errors
provide the additional error information for system software to perform
a recovery action. An SRAO error is indicated with UC=1, PCC=0, S=1,
EN=1 and AR=0 in the IA32_MCi_STATUS register."

and

"Software recoverable action required (SRAR) - a UCR error that requires
system software to take a recovery action on this processor before
scheduling another stream of execution on this processor. SRAR errors
indicate that the error was detected and raised at the point of the
consumption in the execution flow. An SRAR error is indicated with UC=1,
PCC=0, S=1, EN=1 and AR=1 in the IA32_MCi_STATUS register."

And for that we don't need to look at SER_P in qemu - we only need to
know what the error severity of the error is and then we go and handle
accordingly.

Because those two si_codes are purely software-defined. And the
application which gets that SIGBUS type doesn't need to care about
SER_P.

For example, AMD has similar error severities and they can be injected
into qemu too. And qemu can do the exact same recovery actions based on
the severity without even looking at the SER_P bit.
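
To make that concrete, here is a minimal sketch of a severity decision that depends only on the si_code of the SIGBUS and never consults MCG_SER_P. The enum and the function name are hypothetical and are not taken from qemu; the fallback #defines only cover C libraries that do not expose the Linux values.

```c
#include <signal.h>

/* Fallbacks for C libraries that do not expose the Linux si_code values. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical severity classes, for illustration only. */
enum mce_action {
    MCE_ACTION_NONE,      /* not a memory-failure SIGBUS */
    MCE_ACTION_OPTIONAL,  /* BUS_MCEERR_AO: data corrupt but not consumed */
    MCE_ACTION_REQUIRED,  /* BUS_MCEERR_AR: error hit at point of consumption */
};

/* Pick the recovery class from si_code alone; no MCG_CAP bit is involved. */
enum mce_action classify_sigbus(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        return MCE_ACTION_REQUIRED;
    case BUS_MCEERR_AO:
        return MCE_ACTION_OPTIONAL;
    default:
        return MCE_ACTION_NONE;
    }
}
```

A SIGBUS handler would pass info->si_code to classify_sigbus() and then choose what to inject into the guest based on the vendor being emulated.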

So here's the problem:

* SER_P is set on all guests and it puzzles kernels running on !Intel
guests.

* Hardware error recovery actions can be done regardless of that bit.

The only case where that bit makes sense is if the emulated hardware
itself is generating accurate MCEs and, as a result, wants to
generate accurate error signatures:

SRAO:   UC=1, PCC=0, S=1, EN=1 and AR=0
SRAR:   UC=1, PCC=0, S=1, EN=1 and AR=1

Those bits should have these settings only when the emulated hw actually
implements SER_P. Otherwise, you'd get those old crude MCEs which are
either uncorrectable and generate an #MC or are correctable errors.
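
As a rough sketch of what checking those signatures looks like, the following decodes the two patterns from an IA32_MCi_STATUS value. The bit positions are the architectural ones (UC=61, EN=60, PCC=57, S=56, AR=55); the enum and the function name are made up for this example.

```c
#include <stdint.h>

/* IA32_MCi_STATUS bits used by the SRAO/SRAR signatures. */
#define MCI_STATUS_UC   (1ULL << 61)
#define MCI_STATUS_EN   (1ULL << 60)
#define MCI_STATUS_PCC  (1ULL << 57)
#define MCI_STATUS_S    (1ULL << 56)
#define MCI_STATUS_AR   (1ULL << 55)

/* Hypothetical result type for this sketch. */
enum ucr_class { UCR_NONE, UCR_SRAO, UCR_SRAR };

/* Classify a status value as SRAO, SRAR, or neither. */
enum ucr_class classify_ucr(uint64_t status)
{
    const uint64_t mask = MCI_STATUS_UC | MCI_STATUS_EN |
                          MCI_STATUS_PCC | MCI_STATUS_S;
    /* Both signatures require UC=1, EN=1, S=1 and PCC=0. */
    const uint64_t want = MCI_STATUS_UC | MCI_STATUS_EN | MCI_STATUS_S;

    if ((status & mask) != want)
        return UCR_NONE;

    /* AR distinguishes "action required" from "action optional". */
    return (status & MCI_STATUS_AR) ? UCR_SRAR : UCR_SRAO;
}
```

The S and AR bits only carry this meaning when MCG_SER_P is advertised, which is why the signature should be generated only for CPU models that implement it.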

But ok, let me go do some staring at the examples you sent me
previously first. I might get a better idea after I sleep on it.

:-)

Thanks!

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Re: MCG_CAP ABI breakage (was Re: [Qemu-devel] [PATCH] target-i386: Do not set MCG_SER_P by default)

2015-11-24 Thread Eduardo Habkost

On Mon, Nov 23, 2015 at 05:43:14PM +0100, Borislav Petkov wrote:
> On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote:
> > On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
> > [...]
> > > In the case of this code, it looks like it's already broken
> > > because the resulting mcg_cap depends on host kernel capabilities
> > > (the ones reported by kvm_get_mce_cap_supported()), and the data
> > > initialized by target-i386/cpu.c:mce_init() is silently
> > > overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> > > before implementing a proper compatibility mechanism for
> > > mcg_cap.
> > 
> > Fortunately, when running Linux v2.6.37 and later,
> > kvm_arch_init_vcpu() won't actually change mcg_cap (see details
> > below).
> > 
> > But the code is broken if running on Linux between v2.6.32 and
> > v2.6.36: it will clear MCG_SER_P silently (and silently enable
> > MCG_SER_P when migrating to a newer host).
> > 
> > But I don't know what we should do on those cases. If we abort
> > initialization when the host doesn't support MCG_SER_P, all CPU
> > models with MCE and MCA enabled will become unrunnable on Linux
> > between v2.6.32 and v2.6.36. Should we do that, and simply ask
> > people to upgrade their kernels (or explicitly disable MCE) if
> > they want to run latest QEMU?
> > 
> > For reference, these are the capabilities returned by Linux:
> > * KVM_MAX_MCE_BANKS is 32 since
> >   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> > * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
> >   5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
> 
> The commit message of that one says that there is MCG_SER_P support in
> the kernel.
> 
> The previous commit talks about MCE injection with KVM_X86_SET_MCE
> ioctl but frankly, I don't see that. From looking at the current code,
> KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets
> MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is

KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It
calls kvm_vcpu_ioctl_x86_set_mce(), which stores the
IA32_MCi_{STATUS,ADDR,MISC} register contents at
vcpu->arch.mce_banks.

> 
> #define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
> 
> So it basically sets those two supported bits. But how is
> 
>   supported == actually present
> 
> ?!?!
> 
> That soo doesn't make any sense.
> 

I didn't check the QEMU MCE code to confirm that, but I assume it
is implemented there. In that case, MCG_SER_P in
KVM_MCE_CAP_SUPPORTED just indicates it can be implemented by
userspace, as long as it makes the appropriate KVM_X86_SET_MCE
(or maybe KVM_SET_MSRS?) calls.
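
For illustration, here is a sketch of how intersecting the requested MCG_CAP with what the host reports can drop MCG_SER_P without anyone noticing, and how the caller could detect it. MCG_CTL_P (bit 8), MCG_SER_P (bit 24) and the bank count in bits 7:0 are architectural; both function names are hypothetical and this is not the actual QEMU code.

```c
#include <stdint.h>

#define MCG_BANKCNT_MASK 0xffULL
#define MCG_CTL_P        (1ULL << 8)
#define MCG_SER_P        (1ULL << 24)

/*
 * Intersect the requested feature bits with the host's supported bits and
 * clamp the bank count. Anything the host lacks silently disappears.
 */
uint64_t negotiate_mcg_cap(uint64_t requested, uint64_t host_features,
                           unsigned int host_banks)
{
    uint64_t banks = requested & MCG_BANKCNT_MASK;

    if (banks > host_banks)
        banks = host_banks;

    return (requested & host_features & ~MCG_BANKCNT_MASK) | banks;
}

/* Feature bits that were requested but did not survive negotiation. */
uint64_t dropped_mcg_features(uint64_t requested, uint64_t effective)
{
    return (requested & ~effective) & ~MCG_BANKCNT_MASK;
}
```

On a host that only reports MCG_CTL_P, dropped_mcg_features() returns MCG_SER_P, which is the point where initialization could be aborted instead of continuing with a different guest ABI.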


-- 
Eduardo

Re: [PATCH 3/3] KVM: arm/arm64: vgic: Trust the LR state for HW IRQs

2015-11-24 Thread Marc Zyngier

On Tue, 24 Nov 2015 16:44:00 +0100
Christoffer Dall  wrote:

> We were probing the physical distributor state for the active state of a
> HW virtual IRQ, because we had seen evidence that the LR state was not
> cleared when the guest deactivated a virtual interrupt.
> 
> However, this issue turned out to be a software bug in the GIC, which
> was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29
> (KVM: arm/arm64: arch_timer: Preserve physical dist. active
> state on LR.active, 2015-11-24)
> 
> Therefore, get rid of the complexities and just look at the LR.
> 
> Signed-off-by: Christoffer Dall 
> ---
>  virt/kvm/arm/vgic.c | 16 ++--
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 9002f0d..55cd7e3 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1420,25 +1420,13 @@ static bool vgic_process_maintenance(struct kvm_vcpu 
> *vcpu)
>  static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr 
> vlr)
>  {
>   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
> - struct irq_phys_map *map;
> - bool phys_active;
>   bool level_pending;
> - int ret;
>  
>   if (!(vlr.state & LR_HW))
>   return false;
>  
> - map = vgic_irq_map_search(vcpu, vlr.irq);
> - BUG_ON(!map);
> -
> - ret = irq_get_irqchip_state(map->irq,
> - IRQCHIP_STATE_ACTIVE,
> - &phys_active);
> -
> - WARN_ON(ret);
> -
> - if (phys_active)
> - return 0;
> + if (vlr.state & LR_STATE_ACTIVE)
> + return false;
>  
>   spin_lock(&dist->lock);
>   level_pending = process_queued_irq(vcpu, lr, vlr);

Reviewed-by: Marc Zyngier 

M.
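
For readers following along, the early-exit logic that remains after this patch can be modelled roughly as below. The flag values and the struct are illustrative stand-ins rather than the kernel definitions, and the function covers only the decision, not the queue processing that follows it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values, assumed for this sketch. */
#define LR_STATE_PENDING (1u << 0)
#define LR_STATE_ACTIVE  (1u << 1)
#define LR_HW            (1u << 3)

/* Hypothetical stand-in for the kernel's struct vgic_lr. */
struct lr_snapshot {
    uint16_t irq;
    uint8_t  state;
};

/*
 * Decide from the list register alone whether a HW-mapped interrupt is
 * ready to be synced back. No physical distributor query is needed.
 */
bool hw_lr_needs_processing(struct lr_snapshot vlr)
{
    if (!(vlr.state & LR_HW))
        return false;           /* not a HW-mapped interrupt */

    if (vlr.state & LR_STATE_ACTIVE)
        return false;           /* guest has not deactivated it yet */

    return true;
}
```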
-- 
Jazz is not dead. It just smells funny.

Re: [PATCH 0/2] arm64: KVM: Fixes for 4.4-rc2

2015-11-24 Thread Christoffer Dall

On Mon, Nov 16, 2015 at 10:28:16AM +, Marc Zyngier wrote:
> Here's a couple of fixes for KVM/arm64:
> 
> - The first one addresses a misinterpretation of the architecture
>   spec, leading to the mishandling of I/O accesses generated from an
>   AArch32 guest using banked registers.
> 
> - The second one is a workaround for a Cortex-A57 erratum.
> 
Thanks, applied with cosmetic fixes and cc'ed to stable.

-Christoffer

Re: [PATCH v6 4/7] KVM: arm64: Implement vGICv3 distributor and redistributor access from userspace

2015-11-24 Thread kbuild test robot

Hi Pavel,

[auto build test ERROR on kvm/linux-next]
[also build test ERROR on v4.4-rc2 next-20151124]

url:
https://github.com/0day-ci/linux/commits/Pavel-Fedin/KVM-arm64-Implement-API-for-vGICv3-live-migration/20151124-171812
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git linux-next
config: arm64-allyesconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm64 

Note: the 
linux-review/Pavel-Fedin/KVM-arm64-Implement-API-for-vGICv3-live-migration/20151124-171812
 HEAD 03ad7de97d6a551c4a70e7723383007f54bf56a3 builds fine.
  It only hurts bisectibility.

All error/warnings (new ones prefixed by >>):

   arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c: In function 
'vgic_v3_has_attr':
>> arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1114:24: error: storage 
>> size of 'params' isn't known
 struct sys_reg_params params;
   ^
>> arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1134:7: error: 
>> 'KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS' undeclared (first use in this function)
 case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
  ^
   arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1134:7: note: each 
undeclared identifier is reported only once for each function it appears in
>> arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1135:25: error: 
>> 'KVM_DEV_ARM_VGIC_SYSREG_MASK' undeclared (first use in this function)
  regid = (attr->attr & KVM_DEV_ARM_VGIC_SYSREG_MASK) |
^
>> arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1137:3: error: implicit 
>> declaration of function 'find_reg_by_id' 
>> [-Werror=implicit-function-declaration]
  return find_reg_by_id(regid, &params, gic_v3_icc_reg_descs,
  ^
>> arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1137:41: error: 
>> 'gic_v3_icc_reg_descs' undeclared (first use in this function)
  return find_reg_by_id(regid, &params, gic_v3_icc_reg_descs,
^
   In file included from include/linux/thread_info.h:11:0,
from include/asm-generic/current.h:4,
from arch/arm64/include/generated/asm/current.h:1,
from include/linux/mutex.h:13,
from include/linux/kernfs.h:13,
from include/linux/sysfs.h:15,
from include/linux/kobject.h:21,
from include/linux/device.h:17,
from include/linux/node.h:17,
from include/linux/cpu.h:16,
from arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:38:
>> include/linux/bug.h:33:45: error: bit-field '<anonymous>' width not an 
>> integer constant
#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
^
   include/linux/compiler-gcc.h:64:28: note: in expansion of macro 
'BUILD_BUG_ON_ZERO'
#define __must_be_array(a) BUILD_BUG_ON_ZERO(__same_type((a), &(a)[0]))
   ^
   include/linux/kernel.h:54:59: note: in expansion of macro '__must_be_array'
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + 
__must_be_array(arr))
  ^
>> arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1138:11: note: in 
>> expansion of macro 'ARRAY_SIZE'
  ARRAY_SIZE(gic_v3_icc_reg_descs)) ?
  ^
   arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c:1114:24: warning: unused 
variable 'params' [-Wunused-variable]
 struct sys_reg_params params;
   ^
   cc1: some warnings being treated as errors

vim +/KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS +1134 
arch/arm64/kvm/../../../virt/kvm/arm/vgic-v3-emul.c

  1108  }
  1109  
  1110  static int vgic_v3_has_attr(struct kvm_device *dev,
    struct kvm_device_attr *attr)
  1112  {
  1113  phys_addr_t offset;
> 1114  struct sys_reg_params params;
  1115  u64 regid;
  1116  
  1117  switch (attr->group) {
  1118  case KVM_DEV_ARM_VGIC_GRP_ADDR:
  1119  switch (attr->attr) {
  1120  case KVM_VGIC_V2_ADDR_TYPE_DIST:
  1121  case KVM_VGIC_V2_ADDR_TYPE_CPU:
  1122  return -ENXIO;
  1123  case KVM_VGIC_V3_ADDR_TYPE_DIST:
  1124  case KVM_VGIC_V3_ADDR_TYPE_REDIST:
  1125  return 0;
  1126  }
  1127  break;
  1128  case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
  1129  offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
  1130

Re: [PATCH 2/3] KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active

2015-11-24 Thread Marc Zyngier

On Tue, 24 Nov 2015 16:43:59 +0100
Christoffer Dall  wrote:

> We were incorrectly removing the active state from the physical
> distributor on the timer interrupt when the timer output level was
> deasserted.  We shouldn't be doing this without considering the virtual
> interrupt's active state, because the architecture requires that when an
> LR has the HW bit set and the pending or active bits set, then the
> physical interrupt must also have the corresponding bits set.
> 
> This addresses an issue where we have been observing an inconsistency
> between the LR state and the physical distributor state where the LR
> state was active and the physical distributor was not active, which
> shouldn't happen.
> 
> Signed-off-by: Christoffer Dall 
> ---
>  include/kvm/arm_vgic.h|  2 +-
>  virt/kvm/arm/arch_timer.c | 28 +---
>  virt/kvm/arm/vgic.c   | 37 +
>  3 files changed, 43 insertions(+), 24 deletions(-)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 9c747cb..d2f4147 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -342,10 +342,10 @@ int kvm_vgic_inject_mapped_irq(struct kvm *kvm, int 
> cpuid,
>  struct irq_phys_map *map, bool level);
>  void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg);
>  int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);
> -int kvm_vgic_vcpu_active_irq(struct kvm_vcpu *vcpu);
>  struct irq_phys_map *kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu,
>  int virt_irq, int irq);
>  int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
> +bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
>  
>  #define irqchip_in_kernel(k) (!!((k)->arch.vgic.in_kernel))
>  #define vgic_initialized(k)  (!!((k)->arch.vgic.nr_cpus))
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 21a0ab2..69bca18 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -221,17 +221,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
>   kvm_timer_update_state(vcpu);
>  
>   /*
> -  * If we enter the guest with the virtual input level to the VGIC
> -  * asserted, then we have already told the VGIC what we need to, and
> -  * we don't need to exit from the guest until the guest deactivates
> -  * the already injected interrupt, so therefore we should set the
> -  * hardware active state to prevent unnecessary exits from the guest.
> -  *
> -  * Conversely, if the virtual input level is deasserted, then always
> -  * clear the hardware active state to ensure that hardware interrupts
> -  * from the timer triggers a guest exit.
> -  */
> - if (timer->irq.level)
> + * If we enter the guest with the virtual input level to the VGIC
> + * asserted, then we have already told the VGIC what we need to, and
> + * we don't need to exit from the guest until the guest deactivates
> + * the already injected interrupt, so therefore we should set the
> + * hardware active state to prevent unnecessary exits from the guest.
> + *
> + * Also, if we enter the guest with the virtual timer interrupt active,
> + * then it must be active on the physical distributor, because we set
> + * the HW bit and the guest must be able to deactivate the virtual and
> + * physical interrupt at the same time.
> + *
> + * Conversely, if the virtual input level is deasserted and the virtual
> + * interrupt is not active, then always clear the hardware active state
> + * to ensure that hardware interrupts from the timer triggers a guest
> + * exit.
> + */
> + if (timer->irq.level || kvm_vgic_map_is_active(vcpu, timer->map))
>   phys_active = true;
>   else
>   phys_active = false;
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 5335383..9002f0d 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1096,6 +1096,30 @@ static void vgic_retire_lr(int lr_nr, struct kvm_vcpu 
> *vcpu)
>   vgic_set_lr(vcpu, lr_nr, vlr);
>  }
>  
> +static int dist_active_irq(struct kvm_vcpu *vcpu)

bool? That'd be consistent with kvm_vgic_map_is_active.

> +{
> + struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
> +
> + if (!irqchip_in_kernel(vcpu->kvm))
> + return 0;
> +

I believe you can drop this test, as the only other use case for this
function is on the flush path, which obviously mandates an in-kernel
irqchip.

> + return test_bit(vcpu->vcpu_id, dist->irq_active_on_cpu);
> +}
> +
> +bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, struct irq_phys_map *map)
> +{
> + int i;
> +
> + for (i = 0; i < vcpu->arch.vgic_cpu.nr_lr; i++) {
> + struct vgic_lr vlr = vgic_get_lr(vcpu, i);
> +
> + if (vlr.irq ==

[PATCH v6 2/7] KVM: arm/arm64: Move endianness conversion out of vgic_attr_regs_access()

2015-11-24 Thread Pavel Fedin

mmio_data_read() and mmio_data_write(), originally used in this function,
are limited only to 32 bits. We are going to refactor this code and
eventually let it do 64-bit I/O for vGICv3. Therefore, our first step is
to get rid of this limitation.

We open up these inlines, which consist of endianness conversion and
masking. Masking is not used here (the mask is set to ~0), so we just
move out the remaining endianness conversion.
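
A userspace sketch of the resulting split of responsibilities, under stated assumptions: put_le32()/get_le32() stand in for the kernel's cpu_to_le32()/le32_to_cpu(), and fake_reg, fake_handle_mmio(), set_attr() and get_attr() are hypothetical names. The access path only moves little-endian bytes; conversion happens once on each side, in the callers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store a CPU-native value as little-endian bytes. */
void put_le32(uint8_t buf[4], uint32_t v)
{
    buf[0] = (uint8_t)(v);
    buf[1] = (uint8_t)(v >> 8);
    buf[2] = (uint8_t)(v >> 16);
    buf[3] = (uint8_t)(v >> 24);
}

/* Load a CPU-native value from little-endian bytes. */
uint32_t get_le32(const uint8_t buf[4])
{
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
           ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

/* Hypothetical emulated register; the handler sees LE data only. */
uint32_t fake_reg;

void fake_handle_mmio(uint8_t data[4], bool is_write)
{
    if (is_write)
        fake_reg = get_le32(data);
    else
        put_le32(data, fake_reg);
}

/* Callers convert at the edge, as set_attr/get_attr do after this patch. */
void set_attr(uint32_t reg)
{
    uint8_t data[4];

    put_le32(data, reg);
    fake_handle_mmio(data, true);
}

uint32_t get_attr(void)
{
    uint8_t data[4] = { 0 };

    fake_handle_mmio(data, false);
    return get_le32(data);
}
```

With the conversion pulled out to the callers, widening the buffer to 64 bits later only touches the edges, not the access path.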

Signed-off-by: Pavel Fedin 
---
 virt/kvm/arm/vgic-v2-emul.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/virt/kvm/arm/vgic-v2-emul.c b/virt/kvm/arm/vgic-v2-emul.c
index 1390797..959b9c6 100644
--- a/virt/kvm/arm/vgic-v2-emul.c
+++ b/virt/kvm/arm/vgic-v2-emul.c
@@ -663,7 +663,7 @@ static const struct vgic_io_range vgic_cpu_ranges[] = {
 
 static int vgic_attr_regs_access(struct kvm_device *dev,
 struct kvm_device_attr *attr,
-u32 *reg, bool is_write)
+__le32 *data, bool is_write)
 {
const struct vgic_io_range *r = NULL, *ranges;
phys_addr_t offset;
@@ -671,7 +671,6 @@ static int vgic_attr_regs_access(struct kvm_device *dev,
struct kvm_vcpu *vcpu, *tmp_vcpu;
struct vgic_dist *vgic;
struct kvm_exit_mmio mmio;
-   u32 data;
 
offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
cpuid = (attr->attr & KVM_DEV_ARM_VGIC_CPUID_MASK) >>
@@ -693,9 +692,7 @@ static int vgic_attr_regs_access(struct kvm_device *dev,
 
mmio.len = 4;
mmio.is_write = is_write;
-   mmio.data = &data;
-   if (is_write)
-   mmio_data_write(&mmio, ~0, *reg);
+   mmio.data = data;
switch (attr->group) {
case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
mmio.phys_addr = vgic->vgic_dist_base + offset;
@@ -743,9 +740,6 @@ static int vgic_attr_regs_access(struct kvm_device *dev,
offset -= r->base;
r->handle_mmio(vcpu, &mmio, offset);
 
-   if (!is_write)
-   *reg = mmio_data_read(&mmio, ~0);
-
ret = 0;
 out_vgic_unlock:
spin_unlock(&vgic->lock);
@@ -778,11 +772,13 @@ static int vgic_v2_set_attr(struct kvm_device *dev,
case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: {
u32 __user *uaddr = (u32 __user *)(long)attr->addr;
u32 reg;
+   __le32 data;
 
if (get_user(reg, uaddr))
return -EFAULT;
 
-   return vgic_attr_regs_access(dev, attr, &reg, true);
+   data = cpu_to_le32(reg);
+   return vgic_attr_regs_access(dev, attr, &data, true);
}
 
}
@@ -803,12 +799,12 @@ static int vgic_v2_get_attr(struct kvm_device *dev,
case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: {
u32 __user *uaddr = (u32 __user *)(long)attr->addr;
-   u32 reg = 0;
+   __le32 data = 0;
 
-   ret = vgic_attr_regs_access(dev, attr, &reg, false);
+   ret = vgic_attr_regs_access(dev, attr, &data, false);
if (ret)
return ret;
-   return put_user(reg, uaddr);
+   return put_user(le32_to_cpu(data), uaddr);
}
 
}
-- 
2.4.4


[PATCH v6 7/7] KVM: arm64: Implement vGICv3 CPU interface access

2015-11-24 Thread Pavel Fedin

Access size is always 64 bits. Since CPU interface state actually affects
only a single vCPU, no vGIC locking is done in order to avoid code
duplication. Just made sure that the vCPU is not running.
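
As an illustration of the write path of access_gic_ctlr() in the diff below, this sketch relocates the CBPR and EOImode bits of an ICC_CTLR_EL1 value into their ICH_VMCR positions. The shift values are the ones defined in this patch. The helper name is hypothetical, and the EOImode line is an assumption modelled on the CBPR line, since that part of the diff is cut off in the archive.

```c
#include <stdint.h>

/* Shift values as defined in the patch. */
#define ICC_CTLR_EL1_CBPR_SHIFT    0
#define ICC_CTLR_EL1_EOImode_SHIFT 1
#define ICH_VMCR_CBPR_SHIFT        4
#define ICH_VMCR_CBPR              (1u << ICH_VMCR_CBPR_SHIFT)
#define ICH_VMCR_EOIM_SHIFT        9
#define ICH_VMCR_EOIM              (1u << ICH_VMCR_EOIM_SHIFT)

/* Fold the guest-visible CTLR bits into a VMCR value (hypothetical helper). */
uint32_t ctlr_to_vmcr(uint32_t vmcr, uint64_t ctlr)
{
    /* Clear the two fields first, leaving every other VMCR bit untouched. */
    vmcr &= ~(ICH_VMCR_CBPR | ICH_VMCR_EOIM);

    /* Move each bit from its CTLR position to its VMCR position. */
    vmcr |= (uint32_t)(ctlr << (ICH_VMCR_CBPR_SHIFT -
                                ICC_CTLR_EL1_CBPR_SHIFT)) & ICH_VMCR_CBPR;
    vmcr |= (uint32_t)(ctlr << (ICH_VMCR_EOIM_SHIFT -
                                ICC_CTLR_EL1_EOImode_SHIFT)) & ICH_VMCR_EOIM;

    return vmcr;
}
```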

Signed-off-by: Pavel Fedin 
---
 arch/arm64/include/uapi/asm/kvm.h  |  14 ++-
 include/linux/irqchip/arm-gic-v3.h |  18 ++-
 virt/kvm/arm/vgic-v3-emul.c| 232 -
 3 files changed, 258 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 98bd047..ca32fe5 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -179,14 +179,14 @@ struct kvm_arch_memory_slot {
KVM_REG_ARM64_SYSREG_ ## n ## _MASK)
 
 #define __ARM64_SYS_REG(op0,op1,crn,crm,op2) \
-   (KVM_REG_ARM64 | KVM_REG_ARM64_SYSREG | \
-   ARM64_SYS_REG_SHIFT_MASK(op0, OP0) | \
+   (ARM64_SYS_REG_SHIFT_MASK(op0, OP0) | \
ARM64_SYS_REG_SHIFT_MASK(op1, OP1) | \
ARM64_SYS_REG_SHIFT_MASK(crn, CRN) | \
ARM64_SYS_REG_SHIFT_MASK(crm, CRM) | \
ARM64_SYS_REG_SHIFT_MASK(op2, OP2))
 
-#define ARM64_SYS_REG(...) (__ARM64_SYS_REG(__VA_ARGS__) | KVM_REG_SIZE_U64)
+#define ARM64_SYS_REG(...) (__ARM64_SYS_REG(__VA_ARGS__) | KVM_REG_ARM64 | \
+   KVM_REG_SIZE_U64 | KVM_REG_ARM64_SYSREG)
 
 #define KVM_REG_ARM_TIMER_CTL  ARM64_SYS_REG(3, 3, 14, 3, 1)
 #define KVM_REG_ARM_TIMER_CNT  ARM64_SYS_REG(3, 3, 14, 3, 2)
@@ -204,6 +204,14 @@ struct kvm_arch_memory_slot {
 #define KVM_DEV_ARM_VGIC_GRP_CTRL  4
 #define   KVM_DEV_ARM_VGIC_CTRL_INIT   0
 #define KVM_DEV_ARM_VGIC_GRP_REDIST_REGS 5
+#define KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS 6
+#define   KVM_DEV_ARM_VGIC_SYSREG_MASK (KVM_REG_ARM64_SYSREG_OP0_MASK | \
+KVM_REG_ARM64_SYSREG_OP1_MASK | \
+KVM_REG_ARM64_SYSREG_CRN_MASK | \
+KVM_REG_ARM64_SYSREG_CRM_MASK | \
+KVM_REG_ARM64_SYSREG_OP2_MASK)
+#define   KVM_DEV_ARM_VGIC_SYSREG(op0, op1, crn, crm, op2) \
+   __ARM64_SYS_REG(op0, op1, crn, crm, op2)
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_TYPE_SHIFT 24
diff --git a/include/linux/irqchip/arm-gic-v3.h 
b/include/linux/irqchip/arm-gic-v3.h
index 7e9f9d5..dfd2bed 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -261,8 +261,14 @@
 /*
  * CPU interface registers
  */
-#define ICC_CTLR_EL1_EOImode_drop_dir  (0U << 1)
-#define ICC_CTLR_EL1_EOImode_drop  (1U << 1)
+#define ICC_CTLR_EL1_CBPR_SHIFT0
+#define ICC_CTLR_EL1_EOImode_SHIFT 1
+#define ICC_CTLR_EL1_EOImode_drop_dir  (0U << ICC_CTLR_EL1_EOImode_SHIFT)
+#define ICC_CTLR_EL1_EOImode_drop  (1U << ICC_CTLR_EL1_EOImode_SHIFT)
+#define ICC_CTLR_EL1_PRIbits_MASK  (7U << 8)
+#define ICC_CTLR_EL1_IDbits_MASK   (7U << 11)
+#define ICC_CTLR_EL1_SEIS  (1U << 14)
+#define ICC_CTLR_EL1_A3V   (1U << 15)
 #define ICC_SRE_EL1_SRE(1U << 0)
 
 /*
@@ -287,6 +293,14 @@
 
 #define ICH_VMCR_CTLR_SHIFT0
 #define ICH_VMCR_CTLR_MASK (0x21f << ICH_VMCR_CTLR_SHIFT)
+#define ICH_VMCR_ENG0_SHIFT0
+#define ICH_VMCR_ENG0  (1 << ICH_VMCR_ENG0_SHIFT)
+#define ICH_VMCR_ENG1_SHIFT1
+#define ICH_VMCR_ENG1  (1 << ICH_VMCR_ENG1_SHIFT)
+#define ICH_VMCR_CBPR_SHIFT4
+#define ICH_VMCR_CBPR  (1 << ICH_VMCR_CBPR_SHIFT)
+#define ICH_VMCR_EOIM_SHIFT9
+#define ICH_VMCR_EOIM  (1 << ICH_VMCR_EOIM_SHIFT)
 #define ICH_VMCR_BPR1_SHIFT18
 #define ICH_VMCR_BPR1_MASK (7 << ICH_VMCR_BPR1_SHIFT)
 #define ICH_VMCR_BPR0_SHIFT21
diff --git a/virt/kvm/arm/vgic-v3-emul.c b/virt/kvm/arm/vgic-v3-emul.c
index d9d644c..113c386 100644
--- a/virt/kvm/arm/vgic-v3-emul.c
+++ b/virt/kvm/arm/vgic-v3-emul.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 
+#include "sys_regs.h"
 #include "vgic.h"
 
 static bool handle_mmio_rao_wi(struct kvm_vcpu *vcpu,
@@ -991,6 +992,227 @@ void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg)
vgic_kick_vcpus(vcpu->kvm);
 }
 
+static bool access_gic_ctlr(struct kvm_vcpu *vcpu,
+   const struct sys_reg_params *p,
+   const struct sys_reg_desc *r)
+{
+   u64 val;
+   struct vgic_v3_cpu_if *vgicv3 = &vcpu->arch.vgic_cpu.vgic_v3;
+
+   if (p->is_write) {
+   val = *p->val;
+
+   vgicv3->vgic_vmcr &= ~(ICH_VMCR_CBPR|ICH_VMCR_EOIM);
+   vgicv3->vgic_vmcr |= (val << (ICH_VMCR_CBPR_SHIFT -
+   ICC_CTLR_EL1_CBPR_SHIFT)) &
+   ICH_VMCR_CBPR;
+   vgicv3->vgic_vmcr |= (val

[PATCH v6 6/7] KVM: arm64: Introduce find_reg_by_id()

2015-11-24 Thread Pavel Fedin

In order to implement vGICv3 CPU interface access, we will need to
perform table lookup of system registers. We would need both
index_to_params() and find_reg() exported for that purpose, but instead
we export a single function which combines them both.
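
A self-contained sketch of the idea, with hypothetical types: decode the ID into its fields, then look the fields up in a table. The field layout in encode_id() is an assumption made for this example (Op0 at bit 14, Op1 at 11, CRn at 7, CRm at 3, Op2 at 0), and the lookup is a plain linear scan for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for sys_reg_params / sys_reg_desc. */
struct reg_params { uint8_t op0, op1, crn, crm, op2; };
struct reg_desc   { struct reg_params p; const char *name; };

/* Assumed field layout, for illustration only. */
uint64_t encode_id(unsigned op0, unsigned op1, unsigned crn,
                   unsigned crm, unsigned op2)
{
    return ((uint64_t)op0 << 14) | ((uint64_t)op1 << 11) |
           ((uint64_t)crn << 7) | ((uint64_t)crm << 3) | op2;
}

/* Decode an ID; reject anything with bits outside the known fields. */
bool id_to_params(uint64_t id, struct reg_params *p)
{
    if (id & ~0xffffULL)
        return false;

    p->op0 = (id >> 14) & 0x3;
    p->op1 = (id >> 11) & 0x7;
    p->crn = (id >> 7) & 0xf;
    p->crm = (id >> 3) & 0xf;
    p->op2 = id & 0x7;
    return true;
}

/* Linear table lookup on the decoded fields. */
const struct reg_desc *find_reg(const struct reg_params *p,
                                const struct reg_desc table[], size_t num)
{
    for (size_t i = 0; i < num; i++) {
        const struct reg_params *t = &table[i].p;

        if (t->op0 == p->op0 && t->op1 == p->op1 && t->crn == p->crn &&
            t->crm == p->crm && t->op2 == p->op2)
            return &table[i];
    }
    return NULL;
}

/* The combined helper: one call instead of exporting both pieces. */
const struct reg_desc *find_reg_by_id(uint64_t id, struct reg_params *params,
                                      const struct reg_desc table[],
                                      size_t num)
{
    if (!id_to_params(id, params))
        return NULL;

    return find_reg(params, table, num);
}
```

Callers get a single NULL check for both failure modes (undecodable ID and unknown register), which is what lets the two invariant_sys_reg helpers in the diff shrink.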

Signed-off-by: Pavel Fedin 
Reviewed-by: Andre Przywara 
---
 arch/arm64/kvm/sys_regs.c | 22 +++---
 arch/arm64/kvm/sys_regs.h |  4 
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 5001cc8..d7ac611 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1276,6 +1276,17 @@ static bool index_to_params(u64 id, struct 
sys_reg_params *params)
}
 }
 
+const struct sys_reg_desc *find_reg_by_id(u64 id,
+ struct sys_reg_params *params,
+ const struct sys_reg_desc table[],
+ unsigned int num)
+{
+   if (!index_to_params(id, params))
+   return NULL;
+
+   return find_reg(params, table, num);
+}
+
 /* Decode an index value, and find the sys_reg_desc entry. */
 static const struct sys_reg_desc *index_to_sys_reg_desc(struct kvm_vcpu *vcpu,
u64 id)
@@ -1403,10 +1414,8 @@ static int get_invariant_sys_reg(u64 id, void __user 
*uaddr)
struct sys_reg_params params;
const struct sys_reg_desc *r;
 
-   if (!index_to_params(id, &params))
-   return -ENOENT;
-
-   r = find_reg(&params, invariant_sys_regs, 
ARRAY_SIZE(invariant_sys_regs));
+   r = find_reg_by_id(id, &params, invariant_sys_regs,
+  ARRAY_SIZE(invariant_sys_regs));
+  ARRAY_SIZE(invariant_sys_regs));
if (!r)
return -ENOENT;
 
@@ -1420,9 +1429,8 @@ static int set_invariant_sys_reg(u64 id, void __user 
*uaddr)
int err;
u64 val = 0; /* Make sure high bits are 0 for 32-bit regs */
 
-   if (!index_to_params(id, &params))
-   return -ENOENT;
-   r = find_reg(&params, invariant_sys_regs, 
ARRAY_SIZE(invariant_sys_regs));
+   r = find_reg_by_id(id, &params, invariant_sys_regs,
+  ARRAY_SIZE(invariant_sys_regs));
if (!r)
return -ENOENT;
 
diff --git a/arch/arm64/kvm/sys_regs.h b/arch/arm64/kvm/sys_regs.h
index 3267518..0646108 100644
--- a/arch/arm64/kvm/sys_regs.h
+++ b/arch/arm64/kvm/sys_regs.h
@@ -136,6 +136,10 @@ static inline int cmp_sys_reg(const struct sys_reg_desc 
*i1,
return i1->Op2 - i2->Op2;
 }
 
+const struct sys_reg_desc *find_reg_by_id(u64 id,
+ struct sys_reg_params *params,
+ const struct sys_reg_desc table[],
+ unsigned int num);
 
 #define Op0(_x).Op0 = _x
 #define Op1(_x).Op1 = _x
-- 
2.4.4


[PATCH v6 4/7] KVM: arm64: Implement vGICv3 distributor and redistributor access from userspace

2015-11-24 Thread Pavel Fedin

Access is done similarly to vGICv2, using
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
with the KVM_SET_DEVICE_ATTR and KVM_GET_DEVICE_ATTR ioctls.

The access size for vGICv3 is 64 bits, and vgic_attr_regs_access() is fixed
to support this. The trick with vgic_v3_get_reg_size() is necessary because
most GICv3 registers are actually 32-bit, and their accessors do not
distinguish between lower and upper words (offset & 3). Accessing these
registers with len == 8 would cause rollover: a write would overwrite the
lower word with the upper one (which would normally be 0), and a read would
duplicate the same word in both halves.

Signed-off-by: Pavel Fedin 
---
 arch/arm64/include/uapi/asm/kvm.h  |   1 +
 include/linux/irqchip/arm-gic-v3.h |   1 +
 virt/kvm/arm/vgic-v3-emul.c| 112 -
 virt/kvm/arm/vgic.c|   4 +-
 4 files changed, 102 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 2d4ca4b..98bd047 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -203,6 +203,7 @@ struct kvm_arch_memory_slot {
 #define KVM_DEV_ARM_VGIC_GRP_NR_IRQS   3
 #define KVM_DEV_ARM_VGIC_GRP_CTRL  4
 #define   KVM_DEV_ARM_VGIC_CTRL_INIT   0
+#define KVM_DEV_ARM_VGIC_GRP_REDIST_REGS 5
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_TYPE_SHIFT 24
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index 95388a7..7e9f9d5 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -43,6 +43,7 @@
 #define GICD_IGRPMODR  0x0D00
 #define GICD_NSACR 0x0E00
 #define GICD_IROUTER   0x6000
+#define GICD_IROUTER1019   0x7FD8
 #define GICD_IDREGS0xFFD0
 #define GICD_PIDR2 0xFFE8
 
diff --git a/virt/kvm/arm/vgic-v3-emul.c b/virt/kvm/arm/vgic-v3-emul.c
index e661e7f..d9d644c 100644
--- a/virt/kvm/arm/vgic-v3-emul.c
+++ b/virt/kvm/arm/vgic-v3-emul.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -990,6 +991,77 @@ void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg)
vgic_kick_vcpus(vcpu->kvm);
 }
 
+static u32 vgic_v3_get_reg_size(u32 group, u32 offset)
+{
+   switch (group) {
+   case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
+   if (offset >= GICD_IROUTER && offset <= GICD_IROUTER1019)
+   return 8;
+   else
+   return 4;
+   break;
+
+   case KVM_DEV_ARM_VGIC_GRP_REDIST_REGS:
+   if ((offset == GICR_TYPER) ||
+   (offset >= GICR_SETLPIR && offset <= GICR_INVALLR))
+   return 8;
+   else
+   return 4;
+   break;
+
+   default:
+   BUG();
+   }
+}
+
+static int vgic_v3_attr_regs_access(struct kvm_device *dev,
+   struct kvm_device_attr *attr,
+   u64 *reg, bool is_write)
+{
+   const struct vgic_io_range *ranges;
+   phys_addr_t offset;
+   struct kvm_vcpu *vcpu;
+   u64 cpuid;
+   struct vgic_dist *vgic = &dev->kvm->arch.vgic;
+   struct kvm_exit_mmio mmio;
+   __le64 data;
+   int ret;
+
+   offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
+   cpuid = attr->attr >> KVM_DEV_ARM_VGIC_CPUID_SHIFT;
+
+   /* Convert affinity ID from our packed to normal form */
+   cpuid = (cpuid & 0x00ffffff) | ((cpuid & 0xff000000) << 8);
+   vcpu = kvm_mpidr_to_vcpu(dev->kvm, cpuid);
+   if (!vcpu)
+   return -EINVAL;
+
+   switch (attr->group) {
+   case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
+   mmio.phys_addr = vgic->vgic_dist_base + offset;
+   ranges = vgic_v3_dist_ranges;
+   break;
+   case KVM_DEV_ARM_VGIC_GRP_REDIST_REGS:
+   mmio.phys_addr = vgic->vgic_redist_base + offset;
+   ranges = vgic_redist_ranges;
+   break;
+   default:
+   return -ENXIO;
+   }
+
+   data = cpu_to_le64(*reg);
+
+   mmio.len = vgic_v3_get_reg_size(attr->group, offset);
+   mmio.is_write = is_write;
+   mmio.data = &data;
+   mmio.private = vcpu; /* Redistributor handlers expect this */
+
+   ret = vgic_attr_regs_access(vcpu, ranges, &mmio, offset);
+
+   *reg = le64_to_cpu(data);
+   return ret;
+}
+
 static int vgic_v3_create(struct kvm_device *dev, u32 type)
 {
return kvm_vgic_create(dev->kvm, type);
@@ -1003,42 +1075,45 @@ static void vgic_v3_destroy(struct kvm_device *dev)
 static int vgic_v3_set_attr(struct kvm_device *dev,
struct kvm_device_attr *attr)
 {
+   u64 __user *uaddr = (u64 __user

[PATCH v6 0/7] KVM: arm64: Implement API for vGICv3 live migration

2015-11-24 Thread Pavel Fedin

This patchset adds necessary userspace API in order to support vGICv3 live
migration. GICv3 registers are accessed using device attribute ioctls,
similar to GICv2.
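
To illustrate the attribute layout described in the documentation patch of
this series (mpidr in bits 63:32 of attr, register offset in bits 31:0), here
is a minimal user-space sketch. The helper names pack_mpidr(),
vgic_v3_make_attr() and packed_to_mpidr() are hypothetical and do not exist
in the kernel or in QEMU; the last one only mirrors the fact that Aff3 sits
in bits 39:32 of the architectural MPIDR.

```c
#include <stdint.h>

/* Hypothetical helper: pack Aff3..Aff0 into the 32-bit mpidr field. */
static uint32_t pack_mpidr(uint8_t aff3, uint8_t aff2,
			   uint8_t aff1, uint8_t aff0)
{
	return ((uint32_t)aff3 << 24) | ((uint32_t)aff2 << 16) |
	       ((uint32_t)aff1 << 8) | (uint32_t)aff0;
}

/* Hypothetical helper: mpidr goes to bits 63:32, offset to bits 31:0. */
static uint64_t vgic_v3_make_attr(uint32_t packed_mpidr, uint32_t offset)
{
	return ((uint64_t)packed_mpidr << 32) | offset;
}

/*
 * Hypothetical helper: convert the packed form back to the architectural
 * MPIDR layout, where Aff0..Aff2 stay in bits 23:0 and Aff3 moves up to
 * bits 39:32.
 */
static uint64_t packed_to_mpidr(uint32_t packed)
{
	return ((uint64_t)packed & 0x00ffffff) |
	       (((uint64_t)packed & 0xff000000) << 8);
}
```

The resulting value would be placed in the attr field of struct
kvm_device_attr before issuing KVM_GET_DEVICE_ATTR or KVM_SET_DEVICE_ATTR.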

v5 => v6:
- Rebased on top of linux-next of 23.11.2015
- Use original API documentation patch, with minor changes only.
- Quit reusing KVM_DEV_ARM_VGIC_CPUID_MASK, do not touch vGICv2 API at all.
- Fixed some issues reported by the new checkpatch

v4 => v5:
- Adapted to new API by Peter Maydell, Marc Zyngier and Christoffer Dall.
  Acked-by's on the documentation were dropped, just in case, because I
  slightly adjusted it. Additionally, I merged all doc updates into one
  patch.

v3 => v4:
- Split pure refactoring from anything else
- Documentation brought up to date
- Cleaned up 'mmio' structure usage in vgic_attr_regs_access(),
  use call_range_handler() for 64-bit access handling
- Rebased on new linux-next

v2 => v3:
- KVM_DEV_ARM_VGIC_CPUID_MASK enlarged to 20 bits, allowing more than 256
  CPUs.
- Bug fix: Correctly set mmio->private, necessary for redistributor access.
- Added accessors for ICC_AP0R and ICC_AP1R registers
- Rebased on new linux-next

v1 => v2:
- Do not use generic register get/set API for CPU interface, use only
  device attributes.
- Introduce size specifier for distributor and redistributor register
  accesses, do not assume size any more.
- Lots of refactoring and reusable code extraction.
- Added forgotten documentation

Christoffer Dall (1):
  KVM: arm/arm64: Add VGICv3 save/restore API documentation

Pavel Fedin (6):
  KVM: arm/arm64: Move endianness conversion out of
vgic_attr_regs_access()
  KVM: arm/arm64: Refactor vGIC attributes handling code
  KVM: arm64: Implement vGICv3 distributor and redistributor access from
userspace
  KVM: arm64: Refactor system register handlers
  KVM: arm64: Introduce find_reg_by_id()
  KVM: arm64: Implement vGICv3 CPU interface access

 Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 116 
 Documentation/virtual/kvm/devices/arm-vgic.txt|  21 +-
 arch/arm64/include/uapi/asm/kvm.h |  15 +-
 arch/arm64/kvm/sys_regs.c |  83 +++---
 arch/arm64/kvm/sys_regs.h |   8 +-
 arch/arm64/kvm/sys_regs_generic_v8.c  |   2 +-
 include/linux/irqchip/arm-gic-v3.h|  19 +-
 virt/kvm/arm/vgic-v2-emul.c   | 124 ++--
 virt/kvm/arm/vgic-v3-emul.c   | 342 +-
 virt/kvm/arm/vgic.c   |  57 
 virt/kvm/arm/vgic.h   |   3 +
 11 files changed, 616 insertions(+), 174 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/arm-vgic-v3.txt

-- 
2.4.4


[PATCH v6 1/7] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-11-24 Thread Pavel Fedin

From: Christoffer Dall 

Factor out the GICv3-specific documentation into a separate
documentation file.  Add description for how to access distributor,
redistributor, and CPU interface registers for GICv3 in this new file.

Acked-by: Peter Maydell 
Acked-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
Signed-off-by: Pavel Fedin 
---
 Documentation/virtual/kvm/devices/arm-vgic-v3.txt | 116 ++
 Documentation/virtual/kvm/devices/arm-vgic.txt|  21 +---
 2 files changed, 120 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/arm-vgic-v3.txt

diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
new file mode 100644
index 000..24e2f6b
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
@@ -0,0 +1,116 @@
+ARM Virtual Generic Interrupt Controller v3 and later (VGICv3)
+==============================================================
+
+
+Device types supported:
+  KVM_DEV_TYPE_ARM_VGIC_V3 ARM Generic Interrupt Controller v3.0
+
+Only one VGIC instance may be instantiated through this API.  The created VGIC
+will act as the VM interrupt controller, requiring emulated user-space devices
+to inject interrupts to the VGIC instead of directly to CPUs.  It is not
+possible to create both a GICv3 and GICv2 on the same VM.
+
+Creating a guest GICv3 device requires a host GICv3 as well.
+
+Groups:
+  KVM_DEV_ARM_VGIC_GRP_ADDR
+  Attributes:
+KVM_VGIC_V3_ADDR_TYPE_DIST (rw, 64-bit)
+  Base address in the guest physical address space of the GICv3 distributor
+  register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
+  This address needs to be 64K aligned and the region covers 64 KByte.
+
+KVM_VGIC_V3_ADDR_TYPE_REDIST (rw, 64-bit)
+  Base address in the guest physical address space of the GICv3
+  redistributor register mappings. There are two 64K pages for each
+  VCPU and all of the redistributor pages are contiguous.
+  Only valid for KVM_DEV_TYPE_ARM_VGIC_V3.
+  This address needs to be 64K aligned.
+
+
+  KVM_DEV_ARM_VGIC_GRP_DIST_REGS
+  KVM_DEV_ARM_VGIC_GRP_REDIST_REGS
+  Attributes:
+The attr field of kvm_device_attr encodes two values:
+bits:     | 63   ....  32  |  31   ....    0 |
+values:   |      mpidr     |      offset     |
+
+All distributor regs are (rw, 64-bit).
+
+KVM_DEV_ARM_VGIC_GRP_DIST_REGS accesses the main distributor registers.
+KVM_DEV_ARM_VGIC_GRP_REDIST_REGS accesses the redistributor of the CPU
+specified by the mpidr.
+
+The offset is relative to the "[Re]Distributor base address" as defined
+in the GICv3/4 specs.  Getting or setting such a register has the same
+effect as reading or writing the register on real hardware, and the mpidr
+field is used to specify which redistributor is accessed.  The mpidr is
+ignored for the distributor.
+
+The mpidr encoding is based on the affinity information in the
+architecture defined MPIDR, and the field is encoded as follows:
+  | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
+  |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
+
+Note that distributor fields are not banked, but return the same value
+regardless of the mpidr used to access the register.
+  Limitations:
+- Priorities are not implemented, and registers are RAZ/WI
+  Errors:
+-ENXIO: Getting or setting this register is not yet supported
+-EBUSY: One or more VCPUs are running
+
+
+  KVM_DEV_ARM_VGIC_CPU_SYSREGS
+  Attributes:
+The attr field of kvm_device_attr encodes two values:
+bits:     | 63   ....  32 | 31  ....  16 | 15  ....  0 |
+values:   |     mpidr     |      RES     |    instr    |
+
+The mpidr field encodes the CPU ID based on the affinity information in the
+architecture defined MPIDR, and the field is encoded as follows:
+  | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
+  |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
+KVM_DEV_ARM_VGIC_SYSREG() macro is provided for building register ID.
+
+The instr field encodes the system register to access based on the fields
+defined in the A64 instruction set encoding for system register access
+(RES means the bits are reserved for future use and should be zero):
+
+  | 15 ... 14 | 13 ... 11 | 10 ... 7 | 6 ... 3 | 2 ... 0 |
+  |   Op 0    |    Op1    |   CRn    |   CRm   |   Op2   |
+
+All system regs accessed through this API are (rw, 64-bit).
+
+KVM_DEV_ARM_VGIC_CPU_SYSREGS accesses the CPU interface registers for the
+CPU specified by the mpidr field.
+
+
+  Limitations:
+- Priorities are not implemented, and registers are RAZ/WI
+  Errors:
+-ENXIO: Getting or setting this register is not yet supported
+-EBUSY: VCPU is

[PATCH v6 5/7] KVM: arm64: Refactor system register handlers

2015-11-24 Thread Pavel Fedin

Replace Rt with a data pointer in struct sys_reg_params. This allows the
system register handling code to be reused in the implementation of the
vGICv3 CPU interface access API. Additionally, get rid of the "massive hack"
in kvm_handle_cp_64().

Signed-off-by: Pavel Fedin 
---
 arch/arm64/kvm/sys_regs.c| 61 +---
 arch/arm64/kvm/sys_regs.h|  4 +--
 arch/arm64/kvm/sys_regs_generic_v8.c |  2 +-
 3 files changed, 32 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 87a64e8..5001cc8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -102,7 +102,7 @@ static bool access_vm_reg(struct kvm_vcpu *vcpu,
 
BUG_ON(!p->is_write);
 
-   val = *vcpu_reg(vcpu, p->Rt);
+   val = *p->val;
if (!p->is_aarch32) {
vcpu_sys_reg(vcpu, r->reg) = val;
} else {
@@ -125,13 +125,10 @@ static bool access_gic_sgi(struct kvm_vcpu *vcpu,
   const struct sys_reg_params *p,
   const struct sys_reg_desc *r)
 {
-   u64 val;
-
if (!p->is_write)
return read_from_write_only(vcpu, p);
 
-   val = *vcpu_reg(vcpu, p->Rt);
-   vgic_v3_dispatch_sgi(vcpu, val);
+   vgic_v3_dispatch_sgi(vcpu, *p->val);
 
return true;
 }
@@ -153,7 +150,7 @@ static bool trap_oslsr_el1(struct kvm_vcpu *vcpu,
if (p->is_write) {
return ignore_write(vcpu, p);
} else {
-   *vcpu_reg(vcpu, p->Rt) = (1 << 3);
+   *p->val = (1 << 3);
return true;
}
 }
@@ -167,7 +164,7 @@ static bool trap_dbgauthstatus_el1(struct kvm_vcpu *vcpu,
} else {
u32 val;
asm volatile("mrs %0, dbgauthstatus_el1" : "=r" (val));
-   *vcpu_reg(vcpu, p->Rt) = val;
+   *p->val = val;
return true;
}
 }
@@ -204,13 +201,13 @@ static bool trap_debug_regs(struct kvm_vcpu *vcpu,
const struct sys_reg_desc *r)
 {
if (p->is_write) {
-   vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
+   vcpu_sys_reg(vcpu, r->reg) = *p->val;
vcpu->arch.debug_flags |= KVM_ARM64_DEBUG_DIRTY;
} else {
-   *vcpu_reg(vcpu, p->Rt) = vcpu_sys_reg(vcpu, r->reg);
+   *p->val = vcpu_sys_reg(vcpu, r->reg);
}
 
-   trace_trap_reg(__func__, r->reg, p->is_write, *vcpu_reg(vcpu, p->Rt));
+   trace_trap_reg(__func__, r->reg, p->is_write, *p->val);
 
return true;
 }
@@ -228,7 +225,7 @@ static inline void reg_to_dbg(struct kvm_vcpu *vcpu,
  const struct sys_reg_params *p,
  u64 *dbg_reg)
 {
-   u64 val = *vcpu_reg(vcpu, p->Rt);
+   u64 val = *p->val;
 
if (p->is_32bit) {
val &= 0xffffffffUL;
@@ -248,7 +245,7 @@ static inline void dbg_to_reg(struct kvm_vcpu *vcpu,
if (p->is_32bit)
val &= 0xffffffffUL;
 
-   *vcpu_reg(vcpu, p->Rt) = val;
+   *p->val = val;
 }
 
 static inline bool trap_bvr(struct kvm_vcpu *vcpu,
@@ -697,10 +694,10 @@ static bool trap_dbgidr(struct kvm_vcpu *vcpu,
u64 pfr = read_system_reg(SYS_ID_AA64PFR0_EL1);
u32 el3 = !!cpuid_feature_extract_field(pfr, ID_AA64PFR0_EL3_SHIFT);
 
-   *vcpu_reg(vcpu, p->Rt) = ((((dfr >> ID_AA64DFR0_WRPS_SHIFT) & 0xf) << 28) |
- (((dfr >> ID_AA64DFR0_BRPS_SHIFT) & 0xf) << 24) |
- (((dfr >> ID_AA64DFR0_CTX_CMPS_SHIFT) & 0xf) << 20) |
- (6 << 16) | (el3 << 14) | (el3 << 12));
+   *p->val = ((((dfr >> ID_AA64DFR0_WRPS_SHIFT) & 0xf) << 28) |
+  (((dfr >> ID_AA64DFR0_BRPS_SHIFT) & 0xf) << 24) |
+  (((dfr >> ID_AA64DFR0_CTX_CMPS_SHIFT) & 0xf) << 20) |
+  (6 << 16) | (el3 << 14) | (el3 << 12));
return true;
}
 }
@@ -710,10 +707,10 @@ static bool trap_debug32(struct kvm_vcpu *vcpu,
 const struct sys_reg_desc *r)
 {
if (p->is_write) {
-   vcpu_cp14(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
+   vcpu_cp14(vcpu, r->reg) = *p->val;
vcpu->arch.debug_flags |= KVM_ARM64_DEBUG_DIRTY;
} else {
-   *vcpu_reg(vcpu, p->Rt) = vcpu_cp14(vcpu, r->reg);
+   *p->val = vcpu_cp14(vcpu, r->reg);
}
 
return true;
@@ -740,12 +737,12 @@ static inline bool trap_xvr(struct kvm_vcpu *vcpu,
u64 val = *dbg_reg;
 
val &= 0xffffffffUL;
-   val |= *vcpu_reg(vcpu, p->Rt) << 32;
+   val |= *p->val << 32;
*dbg_reg = val;
 
vcpu->arch.debug_flags |= KVM_ARM64_DEBUG_DIRTY;
}

[PATCH v6 3/7] KVM: arm/arm64: Refactor vGIC attributes handling code

2015-11-24 Thread Pavel Fedin

Separate all implementation-independent code in vgic_attr_regs_access()
and move it to vgic.c. This allows the code to be reused for the vGICv3
implementation.

The vcpu lookup is left where it originally was, because the vGICv3 API will
expect an affinity ID instead of a vCPU index, so it will be done
differently. Also, the vcpu pointer has a backpointer to kvm, so 'dev' was
replaced with 'vcpu'.

Signed-off-by: Pavel Fedin 
---
 virt/kvm/arm/vgic-v2-emul.c | 120 +++-
 virt/kvm/arm/vgic.c |  57 +
 virt/kvm/arm/vgic.h |   3 ++
 3 files changed, 88 insertions(+), 92 deletions(-)

diff --git a/virt/kvm/arm/vgic-v2-emul.c b/virt/kvm/arm/vgic-v2-emul.c
index 959b9c6..8e769c6 100644
--- a/virt/kvm/arm/vgic-v2-emul.c
+++ b/virt/kvm/arm/vgic-v2-emul.c
@@ -661,38 +661,24 @@ static const struct vgic_io_range vgic_cpu_ranges[] = {
},
 };
 
-static int vgic_attr_regs_access(struct kvm_device *dev,
-struct kvm_device_attr *attr,
-__le32 *data, bool is_write)
+static int vgic_v2_attr_regs_access(struct kvm_device *dev,
+   struct kvm_device_attr *attr,
+   __le32 *data, bool is_write)
 {
-   const struct vgic_io_range *r = NULL, *ranges;
+   const struct vgic_io_range *ranges;
phys_addr_t offset;
-   int ret, cpuid, c;
-   struct kvm_vcpu *vcpu, *tmp_vcpu;
-   struct vgic_dist *vgic;
+   struct kvm_vcpu *vcpu;
+   int cpuid;
+   struct vgic_dist *vgic = &dev->kvm->arch.vgic;
struct kvm_exit_mmio mmio;
 
offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
cpuid = (attr->attr & KVM_DEV_ARM_VGIC_CPUID_MASK) >>
KVM_DEV_ARM_VGIC_CPUID_SHIFT;
 
-   mutex_lock(&dev->kvm->lock);
-
-   ret = vgic_init(dev->kvm);
-   if (ret)
-   goto out;
-
-   if (cpuid >= atomic_read(&dev->kvm->online_vcpus)) {
-   ret = -EINVAL;
-   goto out;
-   }
+   if (cpuid >= atomic_read(&dev->kvm->online_vcpus))
+   return -EINVAL;
 
-   vcpu = kvm_get_vcpu(dev->kvm, cpuid);
-   vgic = &dev->kvm->arch.vgic;
-
-   mmio.len = 4;
-   mmio.is_write = is_write;
-   mmio.data = data;
switch (attr->group) {
case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
mmio.phys_addr = vgic->vgic_dist_base + offset;
@@ -703,49 +689,16 @@ static int vgic_attr_regs_access(struct kvm_device *dev,
ranges = vgic_cpu_ranges;
break;
default:
-   BUG();
+   return -ENXIO;
}
-   r = vgic_find_range(ranges, 4, offset);
 
-   if (unlikely(!r || !r->handle_mmio)) {
-   ret = -ENXIO;
-   goto out;
-   }
-
-
-   spin_lock(&vgic->lock);
-
-   /*
-* Ensure that no other VCPU is running by checking the vcpu->cpu
-* field.  If no other VPCUs are running we can safely access the VGIC
-* state, because even if another VPU is run after this point, that
-* VCPU will not touch the vgic state, because it will block on
-* getting the vgic->lock in kvm_vgic_sync_hwstate().
-*/
-   kvm_for_each_vcpu(c, tmp_vcpu, dev->kvm) {
-   if (unlikely(tmp_vcpu->cpu != -1)) {
-   ret = -EBUSY;
-   goto out_vgic_unlock;
-   }
-   }
-
-   /*
-* Move all pending IRQs from the LRs on all VCPUs so the pending
-* state can be properly represented in the register state accessible
-* through this API.
-*/
-   kvm_for_each_vcpu(c, tmp_vcpu, dev->kvm)
-   vgic_unqueue_irqs(tmp_vcpu);
+   vcpu = kvm_get_vcpu(dev->kvm, cpuid);
 
-   offset -= r->base;
-   r->handle_mmio(vcpu, &mmio, offset);
+   mmio.len = 4;
+   mmio.is_write = is_write;
+   mmio.data = data;
 
-   ret = 0;
-out_vgic_unlock:
-   spin_unlock(&vgic->lock);
-out:
-   mutex_unlock(&dev->kvm->lock);
-   return ret;
+   return vgic_attr_regs_access(vcpu, ranges, &mmio, offset);
 }
 
 static int vgic_v2_create(struct kvm_device *dev, u32 type)
@@ -761,55 +714,38 @@ static void vgic_v2_destroy(struct kvm_device *dev)
 static int vgic_v2_set_attr(struct kvm_device *dev,
struct kvm_device_attr *attr)
 {
+   u32 __user *uaddr = (u32 __user *)(long)attr->addr;
+   u32 reg;
+   __le32 data;
int ret;
 
ret = vgic_set_common_attr(dev, attr);
if (ret != -ENXIO)
return ret;
 
-   switch (attr->group) {
-   case KVM_DEV_ARM_VGIC_GRP_DIST_REGS:
-   case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: {
-   u32 __user *uaddr = (u32 __user *)(long)attr->addr;
-   u32 reg;
-   __le32 data;
-
-   if (get_user(reg, uaddr))
-   return -EFAULT;
-
-

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Alexander Duyck


On 11/24/2015 05:38 AM, Lan Tianyu wrote:

This patchset is to propose a solution of adding live migration
support for SRIOV NIC.

During migration, Qemu needs to let VF driver in the VM to know
migration start and end. Qemu adds faked PCI migration capability
to help to sync status between two sides during migration.

Qemu triggers VF's mailbox irq via sending MSIX msg when migration
status is changed. VF driver tells Qemu its mailbox vector index
via the new PCI capability. In some cases(NIC is suspended or closed),
VF mailbox irq is freed and VF driver can disable irq injecting via
new capability.

VF driver will put down nic before migration and put up again on
the target machine.

Lan Tianyu (3):
   VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
   PCI: Add macros for faked PCI migration capability
   Ixgbevf: Add migration support for ixgbevf driver

  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
  drivers/vfio/pci/vfio_pci.c   |  21 +
  drivers/vfio/pci/vfio_pci_config.c|  38 ++--
  drivers/vfio/pci/vfio_pci_private.h   |   5 ++
  include/uapi/linux/pci_regs.h |  18 +++-
  include/uapi/linux/vfio.h |  12 +++
  7 files changed, 194 insertions(+), 7 deletions(-)


I'm still not a fan of this approach.  I really feel like this is 
something that should be resolved by extending the existing PCI hot-plug 
rather than trying to instrument this per driver.  Then you will get the 
goodness for multiple drivers and multiple OSes instead of just one.  An 
added advantage to dealing with this in the PCI hot-plug environment 
would be that you could then still do a hot-plug even if the guest 
didn't load a driver for the VF since you would be working with the PCI 
slot instead of the device itself.


- Alex

Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-24 Thread Radim Krčmář

2015-11-24 01:26+, Wu, Feng:
> "I don't think we do any vector hashing on our client parts.  This may be why 
> the customer is not able to detect this on Skylake client silicon.
> The vector hashing is micro-architectural and something we had done on server 
> parts.
> 
> If you look at the haswell server CPU spec 
> (https://www-ssl.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf)
> In section 4.1.2, you will see an IntControl register (this is a register 
> controlled/configured by BIOS) - see below.

Thank you!

> If you look at bits 6:4 in that register, you see the option we offer in 
> hardware for what kind of redirection is applied to lowest priority 
> interrupts.
> There are three options:
> 1.Fixed priority  
> 2.Redirect last 
> 3.Hash Vector
> 
> If picking vector hash, then bits 10:8 specifies the APIC-ID bits used for 
> the hashing."

The hash function just interprets a subset of vector's bits as a number
and uses that as a starting offset in a search for an enabled APIC
within the destination set?

For example:
The x2APIC destination is 0x00000055 (= first four even APICs in cluster
0), the vector is 0b1110xxxx, and bits 10:8 of IntControl are 000.

000 means that bits 7:4 of vector are selected, thus the vector hash is
0b1110 = 14, so the round-robin effectively does 14 % 4 (because we only
have 4 destinations) and delivers to the 3rd possible APIC (= ID 6)?

Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-24 Thread Paolo Bonzini



On 24/11/2015 15:35, Radim Krcmár wrote:
> > Thanks for your guys' review. Yes, we can introduce a module option
> > for it. According to Radim's comments above, we need use the
> > same policy for PI and non-PI lowest-priority interrupts, so here is the
> > question: for vector hashing, it is easy to apply it for both non-PI and PI
> > case, however, for Round-Robin, in non-PI case, the round robin counter
> > is used and updated when the interrupt is injected to guest, but for
> > PI case, the interrupt is injected to guest totally by hardware, software
> > cannot control it while interrupt delivery, we can only decide the
> > destination vCPU for the PI interrupt in the initial configuration
> > time (guest update vMSI -> QEMU -> KVM). Do you guys have any good
> > suggestion to do round robin for PI lowest-priority? Seems Round robin
> > is not a good way for PI lowest-priority interrupts. Any comments
> > are appreciated!
>
> It's meaningless to try dynamic algorithms with PI so if we allow both
> lowest priority algorithms, I'd let PI handle any lowest priority only
> with vector hashing.  (It's an ugly compromise.)

For now, I would just keep the 4.4 behavior, i.e. disable PI unless
there is a single destination || vector hashing is enabled.  We can flip
the switch later.
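
The condition above can be written down as a small predicate. This is only a
sketch of the stated rule; pi_allowed_for_lowest_prio() and its parameters
are hypothetical names, not existing KVM code.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate, not KVM code: may a lowest-priority interrupt
 * be delivered through posted interrupts (PI)?  PI stays disabled unless
 * there is a single destination or vector hashing is enabled.
 */
static bool pi_allowed_for_lowest_prio(unsigned int nr_dests,
				       bool vector_hashing)
{
	return nr_dests == 1 || vector_hashing;
}
```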

Paolo

Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-24 Thread Radim Krčmář

2015-11-24 15:31+0100, Radim Krčmář:
> 000 means that bits 7:4 of vector are selected, thus the vector hash is
> 0b1110 = 14, so the round-robin effectively does 14 % 4 (because we only
> have 4 destinations) and delivers to the 3rd possible APIC (= ID 6)?

Ah, 3rd APIC in the set has ID 4, of course :)
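
The example can be worked through in code. A minimal sketch in C, assuming
the hash is simply bits 7:4 of the vector and that the destination is a
bitmask whose bit index stands for the APIC ID; vector_hash_pick() is a
hypothetical helper, not KVM code or a description of what the hardware
actually does:

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the bit index (treated here as the APIC ID)
 * chosen from dest_mask by hashing bits 7:4 of the vector, or -1 if the
 * destination set is empty.
 */
static int vector_hash_pick(uint32_t dest_mask, uint8_t vector)
{
	unsigned int hash = (vector >> 4) & 0xf;	/* bits 7:4 of the vector */
	unsigned int count = 0, idx, bit;

	/* Count the possible destinations. */
	for (bit = 0; bit < 32; bit++)
		if (dest_mask & (1u << bit))
			count++;
	if (count == 0)
		return -1;

	/* Starting offset within the destination set. */
	idx = hash % count;

	/* Walk the set bits and return the idx-th one (counting from 0). */
	for (bit = 0; bit < 32; bit++) {
		if (!(dest_mask & (1u << bit)))
			continue;
		if (idx-- == 0)
			return (int)bit;
	}
	return -1;	/* not reached when count > 0 */
}
```

With dest_mask 0x55 (bits 0, 2, 4 and 6 set) and a vector whose bits 7:4 are
0b1110 (for instance 0xE0), the hash is 14, 14 % 4 is 2, and the helper
returns 4, matching the correction above.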

Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-24 Thread Radim Krcmár

2015-11-24 01:26+, Wu, Feng:
>> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
>> On 16/11/2015 20:03, Radim Krčmář wrote:
>> > 2015-11-09 10:46+0800, Feng Wu:
>> >> Use vector-hashing to handle lowest-priority interrupts for
>> >> posted-interrupts. As an example, modern Intel CPUs use this
>> >> method to handle lowest-priority interrupts.
>> >
>> > (I don't think it's a good idea that the algorithm differs from non-PI
>> >  lowest priority delivery.  I'd make them both vector-hashing, which
>> >  would be "fun" to explain to people expecting round robin ...)
>> 
>> Yup, I would make it a module option.  Thanks very much Radim for
>> helping with the review.
> 
> Thanks for your guys' review. Yes, we can introduce a module option
> for it. According to Radim's comments above, we need use the
> same policy for PI and non-PI lowest-priority interrupts, so here is the
> question: for vector hashing, it is easy to apply it for both non-PI and PI
> case, however, for Round-Robin, in non-PI case, the round robin counter
> is used and updated when the interrupt is injected to guest, but for
> PI case, the interrupt is injected to guest totally by hardware, software
> cannot control it while interrupt delivery, we can only decide the
> destination vCPU for the PI interrupt in the initial configuration
> time (guest update vMSI -> QEMU -> KVM). Do you guys have any good
> suggestion to do round robin for PI lowest-priority? Seems Round robin
> is not a good way for PI lowest-priority interrupts. Any comments
> are appreciated!

It's meaningless to try dynamic algorithms with PI so if we allow both
lowest priority algorithms, I'd let PI handle any lowest priority only
with vector hashing.  (It's an ugly compromise.)

Re: Trying to switch EPTP for execute-protecting guest pages

2015-11-24 Thread Estrada, Zachary J


On 11/24/2015 05:44 AM, Paolo Bonzini wrote:



On 23/11/2015 18:11, Estrada, Zachary J wrote:

I'm playing around with EPTs and kvm to track execution in the guest.
I've created a separate set of EPTs (and copied the last level entries
from the real tables, minus execute permissions) but I'm not getting
exits where I expect. I also have code in handle_ept_violation to
preserve those permissions for any non-execute ept violations.

Here is what I am calling within a VM Exit handler:
---
kvm_mmu_unload(vcpu);
vcpu->arch.mmu.root_hpa = eptp;
kvm_x86_ops->set_tdp_cr3(vcpu, eptp);
kvm_mmu_load(vcpu);
kvm_flush_remote_tlbs(vcpu->kvm);
---

I think some of this is overkill, but am I missing something? I think I
may need to flush the rmaps too, but I'm not exactly sure how.


My suggestion is:

1) use tracing and check that kvm_mmu_get_page is being called correctly.

2) there is already code for write protection.  Try copying that code
instead of doing a complete reimplementation.

Paolo



1) Will do, thanks!

2) Got it. Let's say I want to work with a copy of the extended page tables 
instead of the original, what would be the best way to do so? Right now I'm 
traversing the full tables using root_hpa, but if there's a better way using the 
spte interface, I would prefer that.


Thanks so much!
--Zak

Re: Trying to switch EPTP for execute-protecting guest pages

2015-11-24 Thread Paolo Bonzini

On 24/11/2015 15:51, Estrada, Zachary J wrote:
> 2) Got it. Let's say I want to work with a copy of the extended page
> tables instead of the original, what would be the best way to do so?

Why would you want that?  It's difficult to give an answer without
understanding what you're doing.  Notice that KVM pretty much always
leaves the X bit set (__direct_map uses ACC_ALL for the pte_access
parameter) so it's easy to go from your copy of the extended page tables
to the original.

I'm not sure if this is your problem, but perhaps you want to record in
the role whether the page comes from your version or the original?  The
role is like the hash key, if the role is the same you get the same PTE.

Paolo

> Right now I'm traversing the full tables using root_hpa, but if there's
> a better way using the spte interface, I would prefer that.

Re: KVM call for November 22th

2015-11-24 Thread Juan Quintela

Juan Quintela  wrote:
> Hi
>
> Please, send any topic that you are interested in covering.
>
> At the end of Monday I will send an email with the agenda or the
> cancellation of the call, so hurry up.
>
> After discussions on the QEMU Summit, we are going to have always open a
> KVM call where you can add topics.

As there was no agenda, the call is cancelled.

Thanks, Juan.

>
>  Call details:
>
> By popular demand, a google calendar public entry with it
>
>   
> https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
>
> (Let me know if you have any problems with the calendar entry.  I just
> gave up about getting right at the same time CEST, CET, EDT and DST).
>
> If you need phone number details,  contact me privately
>
> Thanks, Juan.

Re: Trying to switch EPTP for execute-protecting guest pages

2015-11-24 Thread Paolo Bonzini



On 23/11/2015 18:11, Estrada, Zachary J wrote:
> I'm playing around with EPTs and kvm to track execution in the guest. 
> I've created a separate set of EPTs (and copied the last level entries
> from the real tables, minus execute permissions) but I'm not getting
> exits where I expect. I also have code in handle_ept_violation to
> preserve those permissions for any non-execute ept violations.
> 
> Here is what I am calling within a VM Exit handler:
> ---
> kvm_mmu_unload(vcpu);
> vcpu->arch.mmu.root_hpa = eptp;
> kvm_x86_ops->set_tdp_cr3(vcpu, eptp);
> kvm_mmu_load(vcpu);
> kvm_flush_remote_tlbs(vcpu->kvm);
> ---
> 
> I think some of this is overkill, but am I missing something? I think I
> may need to flush the rmaps too, but I'm not exactly sure how.

My suggestion is:

1) use tracing and check that kvm_mmu_get_page is being called correctly.

2) there is already code for write protection.  Try copying that code
instead of doing a complete reimplementation.

Paolo

KVM call for 2015-12-08

2015-11-24 Thread Juan Quintela

Hi

Please, send any topic that you are interested in covering.

At the end of Monday I will send an email with the agenda or the
cancellation of the call, so hurry up.

After discussions at the QEMU Summit, we are going to always have an open
KVM call where you can add topics.

 Call details:

By popular demand, here is a public Google Calendar entry for the call:

  
https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

(Let me know if you have any problems with the calendar entry.  I just
gave up on getting CEST, CET, EDT and DST right at the same time.)

If you need phone number details, contact me privately.

Thanks, Juan.


[RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu

This patch adds migration support to the ixgbevf driver. The driver
uses a faked PCI migration capability table to communicate with Qemu
and share the migration status and the mailbox irq vector index.

Qemu notifies the VF by sending an MSIX msg that triggers the mailbox
vector during migration, and stores the migration status in the
PCI_VF_MIGRATION_VMM_STATUS reg in the new capability table.
The mailbox irq is triggered just before the stop-and-copy stage
and again after migration on the target machine.

When the VF driver detects migration, it takes the net device down and
tells Qemu it is ready for migration by writing the
PCI_VF_MIGRATION_VF_STATUS reg. After migration, it brings the net
device up again.

Qemu is in charge of migrating the PCI config space regs and MSIX config.

This patch focuses on the normal case, in which net traffic is running
and the mailbox irq is enabled. In other cases (such as the driver
not being loaded, or the adapter being suspended or closed), the mailbox
irq won't be triggered and the VF driver will disable it via the
PCI_VF_MIGRATION_CAP reg. These cases will be resolved later.
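
For readers following the handshake, here is a minimal user-space sketch of
the driver-side state machine. It is an illustration only, under these
assumptions: the struct vf_sketch type, its cfg[] array and vf_sketch_irq()
are hypothetical names, the register offsets are copied from the PCI macro
patch in this series, and the "bring the interface back up on
VMM_MIGRATION_END" branch is inferred from the description above rather than
taken from the driver code. The real driver defers this work to a workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register offsets and values as defined in patch 2/3 of this series. */
#define PCI_VF_MIGRATION_VMM_STATUS 0x05
#define PCI_VF_MIGRATION_VF_STATUS  0x06
#define VMM_MIGRATION_END           0x00
#define VMM_MIGRATION_START         0x01
#define PCI_VF_READY_FOR_MIGRATION  0x01

/* Hypothetical stand-in for the adapter plus its capability registers. */
struct vf_sketch {
    uint8_t cfg[0x10];        /* fake migration capability, cap base = 0 */
    uint8_t last_vmm_status;  /* last value seen in VMM_STATUS */
    bool in_progress;         /* mirrors MIGRATION_IN_PROGRESS */
    bool net_up;              /* whether the interface carries traffic */
};

/* Called when the mailbox irq fires; returns true if the state changed. */
static bool vf_sketch_irq(struct vf_sketch *vf)
{
    uint8_t val;

    assert(vf);
    val = vf->cfg[PCI_VF_MIGRATION_VMM_STATUS];
    if (val == vf->last_vmm_status)
        return false;         /* ordinary mailbox event, nothing to do */
    vf->last_vmm_status = val;

    if (!vf->in_progress && val == VMM_MIGRATION_START) {
        vf->in_progress = true;
        vf->net_up = false;   /* take the interface down */
        vf->cfg[PCI_VF_MIGRATION_VF_STATUS] = PCI_VF_READY_FOR_MIGRATION;
        return true;
    }
    if (vf->in_progress && val == VMM_MIGRATION_END) {
        vf->in_progress = false;
        vf->net_up = true;    /* assumed: bring the interface back up */
        return true;
    }
    return false;
}
```

Comparing against the last seen status value is what lets the same mailbox
vector serve both ordinary mailbox events and migration notifications.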

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 2 files changed, 107 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..4b8ba2f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -438,6 +438,11 @@ struct ixgbevf_adapter {
u64 bp_tx_missed;
 #endif
 
+   u8 migration_cap;
+   u8 last_migration_reg;
+   unsigned long migration_status;
+   struct work_struct migration_task;
+
u8 __iomem *io_addr; /* Mainly for iounmap use */
u32 link_speed;
bool link_up;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..95860c2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -96,6 +96,8 @@ static int debug = -1;
 module_param(debug, int, 0);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+#define MIGRATION_IN_PROGRESS  0
+
 static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter)
 {
if (!test_bit(__IXGBEVF_DOWN, &adapter->state) &&
@@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector 
*q_vector)
}
 }
 
+static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter) 
+{
+   struct pci_dev *pdev = adapter->pdev;
+   u8 val;
+
+   pci_read_config_byte(pdev,
+           adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+           &val);
+
+   if (val != adapter->last_migration_reg) {
+       schedule_work(&adapter->migration_task);
+       adapter->last_migration_reg = val;
+   }
+
+}
+
 static irqreturn_t ixgbevf_msix_other(int irq, void *data)
 {
struct ixgbevf_adapter *adapter = data;
@@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void *data)
 
hw->mac.get_link_status = 1;
 
+   ixgbevf_migration_check(adapter);
ixgbevf_service_event_schedule(adapter);
 
IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, adapter->eims_other);
@@ -1383,6 +1402,7 @@ out:
 static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
+   struct pci_dev *pdev = adapter->pdev;
int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
int vector, err;
int ri = 0, ti = 0;
@@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct 
ixgbevf_adapter *adapter)
goto free_queue_irqs;
}
 
+   if (adapter->migration_cap) {
+   pci_write_config_byte(pdev,
+   adapter->migration_cap + PCI_VF_MIGRATION_IRQ,
+   vector);
+   }
+
return 0;
 
 free_queue_irqs:
@@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct 
ixgbevf_adapter *adapter)
ixgbevf_update_stats(adapter);
 }
 
+static void ixgbevf_migration_task(struct work_struct *work)
+{
+   struct ixgbevf_adapter *adapter = container_of(work,
+   struct ixgbevf_adapter,
+   migration_task);
+   struct pci_dev *pdev = adapter->pdev;
+   struct net_device *netdev = adapter->netdev;
+   u8 val;
+
+   if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) {
+       pci_read_config_byte(pdev,
+               adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
+               &val);
+       if (val != VMM_MIGRATION_START)
+           return;
+
+       pr_info("migration start\n");
+       set_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status);
+       netif_device_detach(netdev);
+
+       if (netif_running(netdev)) {
+

[RFC PATCH V2 1/3] VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO

2015-11-24 Thread Lan Tianyu

This patch adds a new ioctl cmd, VFIO_GET_PCI_CAP_INFO, which returns
the size of a PCI cap table and finds free PCI config space regs for a
given pos and size.

Qemu will add a faked PCI capability for migration and needs this
info.
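
The free-region lookup can be sketched in user space as follows. This is
only an illustration of the scan, assuming a per-byte map of config space in
which one marker value means "no capability here". The names
find_free_cfg_region(), CFG_SPACE_SIZE and CAP_ID_FREE are hypothetical and
are not part of the VFIO API.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SPACE_SIZE 256      /* conventional PCI config space size */
#define CAP_ID_FREE    0xFF     /* assumed marker for "no capability here" */

/*
 * Return the first offset >= pos at which `size` consecutive bytes are
 * free, or 0 if no such run exists.
 */
static int find_free_cfg_region(const uint8_t *map, int pos, int size)
{
    int i, start = pos;

    assert(map && pos >= 0 && size > 0);
    for (i = pos; i < CFG_SPACE_SIZE; i++) {
        if (map[i] != CAP_ID_FREE)
            start = i + 1;          /* occupied: restart after this byte */
        else if (i - start + 1 == size)
            return start;           /* found `size` free bytes in a row */
    }
    return 0;                       /* 0 is never a valid capability offset */
}
```

Returning 0 on failure works because offset 0 lies in the standard header
and can never hold a capability.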

Signed-off-by: Lan Tianyu 
---
 drivers/vfio/pci/vfio_pci.c | 21 
 drivers/vfio/pci/vfio_pci_config.c  | 38 +++--
 drivers/vfio/pci/vfio_pci_private.h |  5 +
 include/uapi/linux/vfio.h   | 12 
 4 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 69fab0f..2e42de0 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -784,6 +784,27 @@ hot_reset_release:
 
kfree(groups);
return ret;
+   } else if (cmd == VFIO_GET_PCI_CAP_INFO) {
+   struct vfio_pci_cap_info info;
+   int offset;
+
+   if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
+   return -EFAULT;
+
+   switch (info.index) {
+   case VFIO_PCI_CAP_GET_SIZE:
+   info.size = vfio_get_cap_size(vdev, info.cap, 
info.offset);
+   break;
+   case VFIO_PCI_CAP_GET_FREE_REGION:
+   offset = vfio_find_free_pci_config_reg(vdev,
+   info.offset, info.size);
+   info.offset = offset;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return copy_to_user((void __user *)arg, &info, sizeof(info));
}
 
return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..8afbda4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -841,6 +841,21 @@ static int vfio_find_cap_start(struct vfio_pci_device 
*vdev, int pos)
return pos;
 }
 
+int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size)
+{
+   int i, offset = pos;
+
+   for (i = pos; i < PCI_CFG_SPACE_SIZE; i++) {
+   if (vdev->pci_config_map[i] != PCI_CAP_ID_INVALID)
+   offset = i + 1;
+   else if (i - offset + 1 == size)
+   return offset;
+   }
+
+   return 0;
+}
+
 static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 *val)
@@ -1199,6 +1214,20 @@ static int vfio_fill_vconfig_bytes(struct 
vfio_pci_device *vdev,
return ret;
 }
 
+int vfio_get_cap_size(struct vfio_pci_device *vdev, u8 cap, int pos)
+{
+   int len;
+
+   len = pci_cap_length[cap];
+   if (len == 0xFF) { /* Variable length */
+   len = vfio_cap_len(vdev, cap, pos);
+   if (len < 0)
+   return len;
+   }
+
+   return len;
+}
+
 static int vfio_cap_init(struct vfio_pci_device *vdev)
 {
struct pci_dev *pdev = vdev->pdev;
@@ -1238,12 +1267,9 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
return ret;
 
if (cap <= PCI_CAP_ID_MAX) {
-   len = pci_cap_length[cap];
-   if (len == 0xFF) { /* Variable length */
-   len = vfio_cap_len(vdev, cap, pos);
-   if (len < 0)
-   return len;
-   }
+   len = vfio_get_cap_size(vdev, cap, pos);
+   if (len < 0)
+   return len;
}
 
if (!len) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index ae0e1b4..91b4f9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -89,4 +89,9 @@ extern void vfio_pci_uninit_perm_bits(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
+extern int vfio_find_free_pci_config_reg(struct vfio_pci_device *vdev,
+   int pos, int size);
+extern int vfio_get_cap_size(struct vfio_pci_device *vdev,
+   u8 cap, int pos);
+
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b57b750..dfa7023 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -495,6 +495,18 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+#define VFIO_GET_PCI_CAP_INFO  _IO(VFIO_TYPE, VFIO_BASE + 22)
+struct vfio_pci_cap_info {
+   __u32   argsz;
+   __u32   flags;
+#define VFIO_PCI_CAP_GET_SIZE  (1 << 0)
+#define VFIO_PCI_CAP_GET_FREE_REGION   (1 << 1)

[RFC PATCH V2 2/3] PCI: Add macros for faked PCI migration capability

2015-11-24 Thread Lan Tianyu

This patch extends the PCI CAP IDs with a migration cap and
adds reg macros. The CAP ID is provisional; we may pick a better one if the
solution proves feasible.

*PCI_VF_MIGRATION_CAP
Lets the VF driver control whether the mailbox irq is triggered during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores the migration status in this reg.

*PCI_VF_MIGRATION_VF_STATUS
The VF driver uses this reg to tell Qemu it is ready for migration.

*PCI_VF_MIGRATION_IRQ
The VF driver stores the mailbox interrupt vector in this reg for Qemu to trigger
during migration.
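
As a reading aid, the register offsets can be pictured as a byte layout.
The sketch below is an assumption-laden illustration: the struct name, its
field names and vf_wants_mailbox_irq() are hypothetical, and bytes 2-3 are
not defined by this patch, so they are shown as unspecified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets and values copied from the macros added by this patch. */
#define PCI_VF_MIGRATION_CAP        0x04
#define PCI_VF_MIGRATION_VMM_STATUS 0x05
#define PCI_VF_MIGRATION_VF_STATUS  0x06
#define PCI_VF_MIGRATION_IRQ        0x07
#define PCI_VF_MIGRATION_ENABLE     0x01

/* Hypothetical picture of the faked capability, one byte per register. */
struct vf_migration_cap_sketch {
    uint8_t cap_id;         /* 0x00: capability ID */
    uint8_t next;           /* 0x01: next capability pointer */
    uint8_t unspecified[2]; /* 0x02-0x03: not defined by the patch */
    uint8_t control;        /* 0x04: PCI_VF_MIGRATION_CAP */
    uint8_t vmm_status;     /* 0x05: written by Qemu */
    uint8_t vf_status;      /* 0x06: written by the VF driver */
    uint8_t irq;            /* 0x07: mailbox vector index */
};

/* True when the VF driver has asked Qemu to inject the mailbox irq. */
static bool vf_wants_mailbox_irq(const struct vf_migration_cap_sketch *cap)
{
    assert(cap);
    return cap->control == PCI_VF_MIGRATION_ENABLE;
}
```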

Signed-off-by: Lan Tianyu 
---
 include/uapi/linux/pci_regs.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index efe3443..9defb6f 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -216,7 +216,8 @@
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
 #define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
-#define  PCI_CAP_ID_MAX    PCI_CAP_ID_AF
+#define  PCI_CAP_ID_MIGRATION  0X14
+#define  PCI_CAP_ID_MAX    PCI_CAP_ID_MIGRATION
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -904,4 +905,19 @@
 #define PCI_TPH_CAP_ST_SHIFT   16  /* st table shift */
 #define PCI_TPH_BASE_SIZEOF    12  /* size with no st table */
 
+/* Migration*/
+#define PCI_VF_MIGRATION_CAP   0x04
+#define PCI_VF_MIGRATION_VMM_STATUS    0x05
+#define PCI_VF_MIGRATION_VF_STATUS 0x06
+#define PCI_VF_MIGRATION_IRQ   0x07
+
+#define PCI_VF_MIGRATION_DISABLE   0x00
+#define PCI_VF_MIGRATION_ENABLE    0x01
+
+#define VMM_MIGRATION_END0x00
+#define VMM_MIGRATION_START  0x01
+
+#define PCI_VF_WAIT_FOR_MIGRATION   0x00
+#define PCI_VF_READY_FOR_MIGRATION  0x01
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


[RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-11-24 Thread Lan Tianyu

This patch adds SRIOV VF migration support.
It creates a new device type "vfio-sriov" and adds a faked PCI migration
capability to devices of that type.

The purposes of the new capability are to:
1) Sync migration status with the VF driver in the VM.
2) Get the mailbox irq vector used to notify the VF driver during migration.
3) Provide a way to control whether the irq is injected.

Qemu migrates the PCI config space regs and MSIX config for the VF.
It injects the mailbox irq at the last stage of migration to notify the VF
about the migration event, then waits for the VF driver to be ready for
migration. The VF driver writes the PCI config reg PCI_VF_MIGRATION_VF_STATUS
in the new cap table to tell Qemu.
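
The config-write path has to decide whether a guest write touches the faked
capability. A minimal sketch of that range test follows, assuming a 0x10-byte
window at the capability base as in the pci.c hunk of this patch. The
helpers sketch_ranges_overlap() and hits_migration_cap() are hypothetical
names written for this illustration and are not QEMU functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIGRATION_CAP_WINDOW 0x10   /* window size used in the pci.c hunk */

/* True when [a, a + alen) and [b, b + blen) share at least one byte. */
static bool sketch_ranges_overlap(uint32_t a, uint32_t alen,
                                  uint32_t b, uint32_t blen)
{
    assert(alen > 0 && blen > 0);
    return a < b + blen && b < a + alen;
}

/* True when a config write of `len` bytes at `addr` hits the capability. */
static bool hits_migration_cap(uint16_t cap_base, uint32_t addr, int len)
{
    /* cap_base == 0 means the capability was never added. */
    return cap_base != 0 && len > 0 &&
           sketch_ranges_overlap(addr, (uint32_t)len,
                                 cap_base, MIGRATION_CAP_WINDOW);
}
```

An overlap test, rather than a simple bounds check, also catches multi-byte
writes that start just below the capability and spill into it.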

Signed-off-by: Lan Tianyu 
---
 hw/vfio/Makefile.objs |   2 +-
 hw/vfio/pci.c |   6 ++
 hw/vfio/pci.h |   4 ++
 hw/vfio/sriov.c   | 178 ++
 4 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/sriov.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d540c9d..9cf0178 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_PCI) += pci.o sriov.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7c43fc1..e7583b5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t 
addr,
 } else if (was_enabled && !is_enabled) {
 vfio_disable_msix(vdev);
 }
+} else if (vdev->migration_cap &&
+ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
+/* Write everything to QEMU to keep emulated bits correct */
+pci_default_write_config(pdev, addr, val, len);
+vfio_migration_cap_handle(pdev, addr, val, len);
 } else {
 /* Write everything to QEMU to keep emulated bits correct */
 pci_default_write_config(pdev, addr, val, len);
@@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
 vfio_register_err_notifier(vdev);
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn(vdev);
+vfio_add_migration_capability(vdev);
 
 return 0;
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6c00575..ee6ca5e 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
 PCIHostDeviceAddress host;
 EventNotifier err_notifier;
 EventNotifier req_notifier;
+uint16_t migration_cap;
 int (*resetfn)(struct VFIOPCIDevice *);
 uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
@@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t 
addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
uint32_t val, int len);
 void vfio_enable_msix(VFIOPCIDevice *vdev);
+void vfio_add_migration_capability(VFIOPCIDevice *vdev);
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+   uint32_t val, int len);
diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
new file mode 100644
index 000..3109538
--- /dev/null
+++ b/hw/vfio/sriov.c
@@ -0,0 +1,178 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hw/hw.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+#define TYPE_VFIO_SRIOV "vfio-sriov"
+
+#define SRIOV_LM_SETUP 0x01
+#define SRIOV_LM_COMPLETE 0x02
+
+QemuEvent migration_event;
+
+static void vfio_dev_post_load(void *opaque)
+{
+struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+MSIMessage msg;
+int vector;
+
+if (vfio_pci_read_config(pdev,
+vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+!= PCI_VF_MIGRATION_ENABLE)
+return;
+
+vector = vfio_pci_read_config(pdev,
+vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+msg = msix_get_message(pdev, vector);
+kvm_irqchip_send_msi(kvm_state, msg);
+}
+
+static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
+{
+struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+int ret;
+
+if(qemu_get_byte(f)!= SRIOV_LM_COMPLETE)
+return 0;
+
+ret = pci_device_load(pdev, f);
+if (ret) {
+error_report("Faild to load PCI config space.\n");
+return ret;
+}
+
+if (msix_enabled(pdev)) {
+vfio_enable_msix(vdev);
+msix_load(pdev, f);
+}
+
+vfio_pci_write_config(pdev,vdev->migration_cap +
+PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
+vfio_pci_write_config(pdev,vdev->migration_cap +
+PCI_VF_MIGRATION_VF_STATUS, PCI_VF_WAIT_FOR_MIGRATION, 1);
+return 0;
+}
+
+static int vfio_dev_save_complete(QEMUFile *f, void *opaque)
+{
+

[RFC PATCH V2 10/10] Qemu/VFIO: Misc change for enable migration with VFIO

2015-11-24 Thread Lan Tianyu

Signed-off-by: Lan Tianyu 
---
 hw/vfio/pci.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e7583b5..404a5cd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3625,11 +3625,6 @@ static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-.name = "vfio-pci",
-.unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3637,7 +3632,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, 
void *data)
 
 dc->reset = vfio_pci_reset;
 dc->props = vfio_pci_dev_properties;
-dc->vmsd = &vfio_pci_vmstate;
 dc->desc = "VFIO-based PCI device assignment";
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 pdc->init = vfio_initfn;
-- 
1.9.3


[RFC PATCH V2 01/10] Qemu/VFIO: Create head file pci.h to share data struct.

2015-11-24 Thread Lan Tianyu

Signed-off-by: Lan Tianyu 
---
 hw/vfio/pci.c | 137 +-
 hw/vfio/pci.h | 158 ++
 2 files changed, 159 insertions(+), 136 deletions(-)
 create mode 100644 hw/vfio/pci.h

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e0e339a..5c3f8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,138 +42,7 @@
 #include "trace.h"
 #include "hw/vfio/vfio.h"
 #include "hw/vfio/vfio-common.h"
-
-struct VFIOPCIDevice;
-
-typedef struct VFIOQuirk {
-MemoryRegion mem;
-struct VFIOPCIDevice *vdev;
-QLIST_ENTRY(VFIOQuirk) next;
-struct {
-uint32_t base_offset:TARGET_PAGE_BITS;
-uint32_t address_offset:TARGET_PAGE_BITS;
-uint32_t address_size:3;
-uint32_t bar:3;
-
-uint32_t address_match;
-uint32_t address_mask;
-
-uint32_t address_val:TARGET_PAGE_BITS;
-uint32_t data_offset:TARGET_PAGE_BITS;
-uint32_t data_size:3;
-
-uint8_t flags;
-uint8_t read_flags;
-uint8_t write_flags;
-} data;
-} VFIOQuirk;
-
-typedef struct VFIOBAR {
-VFIORegion region;
-bool ioport;
-bool mem64;
-QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOBAR;
-
-typedef struct VFIOVGARegion {
-MemoryRegion mem;
-off_t offset;
-int nr;
-QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOVGARegion;
-
-typedef struct VFIOVGA {
-off_t fd_offset;
-int fd;
-VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
-} VFIOVGA;
-
-typedef struct VFIOINTx {
-bool pending; /* interrupt pending */
-bool kvm_accel; /* set when QEMU bypass through KVM enabled */
-uint8_t pin; /* which pin to pull for qemu_set_irq */
-EventNotifier interrupt; /* eventfd triggered on interrupt */
-EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
-PCIINTxRoute route; /* routing info for QEMU bypass */
-uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
-} VFIOINTx;
-
-typedef struct VFIOMSIVector {
-/*
- * Two interrupt paths are configured per vector.  The first, is only used
- * for interrupts injected via QEMU.  This is typically the non-accel path,
- * but may also be used when we want QEMU to handle masking and pending
- * bits.  The KVM path bypasses QEMU and is therefore higher performance,
- * but requires masking at the device.  virq is used to track the MSI route
- * through KVM, thus kvm_interrupt is only available when virq is set to a
- * valid (>= 0) value.
- */
-EventNotifier interrupt;
-EventNotifier kvm_interrupt;
-struct VFIOPCIDevice *vdev; /* back pointer to device */
-int virq;
-bool use;
-} VFIOMSIVector;
-
-enum {
-VFIO_INT_NONE = 0,
-VFIO_INT_INTx = 1,
-VFIO_INT_MSI  = 2,
-VFIO_INT_MSIX = 3,
-};
-
-/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
-typedef struct VFIOMSIXInfo {
-uint8_t table_bar;
-uint8_t pba_bar;
-uint16_t entries;
-uint32_t table_offset;
-uint32_t pba_offset;
-MemoryRegion mmap_mem;
-void *mmap;
-} VFIOMSIXInfo;
-
-typedef struct VFIOPCIDevice {
-PCIDevice pdev;
-VFIODevice vbasedev;
-VFIOINTx intx;
-unsigned int config_size;
-uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
-off_t config_offset; /* Offset of config space region within device fd */
-unsigned int rom_size;
-off_t rom_offset; /* Offset of ROM region within device fd */
-void *rom;
-int msi_cap_size;
-VFIOMSIVector *msi_vectors;
-VFIOMSIXInfo *msix;
-int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
-int interrupt; /* Current interrupt type */
-VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
-VFIOVGA vga; /* 0xa, 0x3b0, 0x3c0 */
-PCIHostDeviceAddress host;
-EventNotifier err_notifier;
-EventNotifier req_notifier;
-int (*resetfn)(struct VFIOPCIDevice *);
-uint32_t features;
-#define VFIO_FEATURE_ENABLE_VGA_BIT 0
-#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
-#define VFIO_FEATURE_ENABLE_REQ_BIT 1
-#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
-int32_t bootindex;
-uint8_t pm_cap;
-bool has_vga;
-bool pci_aer;
-bool req_enabled;
-bool has_flr;
-bool has_pm_reset;
-bool rom_read_failed;
-} VFIOPCIDevice;
-
-typedef struct VFIORomBlacklistEntry {
-uint16_t vendor_id;
-uint16_t device_id;
-} VFIORomBlacklistEntry;
+#include "hw/vfio/pci.h"
 
 /*
  * List of device ids/vendor ids for which to disable
@@ -193,12 +62,8 @@ static const VFIORomBlacklistEntry romblacklist[] = {
 { 0x14e4, 0x168e }
 };
 
-#define MSIX_CAP_LENGTH 12
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
-static

[RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu

This patchset proposes a solution for adding live migration
support for SRIOV NICs.

During migration, Qemu needs to let the VF driver in the VM know when
migration starts and ends. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSIX msg when the migration
status changes. The VF driver tells Qemu its mailbox vector index
via the new PCI capability. In some cases (NIC is suspended or closed),
the VF mailbox irq is freed and the VF driver can disable irq injection via
the new capability.

The VF driver will take the NIC down before migration and bring it up again on
the target machine.

Lan Tianyu (10):
  Qemu/VFIO: Create head file pci.h to share data struct.
  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  Qemu/VFIO: Rework vfio_std_cap_max_size() function
  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space
regs
  Qemu/VFIO: Expose PCI config space read/write and msix functions
  Qemu/PCI: Add macros for faked PCI migration capability
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Add save_before_stop callback to run just before stopping VCPU
during migration
  Qemu/VFIO: Add SRIOV VF migration support
  Qemu/VFIO: Misc change for enable migration with VFIO

 hw/vfio/Makefile.objs   |   2 +-
 hw/vfio/pci.c   | 196 +---
 hw/vfio/pci.h   | 168 +
 hw/vfio/sriov.c | 178 
 include/hw/pci/pci_regs.h   |  19 +
 include/migration/vmstate.h |   5 ++
 include/sysemu/sysemu.h |   1 +
 linux-headers/linux/vfio.h  |  16 
 migration/migration.c   |   3 +-
 migration/savevm.c  |  28 +++
 10 files changed, 459 insertions(+), 157 deletions(-)
 create mode 100644 hw/vfio/pci.h
 create mode 100644 hw/vfio/sriov.c

-- 
1.9.3


[RFC PATCH V2 03/10] Qemu/VFIO: Rework vfio_std_cap_max_size() function

2015-11-24 Thread Lan Tianyu

Use the new ioctl cmd VFIO_GET_PCI_CAP_INFO to get the PCI cap table size.
This gives an accurate table size and makes it easier to find free
PCI config space regs for the faked PCI capability. The current code assigns
all PCI config space regs from the start of the last PCI capability table up
to pos 0xff to that last capability, so it occupies some free PCI config
space regs.
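
To see why the gap-based estimate over-allocates, here is a stand-alone
version of the old calculation, adapted from the code removed by this patch.
The function name cap_size_by_gap() and the raw config[] array are
hypothetical stand-ins for the QEMU device state; 0x34 is the standard
capabilities pointer offset.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CAPABILITY_LIST 0x34    /* standard capabilities pointer */

/*
 * Old heuristic: a capability's size is the distance to the next
 * capability above it, or to 0xff if it is the last one.
 */
static uint8_t cap_size_by_gap(const uint8_t *config, uint8_t pos)
{
    uint8_t tmp, next = 0xff;

    assert(config);
    /* Walk the capability list; byte 1 of each entry points to the next. */
    for (tmp = config[PCI_CAPABILITY_LIST]; tmp; tmp = config[tmp + 1]) {
        if (tmp > pos && tmp < next)
            next = tmp;
    }
    return next - pos;
}
```

With capabilities at 0x40 and 0x50, the first is sized 0x10, but the last
one is sized 0xaf because it is credited with everything up to 0xff,
including the free bytes a faked capability would need.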

Signed-off-by: Lan Tianyu 
---
 hw/vfio/pci.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5c3f8a7..29845e3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2344,18 +2344,20 @@ static void vfio_unmap_bars(VFIOPCIDevice *vdev)
 /*
  * General setup
  */
-static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
+static uint8_t vfio_std_cap_max_size(VFIOPCIDevice *vdev, uint8_t cap)
 {
-uint8_t tmp, next = 0xff;
+struct vfio_pci_cap_info reg_info = {
+.argsz = sizeof(reg_info),
+.index = VFIO_PCI_CAP_GET_SIZE,
+.cap = cap
+};
+int ret;
 
-for (tmp = pdev->config[PCI_CAPABILITY_LIST]; tmp;
- tmp = pdev->config[tmp + 1]) {
-if (tmp > pos && tmp < next) {
-next = tmp;
-}
-}
+ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+if (ret || reg_info.size == 0)
+error_report("vfio: Failed to find free PCI config reg: %m\n");
 
-return next - pos;
+return reg_info.size;
 }
 
 static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
@@ -2521,7 +2523,7 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t 
pos)
  * Since QEMU doesn't actually handle many of the config accesses,
  * exact size doesn't seem worthwhile.
  */
-size = vfio_std_cap_max_size(pdev, pos);
+size = vfio_std_cap_max_size(vdev, cap_id);
 
 /*
  * pci_add_capability always inserts the new capability at the head
-- 
1.9.3


[RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition

2015-11-24 Thread Lan Tianyu

Signed-off-by: Lan Tianyu 
---
 linux-headers/linux/vfio.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 0508d0b..732b0bd 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+
+#define VFIO_FIND_FREE_PCI_CONFIG_REG   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+#define VFIO_GET_PCI_CAP_INFO   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+struct vfio_pci_cap_info {
+__u32 argsz;
+__u32 flags;
+#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
+#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
+__u32 index;
+__u32 offset;
+__u32 size;
+__u8 cap;
+};
+
 /* * */
 
 #endif /* VFIO_H */
-- 
1.9.3


[PATCH v2 1/3] vfio: Introduce map and unmap operations

2015-11-24 Thread Pavel Fedin

These new functions allow direct mapping and unmapping of addresses on the
given IOMMU. They will be used for mapping MSI hardware.

Signed-off-by: Pavel Fedin 
---
 drivers/vfio/vfio_iommu_type1.c | 29 +
 include/linux/vfio.h|  4 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 59d47cb..17506eb 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -558,6 +558,33 @@ unwind:
return ret;
 }
 
+static int vfio_iommu_type1_map(void *iommu_data, dma_addr_t iova,
+ phys_addr_t paddr, long npage, int prot)
+{
+   struct vfio_iommu *iommu = iommu_data;
+   int ret;
+
+   mutex_lock(&iommu->lock);
+   ret = vfio_iommu_map(iommu, iova, paddr >> PAGE_SHIFT, npage, prot);
+   mutex_unlock(&iommu->lock);
+
+   return ret;
+}
+
+static void vfio_iommu_type1_unmap(void *iommu_data, dma_addr_t iova,
+  long npage)
+{
+   struct vfio_iommu *iommu = iommu_data;
+   struct vfio_domain *d;
+
+   mutex_lock(&iommu->lock);
+
+   list_for_each_entry_reverse(d, &iommu->domain_list, next)
+       iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
+
+   mutex_unlock(&iommu->lock);
+}
+
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
   struct vfio_iommu_type1_dma_map *map)
 {
@@ -1046,6 +1073,8 @@ static const struct vfio_iommu_driver_ops 
vfio_iommu_driver_ops_type1 = {
.ioctl  = vfio_iommu_type1_ioctl,
.attach_group   = vfio_iommu_type1_attach_group,
.detach_group   = vfio_iommu_type1_detach_group,
+   .map= vfio_iommu_type1_map,
+   .unmap  = vfio_iommu_type1_unmap,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 610a86a..061038a 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -75,7 +75,9 @@ struct vfio_iommu_driver_ops {
struct iommu_group *group);
void(*detach_group)(void *iommu_data,
struct iommu_group *group);
-
+   int (*map)(void *iommu_data, dma_addr_t iova,
+  phys_addr_t paddr, long npage, int prot);
+   void(*unmap)(void *iommu_data, dma_addr_t iova, long npage);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
-- 
2.4.4


[RFC PATCH V2 07/10] Qemu: Add post_load_state() to run after restoring CPU state

2015-11-24 Thread Lan Tianyu

After migration, Qemu needs to trigger the mailbox irq to notify the VF driver
in the guest about the status change. Irq delivery only resumes after the
CPU state is restored. This patch adds a new callback that runs after
the CPU state is restored, which provides a way to trigger the mailbox irq
later.
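
The dispatch pattern is simple: walk the registered handlers and call the
hook only where one is set. A minimal sketch, assuming a plain singly linked
list in place of the QEMU handler queue; struct handler_sketch,
run_post_load() and count_call() are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler entry; the hook is optional and may be NULL. */
struct handler_sketch {
    void (*post_load_state)(void *opaque);
    void *opaque;
    struct handler_sketch *next;
};

/* Example hook: counts how often it has been invoked. */
static void count_call(void *opaque)
{
    assert(opaque);
    ++*(int *)opaque;
}

/* Call every registered hook once; return how many hooks ran. */
static int run_post_load(struct handler_sketch *head)
{
    int called = 0;

    for (struct handler_sketch *h = head; h; h = h->next) {
        if (!h->post_load_state)
            continue;               /* handler did not opt in */
        h->post_load_state(h->opaque);
        called++;
    }
    return called;
}
```

Making the hook optional keeps existing handlers unchanged; only devices
that need a late notification have to fill it in.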

Signed-off-by: Lan Tianyu 
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c  | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
 int (*save_live_setup)(QEMUFile *f, void *opaque);
 uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t 
max_size);
 
+/* This runs after restoring CPU related state */
+void (*post_load_state)(void *opaque);
 LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
 return false;
 }
 
+void qemu_savevm_post_load(void)
+{
+SaveStateEntry *se;
+
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || !se->ops->post_load_state) {
+continue;
+}
+
+se->ops->post_load_state(se->opaque);
+}
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
 trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
 }
 
 cpu_synchronize_all_post_init();
+qemu_savevm_post_load();
 
 ret = 0;
 
-- 
1.9.3


[PATCH v2 0/3] Introduce MSI hardware mapping for VFIO

2015-11-24 Thread Pavel Fedin

On some architectures (e.g. ARM64), if a device sits behind an IOMMU and
is mapped by VFIO, the MSI translation register must also be mapped for
interrupts to work. This series implements the necessary API and uses it
for the GICv3 ITS on ARM64.

v1 => v2:
- Added a dependency on CONFIG_GENERIC_MSI_IRQ_DOMAIN in some parts of the
  code, which should fix the build without this option

Pavel Fedin (3):
  vfio: Introduce map and unmap operations
  gicv3, its: Introduce VFIO map and unmap operations
  vfio: Introduce generic MSI mapping operations

 drivers/irqchip/irq-gic-v3-its.c   |  31 ++
 drivers/vfio/pci/vfio_pci_intrs.c  |  11 
 drivers/vfio/vfio.c| 116 +
 drivers/vfio/vfio_iommu_type1.c|  29 ++
 include/linux/irqchip/arm-gic-v3.h |   2 +
 include/linux/msi.h|  12 
 include/linux/vfio.h   |  17 +-
 7 files changed, 217 insertions(+), 1 deletion(-)

-- 
2.4.4


[RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu

This patchset proposes a solution for adding live migration
support for SRIOV NICs.

During migration, Qemu needs to let the VF driver in the VM know when
migration starts and ends. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSIX message when the
migration status changes. The VF driver tells Qemu its mailbox vector
index via the new PCI capability. In some cases (NIC is suspended or
closed), the VF mailbox irq is freed, and the VF driver can disable irq
injection via the new capability.

The VF driver will bring the NIC down before migration and bring it up
again on the target machine.
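
The handshake described above can be pictured with a small user-space model.
This is only a sketch under assumptions: the register offsets and values are
the ones posted in Qemu patch 06/10 of the companion series, while
`struct mig_cap`, `vmm_set_status()` and `vf_handle_mailbox()` are
hypothetical names invented for illustration and do not appear in the patches.
Clearing the readiness flag on every status change is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register offsets and values, as posted in Qemu patch 06/10. */
#define PCI_VF_MIGRATION_CAP         0x04
#define PCI_VF_MIGRATION_VMM_STATUS  0x05
#define PCI_VF_MIGRATION_VF_STATUS   0x06
#define PCI_VF_MIGRATION_IRQ         0x07
#define PCI_VF_MIGRATION_CAP_SIZE    0x08

#define VMM_MIGRATION_END            0x00
#define VMM_MIGRATION_START          0x01
#define PCI_VF_WAIT_FOR_MIGRATION    0x00
#define PCI_VF_READY_FOR_MIGRATION   0x01
#define PCI_VF_MIGRATION_DISABLE     0x00
#define PCI_VF_MIGRATION_ENABLE      0x01

/* Hypothetical model of the faked capability: one byte per register. */
struct mig_cap {
    uint8_t regs[PCI_VF_MIGRATION_CAP_SIZE];
};

/*
 * VMM side (hypothetical helper): record the new migration status.
 * Returns true when the VF driver has enabled irq injection, i.e. when
 * the VMM should go on to raise the vector stored in PCI_VF_MIGRATION_IRQ.
 */
static bool vmm_set_status(struct mig_cap *cap, uint8_t status)
{
    assert(status == VMM_MIGRATION_START || status == VMM_MIGRATION_END);
    cap->regs[PCI_VF_MIGRATION_VMM_STATUS] = status;
    /* Assumption: the readiness flag is cleared on every status change. */
    cap->regs[PCI_VF_MIGRATION_VF_STATUS] = PCI_VF_WAIT_FOR_MIGRATION;
    return cap->regs[PCI_VF_MIGRATION_CAP] == PCI_VF_MIGRATION_ENABLE;
}

/*
 * VF driver side (hypothetical helper): called from the mailbox irq
 * handler. Takes the link down at migration start and reports readiness;
 * brings the link back up once migration has ended.
 */
static void vf_handle_mailbox(struct mig_cap *cap, bool *link_up)
{
    if (cap->regs[PCI_VF_MIGRATION_VMM_STATUS] == VMM_MIGRATION_START) {
        *link_up = false;
        cap->regs[PCI_VF_MIGRATION_VF_STATUS] = PCI_VF_READY_FOR_MIGRATION;
    } else {
        *link_up = true;
    }
}
```

In this model the VMM would poll PCI_VF_MIGRATION_VF_STATUS after a start
notification, and skip the notification entirely when `vmm_set_status()`
returns false because the driver has disabled injection.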

Lan Tianyu (3):
  VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
  PCI: Add macros for faked PCI migration capability
  Ixgbevf: Add migration support for ixgbevf driver

 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
 drivers/vfio/pci/vfio_pci.c   |  21 +
 drivers/vfio/pci/vfio_pci_config.c|  38 ++--
 drivers/vfio/pci/vfio_pci_private.h   |   5 ++
 include/uapi/linux/pci_regs.h |  18 +++-
 include/uapi/linux/vfio.h |  12 +++
 7 files changed, 194 insertions(+), 7 deletions(-)

-- 
1.8.4.rc0.1.g8f6a3e5.dirty


[RFC PATCH V2 08/10] Qemu: Add save_before_stop callback to run just before stopping VCPU during migration

2015-11-24 Thread Lan Tianyu

This patch adds a callback that is called just before stopping the VCPU.
VF migration uses it to trigger the mailbox irq as late as possible in order
to decrease service downtime.

Signed-off-by: Lan Tianyu 
---
 include/migration/vmstate.h |  3 +++
 include/sysemu/sysemu.h |  1 +
 migration/migration.c   |  3 ++-
 migration/savevm.c  | 13 +
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index dc681a6..093faf1 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -58,6 +58,9 @@ typedef struct SaveVMHandlers {
 
 /* This runs after restoring CPU related state */
 void (*post_load_state)(void *opaque);
+
+/* This runs before stopping VCPU */
+void (*save_before_stop)(QEMUFile *f, void *opaque);
 LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index df80951..3d0d72c 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -84,6 +84,7 @@ void qemu_announce_self(void);
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
  const MigrationParams *params);
+void qemu_savevm_save_before_stop(QEMUFile *f);
 void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index c6ac08a..fccadea 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -759,7 +759,6 @@ int64_t migrate_xbzrle_cache_size(void)
 }
 
 /* migration thread support */
-
 static void *migration_thread(void *opaque)
 {
 MigrationState *s = opaque;
@@ -788,6 +787,8 @@ static void *migration_thread(void *opaque)
 } else {
 int ret;
 
+qemu_savevm_save_before_stop(s->file);
+
 qemu_mutex_lock_iothread();
 start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
diff --git a/migration/savevm.c b/migration/savevm.c
index 48b6223..c2e4802 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -715,6 +715,19 @@ void qemu_savevm_post_load(void)
 }
 }
 
+void qemu_savevm_save_before_stop(QEMUFile *f)
+{
+SaveStateEntry *se;
+
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || !se->ops->save_before_stop) {
+continue;
+}
+   
+se->ops->save_before_stop(f, se->opaque);
+}
+}
+
 
 void qemu_savevm_state_header(QEMUFile *f)
 {
-- 
1.9.3


Re: [PATCH 3/3] vfio: Introduce generic MSI mapping operations

2015-11-24 Thread kbuild test robot

Hi Pavel,

[auto build test ERROR on tip/irq/core]
[also build test ERROR on v4.4-rc2 next-20151124]

url:
https://github.com/0day-ci/linux/commits/Pavel-Fedin/Introduce-MSI-hardware-mapping-for-VFIO/20151124-155050
config: powerpc-allmodconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   drivers/vfio/vfio.c: In function 'vfio_device_map_msi':
>> drivers/vfio/vfio.c:908:11: error: dereferencing pointer to incomplete type 'struct msi_domain_info'
 if (!info->ops->vfio_map)
  ^
   drivers/vfio/vfio.c: In function 'msi_release':
   drivers/vfio/vfio.c:958:6: error: dereferencing pointer to incomplete type 'struct msi_domain_info'
 info->ops->vfio_unmap(vmsi->domain, container->iommu_driver->ops,
 ^
   drivers/vfio/vfio.c: In function 'vfio_device_unmap_msi':
   drivers/vfio/vfio.c:976:11: error: dereferencing pointer to incomplete type 'struct msi_domain_info'
 if (!info->ops->vfio_unmap)
  ^

vim +908 drivers/vfio/vfio.c

   902  struct vfio_msi *vmsi;
   903  int ret;
   904  
   905  if (!msi_domain)
   906  return 0;
   907  info = msi_domain->host_data;
 > 908  if (!info->ops->vfio_map)
   909  return 0;
   910  
   911  device = dev_get_drvdata(dev);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: Binary data

[PATCH v2 2/3] gicv3, its: Introduce VFIO map and unmap operations

2015-11-24 Thread Pavel Fedin

These new functions use the supplied IOMMU in order to map and unmap MSI
translation register(s).

Signed-off-by: Pavel Fedin 
---
 drivers/irqchip/irq-gic-v3-its.c   | 31 +++
 include/linux/irqchip/arm-gic-v3.h |  2 ++
 include/linux/msi.h| 12 
 3 files changed, 45 insertions(+)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index e23d1d1..b97dfd7 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1257,8 +1258,38 @@ out:
return 0;
 }
 
+#if IS_ENABLED(CONFIG_VFIO)
+
+static int its_vfio_map(struct irq_domain *domain,
+   const struct vfio_iommu_driver_ops *ops,
+   void *iommu_data)
+{
+   struct msi_domain_info *msi_info = msi_get_domain_info(domain);
+   struct its_node *its = msi_info->data;
+   u64 addr = its->phys_base + GIC_V3_ITS_CONTROL_SIZE;
+
+   return ops->map(iommu_data, addr, addr, 1, IOMMU_READ|IOMMU_WRITE);
+}
+
+static void its_vfio_unmap(struct irq_domain *domain,
+  const struct vfio_iommu_driver_ops *ops,
+  void *iommu_data)
+{
+   struct msi_domain_info *msi_info = msi_get_domain_info(domain);
+   struct its_node *its = msi_info->data;
+   u64 addr = its->phys_base + GIC_V3_ITS_CONTROL_SIZE;
+
+   ops->unmap(iommu_data, addr, 1);
+}
+
+#endif
+
 static struct msi_domain_ops its_msi_domain_ops = {
.msi_prepare= its_msi_prepare,
+#if IS_ENABLED(CONFIG_VFIO)
+   .vfio_map   = its_vfio_map,
+   .vfio_unmap = its_vfio_unmap,
+#endif
 };
 
 static int its_irq_gic_domain_alloc(struct irq_domain *domain,
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index bff3eee..dfd2bed 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -241,6 +241,8 @@
 #define GITS_BASER_TYPE_RESERVED6  6
 #define GITS_BASER_TYPE_RESERVED7  7
 
+#define GIC_V3_ITS_CONTROL_SIZE0x1
+
 /*
  * ITS commands
  */
diff --git a/include/linux/msi.h b/include/linux/msi.h
index f71a25e..48faea9 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -155,6 +155,8 @@ void arch_restore_msi_irqs(struct pci_dev *dev);
 void default_teardown_msi_irqs(struct pci_dev *dev);
 void default_restore_msi_irqs(struct pci_dev *dev);
 
+struct vfio_iommu_driver_ops;
+
 struct msi_controller {
struct module *owner;
struct device *dev;
@@ -189,6 +191,8 @@ struct msi_domain_info;
  * @msi_finish:Optional callbacl to finalize the allocation
  * @set_desc:  Set the msi descriptor for an interrupt
  * @handle_error:  Optional error handler if the allocation fails
+ * @vfio_map:  Map the MSI hardware for VFIO
+ * @vfio_unmap:Unmap the MSI hardware for VFIO
  *
  * @get_hwirq, @msi_init and @msi_free are callbacks used by
  * msi_create_irq_domain() and related interfaces
@@ -218,6 +222,14 @@ struct msi_domain_ops {
struct msi_desc *desc);
int (*handle_error)(struct irq_domain *domain,
struct msi_desc *desc, int error);
+#if IS_ENABLED(CONFIG_VFIO)
+   int (*vfio_map)(struct irq_domain *domain,
+   const struct vfio_iommu_driver_ops *ops,
+   void *iommu_data);
+   void (*vfio_unmap)(struct irq_domain *domain,
+ const struct vfio_iommu_driver_ops *ops,
+ void *iommu_data);
+#endif
 };
 
 /**
-- 
2.4.4


[RFC PATCH V2 05/10] Qemu/VFIO: Expose PCI config space read/write and msix functions

2015-11-24 Thread Lan Tianyu

Signed-off-by: Lan Tianyu 
---
 hw/vfio/pci.c | 6 +++---
 hw/vfio/pci.h | 4 
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d0354a0..7c43fc1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -613,7 +613,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
 }
 }
 
-static void vfio_enable_msix(VFIOPCIDevice *vdev)
+void vfio_enable_msix(VFIOPCIDevice *vdev)
 {
 vfio_disable_interrupts(vdev);
 
@@ -1931,7 +1931,7 @@ static void vfio_bar_quirk_free(VFIOPCIDevice *vdev, int nr)
 /*
  * PCI config space
  */
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
 VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
 uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
@@ -1964,7 +1964,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 return val;
 }
 
-static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
   uint32_t val, int len)
 {
 VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6083300..6c00575 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -158,3 +158,7 @@ typedef struct VFIORomBlacklistEntry {
 #define MSIX_CAP_LENGTH 12
 
 uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+   uint32_t val, int len);
+void vfio_enable_msix(VFIOPCIDevice *vdev);
-- 
1.9.3


[RFC PATCH V2 04/10] Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs

2015-11-24 Thread Lan Tianyu

This patch adds an ioctl wrapper to find free PCI config space regs.

Signed-off-by: Lan Tianyu 
---
 hw/vfio/pci.c | 19 +++
 hw/vfio/pci.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 29845e3..d0354a0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2508,6 +2508,25 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
 }
 }
 
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size)
+{
+struct vfio_pci_cap_info reg_info = {
+.argsz = sizeof(reg_info),
+.offset = pos,
+.index = VFIO_PCI_CAP_GET_FREE_REGION,
+.size = size,
+};
+int ret;
+
+ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+if (ret || reg_info.offset == 0) { 
+error_report("vfio: Failed to find free PCI config reg: %m\n");
+return -EFAULT;
+}
+
+return reg_info.offset; 
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
 PCIDevice *pdev = &vdev->pdev;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9f360bf..6083300 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -156,3 +156,5 @@ typedef struct VFIORomBlacklistEntry {
 } VFIORomBlacklistEntry;
 
 #define MSIX_CAP_LENGTH 12
+
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
-- 
1.9.3


[RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability

2015-11-24 Thread Lan Tianyu

This patch extends the PCI CAP IDs with a migration cap and
adds reg macros. The CAP ID is a trial value, and we may find a better one if
the solution is feasible.

*PCI_VF_MIGRATION_CAP
Lets the VF driver control whether the mailbox irq is triggered during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores the migration status in this reg.

*PCI_VF_MIGRATION_VF_STATUS
The VF driver tells Qemu that it is ready for migration.

*PCI_VF_MIGRATION_IRQ
The VF driver stores the mailbox interrupt vector in this reg for Qemu to
trigger during migration.

Signed-off-by: Lan Tianyu 
---
 include/hw/pci/pci_regs.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 57e8c80..0dcaf7e 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -213,6 +213,7 @@
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
 #define  PCI_CAP_ID_SATA   0x12/* Serial ATA */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
+#define  PCI_CAP_ID_MIGRATION   0x14 
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -716,4 +717,22 @@
 #define PCI_ACS_CTRL   0x06/* ACS Control Register */
 #define PCI_ACS_EGRESS_CTL_V   0x08/* ACS Egress Control Vector */
 
+/* Migration */
+#define PCI_VF_MIGRATION_CAP         0x04
+#define PCI_VF_MIGRATION_VMM_STATUS  0x05
+#define PCI_VF_MIGRATION_VF_STATUS   0x06
+#define PCI_VF_MIGRATION_IRQ         0x07
+
+#define PCI_VF_MIGRATION_CAP_SIZE    0x08
+
+#define VMM_MIGRATION_END            0x00
+#define VMM_MIGRATION_START          0x01
+
+#define PCI_VF_WAIT_FOR_MIGRATION    0x00
+#define PCI_VF_READY_FOR_MIGRATION   0x01
+
+#define PCI_VF_MIGRATION_DISABLE     0x00
+#define PCI_VF_MIGRATION_ENABLE      0x01
+
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.9.3


[PATCH v2 3/3] vfio: Introduce generic MSI mapping operations

2015-11-24 Thread Pavel Fedin

These operations are used in order to map and unmap MSI translation
registers for the device, allowing it to send MSIs to the host while being
mapped via IOMMU.

Usage of MSI controllers is tracked on a per-device basis using reference
counting. An MSI controller remains mapped as long as there's at least one
device referring to it using MSI.
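
The reference-counting scheme described above can be illustrated in isolation.
The following is a user-space sketch, not the kernel code from this patch:
`struct msi_ref`, `struct container`, `container_map_msi()` and
`container_unmap_msi()` are hypothetical names, a fixed-size array stands in
for the kernel list and kref, and two counters stand in for the real IOMMU
map and unmap calls.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MSI_CTRL 4   /* arbitrary limit for this sketch */

/* Hypothetical stand-in for one tracked MSI controller. */
struct msi_ref {
    const void *domain;   /* identifies the MSI controller */
    unsigned int refs;    /* devices currently using it; 0 means slot free */
};

/* Hypothetical stand-in for the per-container bookkeeping. */
struct container {
    struct msi_ref msi[MAX_MSI_CTRL];
    unsigned int hw_maps;     /* times the IOMMU mapping was created */
    unsigned int hw_unmaps;   /* times the IOMMU mapping was torn down */
};

/* Take a reference; only the first user creates the mapping. */
static int container_map_msi(struct container *c, const void *domain)
{
    struct msi_ref *free_slot = NULL;

    assert(c != NULL && domain != NULL);
    for (size_t i = 0; i < MAX_MSI_CTRL; i++) {
        if (c->msi[i].refs && c->msi[i].domain == domain) {
            c->msi[i].refs++;          /* already mapped: just count it */
            return 0;
        }
        if (!c->msi[i].refs && !free_slot)
            free_slot = &c->msi[i];
    }
    if (!free_slot)
        return -1;                     /* no room to track another controller */

    free_slot->domain = domain;
    free_slot->refs = 1;
    c->hw_maps++;                      /* stands in for the real map call */
    return 0;
}

/* Drop a reference; only the last user removes the mapping. */
static void container_unmap_msi(struct container *c, const void *domain)
{
    assert(c != NULL && domain != NULL);
    for (size_t i = 0; i < MAX_MSI_CTRL; i++) {
        if (c->msi[i].refs && c->msi[i].domain == domain) {
            if (--c->msi[i].refs == 0)
                c->hw_unmaps++;        /* stands in for the real unmap call */
            return;
        }
    }
}
```

Two devices sharing one controller therefore trigger a single map on the first
enable and a single unmap on the last disable.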

Signed-off-by: Pavel Fedin 
---
 drivers/vfio/pci/vfio_pci_intrs.c |  11 
 drivers/vfio/vfio.c   | 116 ++
 include/linux/vfio.h  |  13 +
 3 files changed, 140 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3b3ba15..3c8be59 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -259,12 +259,19 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
if (!vdev->ctx)
return -ENOMEM;
 
+   ret = vfio_device_map_msi(&pdev->dev);
+   if (ret) {
+   kfree(vdev->ctx);
+   return ret;
+   }
+
if (msix) {
int i;
 
vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
 GFP_KERNEL);
if (!vdev->msix) {
+   vfio_device_unmap_msi(&pdev->dev);
kfree(vdev->ctx);
return -ENOMEM;
}
@@ -277,6 +284,7 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
if (ret > 0)
pci_disable_msix(pdev);
kfree(vdev->msix);
+   vfio_device_unmap_msi(&pdev->dev);
kfree(vdev->ctx);
return ret;
}
@@ -285,6 +293,7 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
if (ret < nvec) {
if (ret > 0)
pci_disable_msi(pdev);
+   vfio_device_unmap_msi(&pdev->dev);
kfree(vdev->ctx);
return ret;
}
@@ -413,6 +422,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
} else
pci_disable_msi(pdev);
 
+   vfio_device_unmap_msi(&pdev->dev);
+
vdev->irq_type = VFIO_PCI_NUM_IRQS;
vdev->num_ctx = 0;
kfree(vdev->ctx);
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index de632da..37d99f5 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -21,9 +21,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -63,6 +65,8 @@ struct vfio_container {
struct vfio_iommu_driver*iommu_driver;
void*iommu_data;
boolnoiommu;
+   struct list_headmsi_list;
+   struct mutexmsi_lock;
 };
 
 struct vfio_unbound_dev {
@@ -97,6 +101,13 @@ struct vfio_device {
void*device_data;
 };
 
+struct vfio_msi {
+   struct kref kref;
+   struct list_headmsi_next;
+   struct vfio_container   *container;
+   struct irq_domain   *domain;
+};
+
 #ifdef CONFIG_VFIO_NOIOMMU
 static bool noiommu __read_mostly;
 module_param_named(enable_unsafe_noiommu_support,
@@ -882,6 +893,109 @@ void *vfio_device_data(struct vfio_device *device)
 }
 EXPORT_SYMBOL_GPL(vfio_device_data);
 
+#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+
+int vfio_device_map_msi(struct device *dev)
+{
+   struct irq_domain *msi_domain = dev_get_msi_domain(dev);
+   struct msi_domain_info *info;
+   struct vfio_device *device;
+   struct vfio_container *container;
+   struct vfio_msi *vmsi;
+   int ret;
+
+   if (!msi_domain)
+   return 0;
+   info = msi_domain->host_data;
+   if (!info->ops->vfio_map)
+   return 0;
+
+   device = dev_get_drvdata(dev);
+   container = device->group->container;
+
+   if (!container->iommu_driver->ops->map)
+   return -EINVAL;
+
+   mutex_lock(&container->msi_lock);
+
+   list_for_each_entry(vmsi, &container->msi_list, msi_next) {
+   if (vmsi->domain == msi_domain) {
+   kref_get(&vmsi->kref);
+   mutex_unlock(&container->msi_lock);
+   return 0;
+   }
+   }
+
+   vmsi = kmalloc(sizeof(*vmsi), GFP_KERNEL);
+   if (!vmsi) {
+   mutex_unlock(&container->msi_lock);
+   return -ENOMEM;
+   }
+
+   ret = info->ops->vfio_map(msi_domain, container->iommu_driver->ops,
+ container->iommu_data);
+   if (ret) {
+   mutex_unlock(&container->msi_lock);
+   kfree(vmsi);
+   return ret;
+

[PATCH 0/3] KVM: arm/arm64: Fix some more timer related issues

2015-11-24 Thread Christoffer Dall

This little series addresses some problems we've been observing with the
arch timer.  First, we were fiddling with a PPI timer interrupt in a
preemptible section, which is bad for obvious reasons.  Second, we
were clearing the physical active state when we shouldn't.  Third, we
can simplify the vgic code by just considering the LR state instead of
the GIC physical state on guest return.

Christoffer Dall (3):
  KVM: arm/arm64: Fix preemptible timer active state crazyness
  KVM: arm/arm64: arch_timer: Preserve physical dist. active state on
LR.active
  KVM: arm/arm64: vgic: Trust the LR state for HW IRQs

 arch/arm/kvm/arm.c|  7 +--
 include/kvm/arm_vgic.h|  2 +-
 virt/kvm/arm/arch_timer.c | 28 +++--
 virt/kvm/arm/vgic.c   | 53 ---
 4 files changed, 46 insertions(+), 44 deletions(-)

-- 
2.1.2.330.g565301e.dirty


[PATCH 3/3] KVM: arm/arm64: vgic: Trust the LR state for HW IRQs

2015-11-24 Thread Christoffer Dall

We were probing the physical distributor state for the active state of a
HW virtual IRQ, because we had seen evidence that the LR state was not
cleared when the guest deactivated a virtual interrupt.

However, this issue turned out to be a software bug in the GIC, which
was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29
(KVM: arm/arm64: arch_timer: Preserve physical dist. active
state on LR.active, 2015-11-24)

Therefore, get rid of the complexities and just look at the LR.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 9002f0d..55cd7e3 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1420,25 +1420,13 @@ static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
 static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
 {
    struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
-   struct irq_phys_map *map;
-   bool phys_active;
bool level_pending;
-   int ret;
 
if (!(vlr.state & LR_HW))
return false;
 
-   map = vgic_irq_map_search(vcpu, vlr.irq);
-   BUG_ON(!map);
-
-   ret = irq_get_irqchip_state(map->irq,
-   IRQCHIP_STATE_ACTIVE,
-   &phys_active);
-
-   WARN_ON(ret);
-
-   if (phys_active)
-   return 0;
+   if (vlr.state & LR_STATE_ACTIVE)
+   return false;
 
    spin_lock(&dist->lock);
level_pending = process_queued_irq(vcpu, lr, vlr);
-- 
2.1.2.330.g565301e.dirty


[PATCH 1/3] KVM: arm/arm64: Fix preemptible timer active state crazyness

2015-11-24 Thread Christoffer Dall

We were setting the physical active state on the GIC distributor in a
preemptible section, which could cause us to set the active state on a
different physical CPU from the one we were actually going to run on;
havoc ensues.

Since we are no longer descheduling/scheduling soft timers in the
flush/sync timer functions, simply move the timer flush into a
non-preemptible section.

Signed-off-by: Christoffer Dall 
---
 arch/arm/kvm/arm.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index eab83b2..e06fd29 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -564,17 +564,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
vcpu_sleep(vcpu);
 
/*
-* Disarming the background timer must be done in a
-* preemptible context, as this call may sleep.
-*/
-   kvm_timer_flush_hwstate(vcpu);
-
-   /*
 * Preparing the interrupts to be injected also
 * involves poking the GIC, which must be done in a
 * non-preemptible context.
 */
preempt_disable();
+   kvm_timer_flush_hwstate(vcpu);
kvm_vgic_flush_hwstate(vcpu);
 
local_irq_disable();
-- 
2.1.2.330.g565301e.dirty

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active

2015-11-24 Thread Christoffer Dall

We were incorrectly removing the active state from the physical
distributor on the timer interrupt when the timer output level was
deasserted.  We shouldn't be doing this without considering the virtual
interrupt's active state, because the architecture requires that when an
LR has the HW bit set and the pending or active bits set, then the
physical interrupt must also have the corresponding bits set.

This addresses an issue where we have been observing an inconsistency
between the LR state and the physical distributor state where the LR
state was active and the physical distributor was not active, which
shouldn't happen.
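
The rule this patch enforces can be stated as a small predicate. The following
is a simplified user-space sketch, not the kernel implementation:
`struct lr`, `virq_active_in_lrs()` and `timer_phys_active()` are hypothetical
names, the LR_STATE_ACTIVE bit value is illustrative only, and the fallback to
the distributor's per-CPU active bitmap that the patch also performs is left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative state bit; the real LR encoding depends on the GIC version. */
#define LR_STATE_ACTIVE (1u << 1)

/* Hypothetical, simplified view of one list register. */
struct lr {
    int irq;              /* virtual interrupt number held in this LR */
    unsigned int state;   /* state bits, e.g. LR_STATE_ACTIVE */
};

/* Is the given virtual interrupt active in any list register? */
static bool virq_active_in_lrs(const struct lr *lrs, size_t nr_lr, int virt_irq)
{
    assert(lrs != NULL || nr_lr == 0);
    for (size_t i = 0; i < nr_lr; i++) {
        if (lrs[i].irq == virt_irq && (lrs[i].state & LR_STATE_ACTIVE))
            return true;
    }
    return false;
}

/*
 * The physical interrupt must be marked active when either the timer
 * output level is asserted or the virtual interrupt is still active in
 * an LR; only when both are false may the active state be cleared.
 */
static bool timer_phys_active(bool level, const struct lr *lrs, size_t nr_lr,
                              int virt_irq)
{
    return level || virq_active_in_lrs(lrs, nr_lr, virt_irq);
}
```

The previous behaviour corresponds to returning `level` alone, which is what
allowed the LR to be active while the physical distributor was not.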

Signed-off-by: Christoffer Dall 
---
 include/kvm/arm_vgic.h|  2 +-
 virt/kvm/arm/arch_timer.c | 28 +---
 virt/kvm/arm/vgic.c   | 37 +
 3 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 9c747cb..d2f4147 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -342,10 +342,10 @@ int kvm_vgic_inject_mapped_irq(struct kvm *kvm, int cpuid,
   struct irq_phys_map *map, bool level);
 void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg);
 int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);
-int kvm_vgic_vcpu_active_irq(struct kvm_vcpu *vcpu);
 struct irq_phys_map *kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu,
   int virt_irq, int irq);
 int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
+bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
 
 #define irqchip_in_kernel(k)   (!!((k)->arch.vgic.in_kernel))
 #define vgic_initialized(k)(!!((k)->arch.vgic.nr_cpus))
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 21a0ab2..69bca18 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -221,17 +221,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
kvm_timer_update_state(vcpu);
 
/*
-* If we enter the guest with the virtual input level to the VGIC
-* asserted, then we have already told the VGIC what we need to, and
-* we don't need to exit from the guest until the guest deactivates
-* the already injected interrupt, so therefore we should set the
-* hardware active state to prevent unnecessary exits from the guest.
-*
-* Conversely, if the virtual input level is deasserted, then always
-* clear the hardware active state to ensure that hardware interrupts
-* from the timer triggers a guest exit.
-*/
-   if (timer->irq.level)
+   * If we enter the guest with the virtual input level to the VGIC
+   * asserted, then we have already told the VGIC what we need to, and
+   * we don't need to exit from the guest until the guest deactivates
+   * the already injected interrupt, so therefore we should set the
+   * hardware active state to prevent unnecessary exits from the guest.
+   *
+   * Also, if we enter the guest with the virtual timer interrupt active,
+   * then it must be active on the physical distributor, because we set
+   * the HW bit and the guest must be able to deactivate the virtual and
+   * physical interrupt at the same time.
+   *
+   * Conversely, if the virtual input level is deasserted and the virtual
+   * interrupt is not active, then always clear the hardware active state
+   * to ensure that hardware interrupts from the timer triggers a guest
+   * exit.
+   */
+   if (timer->irq.level || kvm_vgic_map_is_active(vcpu, timer->map))
phys_active = true;
else
phys_active = false;
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 5335383..9002f0d 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1096,6 +1096,30 @@ static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu)
vgic_set_lr(vcpu, lr_nr, vlr);
 }
 
+static int dist_active_irq(struct kvm_vcpu *vcpu)
+{
+   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+
+   if (!irqchip_in_kernel(vcpu->kvm))
+   return 0;
+
+   return test_bit(vcpu->vcpu_id, dist->irq_active_on_cpu);
+}
+
+bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, struct irq_phys_map *map)
+{
+   int i;
+
+   for (i = 0; i < vcpu->arch.vgic_cpu.nr_lr; i++) {
+   struct vgic_lr vlr = vgic_get_lr(vcpu, i);
+
+   if (vlr.irq == map->virt_irq && vlr.state & LR_STATE_ACTIVE)
+   return true;
+   }
+
+   return dist_active_irq(vcpu);
+}
+
 /*
  * An interrupt may have been disabled after being made pending on the
  * CPU interface (the classic case is a timer running while we're
@@ -1248,7 +1272,7 @@ static void __kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu)
 * may have been serviced from another

Re: [PATCH 1/3] KVM: arm/arm64: Fix preemptible timer active state crazyness

2015-11-24 Thread Marc Zyngier

On Tue, 24 Nov 2015 16:43:58 +0100
Christoffer Dall  wrote:

> We were setting the physical active state on the GIC distributor in a
> preemptible section, which could cause us to set the active state on
> different physical CPU from the one we were actually going to run on,
> hacoc ensues.
> 
> Since we are no longer descheduling/scheduling soft timers in the
> flush/sync timer functions, simply moving the timer flush into a
> non-preemptible section.
> 
> Signed-off-by: Christoffer Dall 
> ---
>  arch/arm/kvm/arm.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index eab83b2..e06fd29 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -564,17 +564,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>   vcpu_sleep(vcpu);
>  
>   /*
> -  * Disarming the background timer must be done in a
> -  * preemptible context, as this call may sleep.
> -  */
> - kvm_timer_flush_hwstate(vcpu);
> -
> - /*
>* Preparing the interrupts to be injected also
>* involves poking the GIC, which must be done in a
>* non-preemptible context.
>*/
>   preempt_disable();
> + kvm_timer_flush_hwstate(vcpu);
>   kvm_vgic_flush_hwstate(vcpu);
>  
>   local_irq_disable();

Reviewed-by: Marc Zyngier 

M.
-- 
Jazz is not dead. It just smells funny.

Re: Trying to switch EPTP for execute-protecting guest pages

2015-11-24 Thread Estrada, Zachary J


On 11/24/2015 09:13 AM, Paolo Bonzini wrote:
> On 24/11/2015 15:51, Estrada, Zachary J wrote:
>> 2) Got it. Let's say I want to work with a copy of the extended page
>> tables instead of the original, what would be the best way to do so?
>
> Why would you want that?  It's difficult to give an answer without
> understanding what you're doing.  Notice that KVM pretty much always
> leaves the X bit set (__direct_map uses ACC_ALL for the pte_access
> parameter) so it's easy to go from your copy of the extended page tables
> to the original.

Reply sent offlist.

> I'm not sure if this is your problem, but perhaps you want to record in
> the role whether the page comes from your version or the original?  The
> role is like the hash key, if the role is the same you get the same PTE.

This is extremely helpful, I had not noticed this. I'm using my new root_hpa as
the base_role.word - does that make sense? I just tried it and I now seem to get
the EPT_VIOLATIONS that I was expecting but had been missing.


Thanks a ton, it appears that the role was exactly the thing I was looking for!
--Zak

Re: Trying to switch EPTP for execute-protecting guest pages

2015-11-24 Thread Paolo Bonzini

On 24/11/2015 16:52, Estrada, Zachary J wrote:
>> I'm not sure if this is your problem, but perhaps you want to record in
>> the role whether the page comes from your version or the original?  The
>> role is like the hash key, if the role is the same you get the same PTE.
>
> This is extremely helpful, I had not noticed this. I'm using my new
> root_hpa as the base_role.word - does that make sense? I just tried it
> and I now seem to get the EPT_VIOLATIONS that I was expecting but had
> been missing.

I think you should add a new bit to the role meaning "should I clear
some X bits?" :) that is computed based on the VCPU state.  For an
example see commit 699023e2 ("KVM: x86: add SMM to the MMU role, support
SMRAM address space"), which does

+   context->base_role.smm = is_smm(vcpu);

in init_kvm_tdp_mmu.  BTW, based on what you told me offlist, what you
are doing should also just work with shadow page tables.
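
To illustrate the "role is like the hash key" point outside the kernel, here is a minimal user-space sketch. It is not KVM code: `union page_role`, the `x_prot` bit, `get_page` and `make_role` are hypothetical names invented for the example, and the linear table stands in for whatever lookup structure the real MMU uses. The only thing it is meant to show is that two requests for the same gfn share a page exactly when their role words are equal, so an extra role bit yields a separate set of page tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an MMU page role. */
union page_role {
    uint32_t word;              /* whole role compared as one key */
    struct {
        uint32_t level  : 4;    /* page table level */
        uint32_t smm    : 1;    /* analogous to the bit added for SMM */
        uint32_t x_prot : 1;    /* hypothetical: "should I clear some X bits?" */
    };
};

struct shadow_page {
    uint64_t gfn;
    union page_role role;
    int in_use;
};

#define NPAGES 16
static struct shadow_page pages[NPAGES];

/* Build a role from scratch so that unused bits are always zero. */
static union page_role make_role(unsigned level, int x_prot)
{
    union page_role r = { .word = 0 };
    r.level = level & 0xf;
    r.x_prot = x_prot ? 1 : 0;
    return r;
}

/* Return the page for (gfn, role), allocating a slot on first use.
 * A page is reused only if both gfn and role.word match. */
static struct shadow_page *get_page(uint64_t gfn, union page_role role)
{
    struct shadow_page *free_slot = NULL;

    for (size_t i = 0; i < NPAGES; i++) {
        if (pages[i].in_use && pages[i].gfn == gfn &&
            pages[i].role.word == role.word)
            return &pages[i];
        if (!pages[i].in_use && !free_slot)
            free_slot = &pages[i];
    }
    if (free_slot) {
        free_slot->gfn = gfn;
        free_slot->role = role;
        free_slot->in_use = 1;
    }
    return free_slot;   /* NULL if the toy table is full */
}
```

In this sketch, `get_page(0x1000, make_role(1, 0))` and `get_page(0x1000, make_role(1, 1))` return different pages, while repeating the first call returns the original page again.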

Paolo

> Thanks a ton, it appears that the role was exactly the thing I was
> looking for!

